Module 12 Lesson 2: Handling Agent Failures

Resilient Autonomy: how to design agents that recover gracefully from API errors and tool failures.

Resilience: When Tools Fail

Agents depend on the real world, and real-world APIs go down, return 500 errors, or time out. A poorly designed agent crashes. A well-designed agent recognizes that the tool failed, then either tries a different approach or tells the user politely what went wrong.

1. Types of Agent Failures

  • Tool Error: The Lambda function crashed or timed out.
  • Hallucinated Tool: The agent tried to call a function that doesn't exist.
  • Logic Deadlock: The agent keeps calling the same tool and getting the same error (Looping).
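
The three failure types above can be told apart at the point where your own code dispatches a tool call. The sketch below is a minimal illustration, not part of any Bedrock API: `dispatch`, the status labels, and `repeat_limit` are hypothetical names chosen for this example.

```python
from collections import Counter


def dispatch(tools, name, args, history, repeat_limit=3):
    """Run one tool call and label the outcome with a failure type.

    tools:   dict mapping tool name -> callable (hypothetical registry)
    history: Counter of failed (tool, args) calls, shared across steps
    """
    # Hallucinated Tool: the agent asked for a name that is not registered.
    if name not in tools:
        return {
            "status": "hallucinated_tool",
            "message": f"No tool named '{name}'. Available: {sorted(tools)}",
        }

    call_key = (name, repr(sorted(args.items())))
    try:
        result = tools[name](**args)
    except Exception as exc:
        history[call_key] += 1
        # Logic Deadlock: the same call has failed repeat_limit times in a row.
        if history[call_key] >= repeat_limit:
            status = "logic_deadlock"
        else:
            # Tool Error: the tool itself crashed or timed out.
            status = "tool_error"
        return {"status": status, "message": f"{name} failed: {exc}"}

    history[call_key] = 0  # a success resets the failure streak
    return {"status": "ok", "result": result}
```

Labeling the failure this early lets the rest of the system pick a different response for each case: correct the tool name, retry or fall back, or stop the loop.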

2. Recovery Strategies

  • Max Iterations: In Bedrock, you can limit how many steps an agent can take (e.g., 5 steps max). This prevents expensive "infinite loops."
  • Error Feedback: If a tool fails, return a clear error message to the agent: "The weather API is currently offline. Please try a different source." This gives the agent's reasoning step something concrete to adapt to.
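
As a rough illustration of the Error Feedback strategy, here is a sketch of an action-group Lambda handler that catches the failure and hands a readable message back to the agent instead of crashing. The response shape (`messageVersion`, `functionResponse`, `responseBody.TEXT.body`, and the optional `responseState` value `REPROMPT`) follows my reading of the Bedrock Agents Lambda response format for function-based action groups; treat it as an assumption and check it against the current AWS documentation. `fetch_weather` and the `city` parameter are hypothetical.

```python
def fetch_weather(city):
    # Hypothetical stand-in for a real API call; it always fails here
    # so that the error path below is exercised.
    raise TimeoutError("weather API timed out")


def lambda_handler(event, context):
    # Parameters arrive as a list of {"name", "type", "value"} entries
    # (assumed event shape for function-based action groups).
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}

    try:
        body = fetch_weather(params.get("city", ""))
        function_response = {"responseBody": {"TEXT": {"body": body}}}
    except Exception as exc:
        # Return a clear, actionable message rather than letting the
        # function crash. REPROMPT (assumed from the docs) asks the agent
        # to pass this text back to the model.
        function_response = {
            "responseState": "REPROMPT",
            "responseBody": {
                "TEXT": {
                    "body": (
                        f"The weather API is currently offline ({exc}). "
                        "Please try a different source."
                    )
                }
            },
        }

    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": function_response,
        },
    }
```

The key design choice is that the exception never leaves the handler: the agent receives text it can reason about, not an opaque invocation error.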

3. Visualizing the Recovery

graph TD
    T[Thought: Call API] --> A[Action: Tool A]
    A -->|FAIL Error 500| O[Observation: Tool A is offline]
    O --> T2[Thought: Tool A is down. I will try Tool B instead.]
    T2 --> B[Action: Tool B]
    B -->|Success| Final[Goal Met]
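
The flow in the diagram, combined with a Max Iterations cap, can be simulated in a few lines. This is a toy sketch under stated assumptions: `choose_tool` is a hypothetical stand-in for the model's reasoning, `tool_a` and `tool_b` are made-up tools, and the loop is not how Bedrock orchestrates agents internally.

```python
def tool_a(query):
    # Simulates the failing tool from the diagram.
    raise RuntimeError("Error 500")


def tool_b(query):
    # Simulates the working fallback tool.
    return f"result for {query}"


TOOLS = {"tool_a": tool_a, "tool_b": tool_b}


def choose_tool(observations):
    """Stand-in for the agent's Thought step: skip tools seen failing."""
    failed = {o["tool"] for o in observations if not o["ok"]}
    for name in TOOLS:
        if name not in failed:
            return name
    return None  # every known tool has already failed


def run_agent(query, max_iterations=5):
    observations = []
    for _ in range(max_iterations):
        name = choose_tool(observations)          # Thought
        if name is None:
            return {"status": "failed", "answer": None, "trace": observations}
        try:
            answer = TOOLS[name](query)           # Action
        except Exception as exc:
            # Observation: record the failure so the next Thought can adapt.
            observations.append(
                {"tool": name, "ok": False, "detail": f"{name} is offline: {exc}"}
            )
            continue
        observations.append({"tool": name, "ok": True, "detail": answer})
        return {"status": "goal_met", "answer": answer, "trace": observations}
    # The cap was hit before the goal was met; stop instead of looping.
    return {"status": "max_iterations_reached", "answer": None, "trace": observations}
```

Calling `run_agent("flights")` tries `tool_a`, records the failure as an observation, and then succeeds with `tool_b`. Calling it with `max_iterations=1` stops after the first failure.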

4. Engineering Tip: Graceful Degradation

If your agent is performing three tasks and the third one fails, should it throw away the results of the first two?

  • Recommendation: Always design your agents to return a Partial Success.
  • "I found your flight, but the hotel booking system is down. I have saved the flight details for you."
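
One way to sketch a Partial Success result is to run each step independently and report what completed alongside what failed. The function and step names below (`run_steps`, `find_flight`, `book_hotel`) and the flight value are hypothetical, chosen only to mirror the flight and hotel example.

```python
def run_steps(steps):
    """Run each (name, action) pair; keep completed work even if a later step fails."""
    completed, failed = {}, {}
    for name, action in steps:
        try:
            completed[name] = action()
        except Exception as exc:
            failed[name] = str(exc)

    if not failed:
        status = "success"
    elif completed:
        status = "partial_success"
    else:
        status = "failure"
    return {"status": status, "completed": completed, "failed": failed}


def find_flight():
    return "Flight AB123"  # made-up value for illustration


def book_hotel():
    raise ConnectionError("hotel booking system is down")


result = run_steps([("flight", find_flight), ("hotel", book_hotel)])
print(result["status"])
```

The agent can then build its reply from `completed` and `failed`, so the user keeps the flight details even though the hotel step did not go through.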

Summary

  • Failure is a part of autonomous agent life.
  • Max Iterations prevent infinite loops and cost spikes.
  • Informative Error Messages help the agent's brain pivot to a new plan.
  • Partial Success is better than total failure for user satisfaction.
