Resilience: When Tools Fail

Agents depend on the real world, and real-world APIs go down, return 500 errors, or time out. A bad agent crashes. A great agent recognizes that the tool failed, then tries a different approach or politely tells the user what went wrong.

1. Types of Agent Failures

Tool Error: The Lambda function crashed or timed out.
Hallucinated Tool: The agent tried to call a function that doesn't exist.
Logic Deadlock: The agent keeps calling the same tool and getting the same error (a loop).
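
One way to make these three failure types visible to the agent is to have the tool dispatcher return a labeled observation instead of raising. The sketch below is a minimal illustration, not a Bedrock API: the `TOOLS` registry, the `run_tool` helper, the status labels, and the `repeat_limit` threshold are all hypothetical names chosen for this example.

```python
from collections import Counter
from typing import Callable, Dict, Tuple


def get_weather(args: dict) -> str:
    # Hypothetical tool that always times out, to exercise the failure paths.
    raise TimeoutError("upstream timed out")


# Hypothetical registry: tool name -> callable taking a dict of arguments.
TOOLS: Dict[str, Callable[[dict], str]] = {"get_weather": get_weather}


def run_tool(name: str, args: dict, failures: Counter,
             repeat_limit: int = 2) -> Tuple[str, str]:
    """Return (status, observation) instead of raising, so the agent can react."""
    if name not in TOOLS:
        # Hallucinated tool: the model asked for a function that is not registered.
        return "hallucinated_tool", f"Unknown tool '{name}'. Available: {sorted(TOOLS)}"

    key = (name, repr(sorted(args.items())))
    if failures[key] >= repeat_limit:
        # Logic deadlock: this exact call has already failed repeat_limit times.
        return "loop_detected", f"'{name}' already failed {failures[key]} times with these arguments."

    try:
        return "ok", TOOLS[name](args)
    except Exception as exc:
        # Tool error: the tool itself crashed or timed out.
        failures[key] += 1
        return "tool_error", f"{name} failed: {type(exc).__name__}: {exc}"
```

The caller keeps one `Counter` per conversation and passes it to every `run_tool` call, so repeated identical failures are caught before the tool is invoked again.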

2. Recovery Strategies

Max Iterations: In Bedrock, you can limit how many steps an agent can take (e.g., 5 steps max). This prevents expensive "infinite loops."
Error Feedback: If a tool fails, return a clear error message to the agent: "The weather API is currently offline. Please try a different source." This gives the agent's reasoning something concrete to act on, so it can adapt its plan.
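
The two strategies can be combined in a single control loop. The following is a framework-agnostic sketch and does not use Bedrock's own configuration: `run_agent`, `weather_tool`, and the `decide` callback (a stand-in for the model) are hypothetical, and the budget of 5 steps simply mirrors the example above.

```python
from typing import Callable, List

MAX_STEPS = 5  # step budget, matching the "5 steps max" example


def weather_tool(query: str) -> str:
    # Hypothetical tool that is always down, to exercise the failure path.
    raise ConnectionError("HTTP 500")


def run_agent(decide: Callable[[List[str]], str],
              max_steps: int = MAX_STEPS) -> str:
    """decide(history) stands in for the model: it returns 'CALL' or a final answer."""
    history: List[str] = []
    for _ in range(max_steps):
        action = decide(history)
        if action != "CALL":
            return action  # the model produced a final answer
        try:
            history.append("Observation: " + weather_tool("Oslo"))
        except Exception:
            # Error feedback: a clear, actionable message rather than a stack trace.
            history.append("Observation: The weather API is currently offline. "
                           "Please try a different source.")
    # Max iterations reached: stop instead of looping (and paying) forever.
    return f"Stopped after {max_steps} steps without reaching the goal."
```

A model that ignores the observation and keeps returning `"CALL"` is cut off after five steps, while one that reads the error message can answer on the second step.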

3. Visualizing the Recovery

graph TD
    T[Thought: Call API] --> A[Action: Tool A]
    A -->|FAIL Error 500| O[Observation: Tool A is offline]
    O --> T2[Thought: Tool A is down. I will try Tool B instead.]
    T2 --> B[Action: Tool B]
    B -->|Success| Final[Goal Met]
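
The flow in the diagram can be sketched as a simple fallback chain. In a real agent the model chooses the next tool from the observation; here the order is hard-coded for illustration, and `tool_a`, `tool_b`, and `call_with_fallback` are hypothetical names.

```python
from typing import Callable, List, Tuple


def tool_a(city: str) -> str:
    # Hypothetical primary source, currently failing.
    raise RuntimeError("Error 500")


def tool_b(city: str) -> str:
    # Hypothetical backup source.
    return f"Sunny in {city}"


def call_with_fallback(tools: List[Tuple[str, Callable[[str], str]]],
                       city: str) -> Tuple[str, List[str]]:
    """Try each tool in order, recording an observation for every attempt."""
    observations: List[str] = []
    for name, tool in tools:
        try:
            result = tool(city)
        except Exception as exc:
            # Observation: this tool is down, move on to the next one.
            observations.append(f"{name} is offline ({exc})")
            continue
        observations.append(f"{name} succeeded")
        return result, observations
    raise RuntimeError("All tools failed: " + "; ".join(observations))
```

Calling `call_with_fallback([("Tool A", tool_a), ("Tool B", tool_b)], "Oslo")` records the failure of Tool A as an observation and then returns the result from Tool B.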

4. Engineering Tip: Graceful Degradation

If your agent is doing three things and the third one fails, should it throw away the results of the first two?

Recommendation: Always design your agents to return a Partial Success.
"I found your flight, but the hotel booking system is down. I have saved the flight details for you."

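A partial success is easier to return if each step records its own outcome rather than aborting the whole task. The sketch below assumes a hypothetical `TripResult` container and two stand-in steps, `book_flight` and `book_hotel`; it shows one possible shape, not a prescribed design.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TripResult:
    completed: Dict[str, str] = field(default_factory=dict)  # step -> result
    failed: Dict[str, str] = field(default_factory=dict)     # step -> error text

    def summary(self) -> str:
        parts: List[str] = []
        if self.completed:
            parts.append("Completed: " + ", ".join(self.completed))
        if self.failed:
            parts.append("Unavailable: " + ", ".join(self.failed))
        return ". ".join(parts) + "."


def book_flight() -> str:
    return "flight details saved"  # hypothetical successful step


def book_hotel() -> str:
    raise ConnectionError("hotel booking system is down")  # hypothetical failure


def run_steps(steps: Dict[str, Callable[[], str]]) -> TripResult:
    """Run every step; keep what worked and record what did not."""
    result = TripResult()
    for name, step in steps.items():
        try:
            result.completed[name] = step()
        except Exception as exc:
            result.failed[name] = str(exc)
    return result
```

Running `run_steps({"flight": book_flight, "hotel": book_hotel})` keeps the flight result and reports the hotel failure, which gives the agent enough to compose a message like the one quoted above.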
Summary

Failure is a part of autonomous agent life.
Max Iterations prevent infinite loops and cost spikes.
Informative Error Messages help the agent's reasoning pivot to a new plan.
Partial Success is better than total failure for user satisfaction.

Module 12 Lesson 2: Handling Agent Failures