Module 12 Lesson 2: Handling Agent Failures
Resilient Autonomy. How to design agents that can recover from API errors and tool failures gracefully.
Resilience: When Tools Fail
Agents depend on the real world, and real-world APIs go down, return 500 errors, or time out. A brittle agent will crash. A well-designed agent recognizes that the tool failed and either tries a different approach or tells the user clearly what went wrong.
1. Types of Agent Failures
- Tool Error: The Lambda function crashed or timed out.
- Hallucinated Tool: The agent tried to call a function that doesn't exist.
- Logic Deadlock: The agent keeps calling the same tool and getting the same error (Looping).
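
The three failure types above can be told apart in the code that sits between the agent and its tools. The following is a minimal sketch, not a Bedrock API: `ToolDispatcher`, `FailureType`, `repeat_limit`, and `broken_weather` are hypothetical names chosen for illustration, and the looping check (same call, same error, seen `repeat_limit` times) is one assumed heuristic among many.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Dict, List, Tuple


class FailureType(Enum):
    TOOL_ERROR = "tool_error"
    HALLUCINATED_TOOL = "hallucinated_tool"
    LOGIC_DEADLOCK = "logic_deadlock"


@dataclass
class ToolDispatcher:
    """Hypothetical wrapper that labels each failed tool call."""
    tools: Dict[str, Callable[..., Any]]
    repeat_limit: int = 2  # assumed threshold for "the agent is looping"
    failed_calls: List[Tuple[str, str, str]] = field(default_factory=list)

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        # Hallucinated Tool: the requested name is not registered.
        if name not in self.tools:
            return {
                "ok": False,
                "failure": FailureType.HALLUCINATED_TOOL,
                "message": f"No tool named '{name}'. Available: {sorted(self.tools)}",
            }
        try:
            return {"ok": True, "result": self.tools[name](**kwargs)}
        except Exception as exc:
            # Record tool name, arguments, and error text so repeats can be detected.
            signature = (name, repr(sorted(kwargs.items())), str(exc))
            self.failed_calls.append(signature)
            repeated = self.failed_calls.count(signature) >= self.repeat_limit
            # Logic Deadlock: identical call failed the same way again.
            failure = FailureType.LOGIC_DEADLOCK if repeated else FailureType.TOOL_ERROR
            return {"ok": False, "failure": failure, "message": str(exc)}


def broken_weather(city: str) -> str:
    """Stand-in for a tool whose backend timed out."""
    raise TimeoutError("weather API timed out")


dispatcher = ToolDispatcher(tools={"get_weather": broken_weather})
```

Calling `dispatcher.call("get_stock_price")` is labeled a hallucinated tool, the first `dispatcher.call("get_weather", city="Oslo")` a tool error, and an identical second call a logic deadlock.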
2. Recovery Strategies
- Max Iterations: In Bedrock, you can limit how many steps an agent can take (e.g., 5 steps max). This prevents expensive "infinite loops."
- Error Feedback: If a tool fails, return a clear error message to the agent instead of a raw stack trace or an empty response: "The weather API is currently offline. Please try a different source." This gives the agent's reasoning step something concrete to adapt to.
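
Both strategies can be combined in a single loop. The sketch below is a generic illustration and does not use any Bedrock API: `run_agent`, `choose_action`, `MAX_ITERATIONS`, and the tool names are assumptions, with `choose_action` standing in for the model's reasoning step.

```python
from typing import Callable, Dict, List

MAX_ITERATIONS = 5  # assumed cap, matching the "5 steps max" example


def run_agent(
    choose_action: Callable[[List[str]], str],
    tools: Dict[str, Callable[[], str]],
    max_iterations: int = MAX_ITERATIONS,
) -> dict:
    """Run a capped loop; tool failures become observations, not crashes."""
    observations: List[str] = []
    for step in range(1, max_iterations + 1):
        action = choose_action(observations)  # stand-in for the model's reasoning
        if action == "finish":
            return {"status": "done", "steps": step, "observations": observations}
        try:
            observations.append(f"{action} returned: {tools[action]()}")
        except Exception as exc:
            # Error Feedback: describe the failure in plain language.
            observations.append(
                f"{action} failed: {exc}. Please try a different source."
            )
    # Max Iterations: stop instead of looping forever.
    return {"status": "stopped", "steps": max_iterations, "observations": observations}


def offline_weather() -> str:
    raise ConnectionError("The weather API is currently offline")


def backup_weather() -> str:
    return "18C and cloudy"


def pivoting_policy(observations: List[str]) -> str:
    """Toy policy: switch tools after a failure, finish after a success."""
    if not observations:
        return "weather_api"
    if "failed" in observations[-1]:
        return "backup_api"
    return "finish"


tools = {"weather_api": offline_weather, "backup_api": backup_weather}
result = run_agent(pivoting_policy, tools)
```

Here the toy policy reads the error observation and switches tools. A policy that ignores the feedback and keeps choosing `weather_api` would instead be cut off after five steps with `status` set to `"stopped"`.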
3. Visualizing the Recovery
graph TD
T[Thought: Call API] --> A[Action: Tool A]
A -->|FAIL Error 500| O[Observation: Tool A is offline]
O --> T2[Thought: Tool A is down. I will try Tool B instead.]
T2 --> B[Action: Tool B]
B -->|Success| Final[Goal Met]
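
The flow in the diagram can also be sketched in code. This is an illustrative example under stated assumptions: `call_with_fallback`, `tool_a`, and `tool_b` are hypothetical names, and the fallback order is fixed in advance here, whereas a real agent would pick Tool B through its own reasoning.

```python
from typing import Callable, List, Optional, Tuple


def tool_a(city: str) -> str:
    """Stand-in for Tool A, which is currently failing."""
    raise ConnectionError("Error 500")


def tool_b(city: str) -> str:
    """Stand-in for Tool B, a working alternative."""
    return f"Sunny in {city}"


def call_with_fallback(
    tools: List[Tuple[str, Callable[[str], str]]], city: str
) -> Tuple[Optional[str], List[str]]:
    """Try each tool in order and record an observation for every failure."""
    observations: List[str] = []
    for name, tool in tools:
        try:
            result = tool(city)
        except Exception as exc:
            observations.append(f"{name} is offline ({exc})")
            continue  # move on to the next tool
        return result, observations
    return None, observations  # every tool failed


answer, notes = call_with_fallback([("Tool A", tool_a), ("Tool B", tool_b)], "Oslo")
```

The returned observations preserve what went wrong with Tool A, so the final answer can mention the failure if that is useful to the user.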
4. Engineering Tip: Graceful Degradation
If your agent is performing three tasks and the third one fails, should it discard the results of the first two? Usually not.
- Recommendation: Design your agents to return a Partial Success whenever the completed steps are still useful on their own.
- Example: "I found your flight, but the hotel booking system is down. I have saved the flight details for you."
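
A partial-success result can be represented as a small structure that keeps completed work alongside the failures. The sketch below assumes the steps are independent of one another; `run_steps`, `find_flight`, and `book_hotel` are hypothetical names and the flight number is made-up sample data.

```python
from typing import Callable, Dict, List, Tuple


def find_flight() -> str:
    return "Flight XY123, 09:40"  # made-up sample data


def book_hotel() -> str:
    raise ConnectionError("hotel booking system is down")


def run_steps(steps: List[Tuple[str, Callable[[], str]]]) -> Dict[str, object]:
    """Run independent steps and keep whatever succeeded."""
    completed: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for name, step in steps:
        try:
            completed[name] = step()
        except Exception as exc:
            failed[name] = str(exc)  # keep the reason for the user-facing message
    if not failed:
        status = "success"
    elif completed:
        status = "partial_success"
    else:
        status = "failure"
    return {"status": status, "completed": completed, "failed": failed}


outcome = run_steps([("flight", find_flight), ("hotel", book_hotel)])
```

With `completed` and `failed` kept separate, the agent has what it needs to compose a reply like the example above. If later steps depend on earlier ones, the loop would need to stop at the first failure rather than continue.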
Summary
- Failure is a part of autonomous agent life.
- Max Iterations prevent infinite loops and cost spikes.
- Informative Error Messages help the agent's reasoning pivot to a new plan.
- Partial Success is better than total failure for user satisfaction.