Failing Gracefully: Error Recovery Strategies

Prepare for the worst-case scenario. Master the technical strategies for handling model timeouts, tool crashes, and unexpected agentic behavior in production.

Error Recovery Strategies

In a perfect world, your tools never crash, the LLM is always 100% accurate, and the internet never goes down. In the production world, everything fails.

The defining characteristic of a "Production" AI Engineer is their obsession with Defensive Design. You must assume that every component of your agent will fail at some point.

In this lesson, we will learn the strategies for detecting, containing, and recovering from errors in your agentic workflows.


1. Categorizing Agent Errors

Not all errors are created equal. We must handle them differently based on their root cause.

| Error Type  | Example                                        | Recovery Strategy                                                      |
|-------------|------------------------------------------------|------------------------------------------------------------------------|
| Model Error | Rate limit, timeout, safety filter             | Retry with exponential backoff or switch models                        |
| Tool Error  | API 404, database timeout, invalid credentials | Feed the error back to the LLM for correction, or flag it for a human  |
| State Error | State object too large, serialization failure  | Prune history or trim metadata                                         |
| Logic Error | Infinite loop, hallucination                   | Max recursion limits / verification nodes                              |
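
The "retry with exponential backoff" entry in the first row is worth a concrete sketch before we move on. The helper below is illustrative only: call_model stands in for whatever API call your node makes, and in practice you would catch your provider's specific rate-limit and timeout exceptions rather than a bare Exception.

import random
import time

def call_with_backoff(call_model, prompt, max_retries=5):
    """Retry a flaky model call, doubling the wait between attempts."""
    for attempt in range(max_retries):
        try:
            return call_model(prompt)
        except Exception:  # in practice: your provider's RateLimitError / Timeout
            if attempt == max_retries - 1:
                raise  # out of retries: let the error escalate to the next strategy
            # Wait 1s, 2s, 4s... plus jitter so parallel agents don't retry in lockstep
            time.sleep(2 ** attempt + random.random())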

2. Strategy: The "Model Switcher" Fallback

What if OpenAI is down? If your agent is mission-critical, you cannot simply say "Sorry, try again later."

The Implementation

  1. The primary node uses GPT-4o.
  2. The code catches an openai.APIConnectionError.
  3. The node automatically switches to an "Emergency" Claude 3.5 Sonnet instance.
  4. The agent continues as if nothing happened.
from openai import APIConnectionError
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

openai_model = ChatOpenAI(model="gpt-4o")
anthropic_model = ChatAnthropic(model="claude-3-5-sonnet-latest")

def safe_llm_call(prompt):
    try:
        return openai_model.invoke(prompt)
    except APIConnectionError:
        # Emergency fallback to a different provider
        return anthropic_model.invoke(prompt)
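
This sketch assumes both models are wrapped by a framework such as LangChain, so they expose the same .invoke() signature; with the raw provider SDKs you would need a thin adapter around each call, because their request and response formats differ.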

3. Strategy: The "Defensive Observation"

When a tool fails, it often returns a messy stack trace or a raw JSON error. Feeding that raw output straight to the LLM tends to overwhelm it rather than help it recover.

The Transformation Layer

Create a function that "Cleans" errors before the LLM sees them.

  • Raw Error: ERROR: 'User' object has no attribute 'email' at line 45 of db_connector.py
  • Cleaned Observation: The database lookup failed because the 'User' record exists but does not have an 'email' field. Please try searching for 'username' instead.

Result: You are "Guiding" the agent toward the fix.
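
Here is a minimal sketch of such a transformation layer, sitting between the tool and the LLM. The specific string checks are illustrative assumptions; in a real agent you would encode the known failure modes of your own tools.

def clean_observation(raw_error: str) -> str:
    """Translate raw tool errors into short, actionable observations for the LLM."""
    if "'User' object has no attribute 'email'" in raw_error:
        return (
            "The database lookup failed because the 'User' record exists "
            "but does not have an 'email' field. Try searching by 'username' instead."
        )
    if "timeout" in raw_error.lower():
        return "The database did not respond in time. Wait a moment and retry the query once."
    # Unknown failure: keep only the first line so the LLM isn't flooded with a stack trace
    first_line = raw_error.splitlines()[0] if raw_error.strip() else "an unknown error"
    return "The tool failed with: " + first_line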


4. Strategy: The "Zombie Agent" Killer

Sometimes an agent doesn't "Crash," but it gets lost. It starts repeating the same search or giving the same wrong answer.

Heartbeat and Progress Monitoring

Monitor the Change in State.

  • If state["last_thought"] is identical to state["thought_minus_two"], the agent is stuck in a loop (see the routing sketch below).
  • Recovery: Transition the graph to a Correction_Node that says: "You are stuck. Stop searching and try a completely different strategy, or ask a human for help."
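
A minimal sketch of that progress check as a routing function. The state keys mirror the ones above; how you populate them, and the node names "correction_node" and "continue", are assumptions about your own graph.

def route_on_progress(state: dict) -> str:
    """Detect a stuck agent by comparing its latest thought to the one two steps back."""
    last = state.get("last_thought")
    two_back = state.get("thought_minus_two")
    if last is not None and last == two_back:
        return "correction_node"  # inject the "You are stuck..." instruction
    return "continue"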

5. Strategy: Graceful Degradation

If a "Bonus" tool fails (e.g., a tool that generates a cute avatar for a user), don't kill the entire onboarding agent.

Soft Dependencies

Design your nodes so that some failures are "Non-Fatal."

  • Code: try...except...log_error_but_continue (see the sketch below).
  • Logic: The agent acknowledges the failure but proceeds with the primary task.
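
A minimal sketch of a node with a soft dependency. create_account and generate_avatar are hypothetical helpers standing in for your primary and bonus tools.

import logging

logger = logging.getLogger("onboarding_agent")

def onboarding_node(state: dict) -> dict:
    # Primary task: must succeed, so any failure here propagates to the graph's error handling
    state["account"] = create_account(state["user_details"])
    # Bonus task: non-fatal, so log the failure and continue with the primary flow
    try:
        state["avatar_url"] = generate_avatar(state["user_details"])
    except Exception as err:
        logger.warning("Avatar generation failed, continuing without it: %s", err)
        state["avatar_url"] = None
    return state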

6. Real-World Recovery Flow: The Payment Agent

  1. Step: Agent tries to call Charge_Card.
  2. Error: Insufficient Funds.
  3. Recovery (sketched in code after this list):
    • Agent searches for a backup payment method in the DB.
    • If found, it retries with the new method.
    • If not found, it sends a message to the user: "Transaction failed. Please add a new card."
    • Crucial: The agent stays in context and waits for the user to add the card, then resumes.
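
Here is the same flow as a sketch. charge_card, find_backup_payment_method, ask_user, and InsufficientFundsError are all hypothetical placeholders for your own payment tooling.

def payment_node(state: dict) -> dict:
    # charge_card, find_backup_payment_method, and ask_user are hypothetical helpers
    try:
        state["receipt"] = charge_card(state["card"], state["amount"])
    except InsufficientFundsError:
        backup = find_backup_payment_method(state["user_id"])
        if backup:
            # Retry once with the backup payment method
            state["receipt"] = charge_card(backup, state["amount"])
        else:
            # Stay in context: tell the user, flag the state, and resume once a card is added
            ask_user("Transaction failed. Please add a new card.")
            state["awaiting_new_card"] = True
    return state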

Summary and Mental Model

Think of Error Recovery as Automotive Safety Features.

  • Retries are like ABS brakes: they pump repeatedly instead of locking up, so one slip doesn't turn into a skid.
  • Fallbacks are like a spare tire: slower, but they keep you moving.
  • Guardrails are like airbags: they prevent total destruction when a crash is unavoidable.

You don't build an agent that never fails; you build an agent that never gives up.


Exercise: Failure Scenario

  1. Scenario: You are building an agent that scrapes news from the web. The website you are scraping has blocked your IP.
    • How would you design a "Proxy Switcher" recovery node?
    • How would you feed this "Blocking" info back to the LLM to change its search strategy?
  2. Technical: Why is pydantic.ValidationError the most common and "Best" error to catch in an agentic loop?
  3. Architecture: At which point in the "Lifecycle" (Module 4.3) would you put the "Model Switcher" logic: in the Node itself, or in a separate wrapper?

Ready to see how all this structure fits together? In Module 6, we dive into LangGraph properly.
