Retry and Fallback: Building Resilient Agents

In a typical Python script, a failed API call (for example, an HTTP 500 or 429 response) usually stops the program. In an agentic system, Retry and Fallback Nodes keep the agent alive by trying again, taking a different path, or switching to a different model.

1. The Retry Pattern (In-Node)

Sometimes a call fails for a reason that clears up on its own, and one more try is enough. You can wrap your LLM call in a retry decorator from a library such as tenacity.

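from tenacity import retry, stop_after_attempt, wait_exponential

# Retry up to 3 times, waiting longer between attempts; re-raise the last error if all fail.
# `llm` is assumed to be an already-configured chat model client.
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10), reraise=True)
def call_llm(messages):
    return llm.invoke(messages)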
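
For intuition about what a retry decorator does under the hood, here is a minimal stdlib-only sketch with exponential backoff. It retries only on errors assumed to be transient. `TransientError`, `retry_with_backoff`, and `flaky_call` are hypothetical names for illustration, not part of tenacity or any LLM SDK.

```python
import time
from functools import wraps


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as HTTP 429 or 500."""


def retry_with_backoff(max_attempts=3, base_delay=0.01):
    """Retry the wrapped function on TransientError, doubling the wait each time."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except TransientError:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the caller decide what to do
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


calls = {"count": 0}


@retry_with_backoff(max_attempts=3)
def flaky_call(prompt):
    """Fails twice, then succeeds, to simulate a transient glitch."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated 429")
    return f"answer to: {prompt}"


result = flaky_call("hello")
```

Other exception types pass straight through without a retry, which keeps genuine bugs (a malformed request, a bad API key) from being retried pointlessly.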

2. The Fallback Pattern (Graph-Based)

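What if GPT-4o is down globally? Or what if your Search API is rate-limited? Retrying the same call does not help in those cases, so we add a Secondary Node that the graph routes to when the primary node fails.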

graph TD
    Primary[Primary: GPT-4o Node] --> Success{Success?}
    Success -- No --> Secondary[Fallback: Llama 3 Node]
    Success -- Yes --> End[Finish]
    Secondary --> End
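
In code, the "Success?" diamond is a routing function that inspects the state after the primary node runs. The following is a minimal sketch in plain Python, assuming the primary node records failures in a hypothetical `error` field instead of raising. `primary_node`, `fallback_node`, `route_after_primary`, and `run_graph` are illustrative names, and the model calls are stubbed so the example runs on its own.

```python
def primary_node(state):
    """Try the primary model; on failure, record the error instead of raising."""
    try:
        raise ConnectionError("primary model unavailable")  # simulated outage
    except ConnectionError as exc:
        return {**state, "error": str(exc)}


def fallback_node(state):
    """Answer with the backup model and clear the error flag."""
    return {**state, "answer": "fallback answer", "error": None}


def route_after_primary(state):
    """The 'Success?' diamond: choose the next node from the state."""
    return "fallback" if state.get("error") else "end"


def run_graph(state):
    """Hand-rolled stand-in for a graph runner with one conditional edge."""
    state = primary_node(state)
    if route_after_primary(state) == "fallback":
        state = fallback_node(state)
    return state


final_state = run_graph({"messages": ["What is 2 + 2?"]})
```

In a graph framework, `route_after_primary` would typically be attached as a conditional edge rather than called by hand.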

3. Handling Tool Failures

When a tool fails (e.g., a specific Python library is missing or a website is blocking you), don't return an empty string. Return an instruction for the LLM.

Tool Failure Return: "Error: The Wikipedia API is currently unavailable. Please use the Google Search tool instead."

By sending this back to the "Brain," the agent can pivot its strategy rather than crashing.
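
One way to apply this consistently is to wrap each tool so that any exception becomes an instruction string. This is a sketch under assumptions: `safe_tool` and `wikipedia_lookup` are hypothetical names, and the lookup is hard-coded to fail for the demonstration.

```python
def safe_tool(tool_fn, tool_name, alternative):
    """Wrap a tool so failures come back as instructions, not exceptions."""
    def wrapped(query):
        try:
            return tool_fn(query)
        except Exception as exc:  # broad on purpose: any tool failure becomes feedback
            return (
                f"Error: The {tool_name} tool failed ({exc}). "
                f"Please use the {alternative} tool instead."
            )
    return wrapped


def wikipedia_lookup(query):
    """Hypothetical tool, hard-coded to fail for the demo."""
    raise TimeoutError("API is currently unavailable")


wikipedia_tool = safe_tool(wikipedia_lookup, "Wikipedia", "Google Search")
observation = wikipedia_tool("Alan Turing")
```

The returned `observation` is what goes back to the LLM as the tool result, so the model sees both what went wrong and what to try next.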

4. Visualizing the "Safe Path"

graph LR
    START --> Main[Main Task]
    Main --> Check{Valid?}
    Check -- Yes --> Final[End]
    Check -- No --> Correction[Self-Correction Node]
    Correction --> Check
    Check -- Fail x3 --> Fallback[Simple Backup Answer]
    Fallback --> Final
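
The same loop can be sketched in plain Python. This is an illustration, not a framework API: `is_valid`, `run_with_self_correction`, and the two lambdas standing in for the Main Task and Self-Correction nodes are hypothetical, and the validity check (digits only) is a placeholder for whatever check your task needs.

```python
MAX_ATTEMPTS = 3
BACKUP_ANSWER = "Sorry, I could not produce a reliable answer."


def is_valid(answer):
    """Hypothetical check: accept only answers made of digits."""
    return answer.isdigit()


def run_with_self_correction(generate, correct):
    """Validate the answer; correct and re-check until it passes or fails 3 times."""
    answer = generate()  # Main Task
    failures = 0
    while not is_valid(answer):  # Valid? -- No
        failures += 1
        if failures >= MAX_ATTEMPTS:
            return BACKUP_ANSWER  # Fail x3 -> Simple Backup Answer
        answer = correct(answer)  # Self-Correction Node
    return answer  # Valid? -- Yes


# One correction is enough here.
fixed = run_with_self_correction(lambda: "four", lambda bad: "4")

# The correction never helps, so the loop gives up after three failed checks.
gave_up = run_with_self_correction(lambda: "four", lambda bad: bad)
```

The attempt cap is what keeps the Correction and Check nodes from cycling forever.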

5. Implementation Strategy: The "Wrapper" Node

Create a node that handles the try/except logic for you:

# Assumes State, gpt_llm, and local_llm are defined elsewhere in your graph setup.
def robust_agent_node(state: State):
    try:
        # Try the expensive, smart model first
        return {"messages": [gpt_llm.invoke(state["messages"])]}
    except Exception as exc:
        # Fall back to the cheaper, local model (Ollama)
        print(f"--- Primary model failed ({exc}). Falling back to local Llama 3 ---")
        return {"messages": [local_llm.invoke(state["messages"])]}
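
If you have more than one backup, the same idea extends to an ordered list of models. The sketch below assumes each model is a plain callable. `invoke_with_fallbacks`, `broken_primary`, and `local_backup` are hypothetical names, and the `model_used` key is an illustrative addition to the state update.

```python
import logging

logger = logging.getLogger("agent")


def invoke_with_fallbacks(models, messages):
    """Try each (name, callable) pair in order; return the first success."""
    last_error = None
    for name, invoke in models:
        try:
            return {"messages": [invoke(messages)], "model_used": name}
        except Exception as exc:
            logger.warning("%s failed (%s); trying next model", name, exc)
            last_error = exc
    # Every model failed: surface the last error instead of hiding it.
    raise RuntimeError("all models failed") from last_error


def broken_primary(messages):
    """Stub for the primary model, hard-coded to fail."""
    raise ConnectionError("simulated outage")


def local_backup(messages):
    """Stub for the local model."""
    return f"local reply to {len(messages)} message(s)"


update = invoke_with_fallbacks(
    [("primary", broken_primary), ("local", local_backup)],
    ["Hi"],
)
```

Logging each failure, rather than silently swallowing it, makes it possible to tell later how often the agent is running on its backup model.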

Key Takeaways

Failures are inevitable in external AI environments.
Retry is for transient errors (random glitches, brief rate limits).
Fallback is for persistent errors (an API outage, or a rate limit that outlasts your retries).
Feedback to the brain is the best way to handle tool-specific errors.
Resilience makes your agent look "Professional" rather than "Broken."

Module 7 Lesson 3: Retry and Fallback Nodes