Module 7 Lesson 3: Retry and Fallback Nodes
Resilience by design. How to handle tool failures and rate limits within the agent graph.
Retry and Fallback: Building Resilient Agents
In a typical Python script, a failed API call (HTTP 500 or 429) raises an exception that, if unhandled, stops the program. In an agentic system, we can use Retry and Fallback Nodes to keep the agent alive by retrying the call, taking a different path, or switching to a different model.
1. The Retry Pattern (In-Node)
Sometimes, the model just needs one more try: timeouts and brief rate limiting often clear up on a second attempt. You can wrap your LLM call in a retry decorator from a library such as tenacity.
from tenacity import retry, stop_after_attempt

@retry(stop=stop_after_attempt(3))
def call_llm(messages):
    # Give up after three attempts; `llm` is assumed to be a chat model defined elsewhere
    return llm.invoke(messages)
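To make the mechanics explicit, here is a minimal standard-library sketch of what a retry decorator does, with an exponential pause between attempts so that a rate-limited API is not hit again immediately. The names `retry_with_backoff`, `TransientError`, and `flaky_call` are hypothetical and only simulate an LLM client; this is an illustration under those assumptions, not tenacity's implementation.

```python
import time
from functools import wraps


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as HTTP 429 or 500."""


def retry_with_backoff(max_attempts=3, base_delay=0.01):
    """Retry the wrapped function on TransientError, doubling the delay each time."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except TransientError:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the caller decide what to do
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


calls = {"count": 0}


@retry_with_backoff(max_attempts=3)
def flaky_call(prompt):
    """Simulated LLM call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated rate limit")
    return f"answer to: {prompt}"


result = flaky_call("hello")
```

Only `TransientError` is retried here; any other exception propagates immediately, which keeps genuine bugs from being retried silently.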
2. The Fallback Pattern (Graph-Based)
What if GPT-4o is down globally? Or what if your Search API is rate-limited for the next hour? Retrying the same call will not help, so we create a Secondary Node and route to it when the primary fails.
graph TD
    Primary[Primary: GPT-4o Node] --> Success{Success?}
    Success -- No --> Secondary[Fallback: Llama 3 Node]
    Success -- Yes --> End[Finish]
    Secondary --> End
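The diagram above can be sketched in plain Python as two nodes plus a routing function. Everything here is an assumption for illustration: the state is a plain dict with `messages` and `error` keys, the models are ordinary callables, and `run_graph` is a hand-written stand-in for whatever graph framework you use (where the router would typically be registered as a conditional edge).

```python
def make_primary_node(llm):
    """Build a node that records a failure in state instead of raising."""
    def primary_node(state):
        try:
            reply = llm(state["messages"])
            return {"messages": state["messages"] + [reply], "error": None}
        except Exception as exc:  # broad on purpose in this sketch
            return {"messages": state["messages"], "error": str(exc)}
    return primary_node


def make_fallback_node(llm):
    """Build the secondary node, which answers with the backup model."""
    def fallback_node(state):
        reply = llm(state["messages"])
        return {"messages": state["messages"] + [reply], "error": None}
    return fallback_node


def route_after_primary(state):
    """The 'Success?' diamond: choose the next node from the state."""
    return "fallback" if state["error"] else "end"


def run_graph(state, primary_node, fallback_node):
    """Hand-rolled runner for the two-node graph."""
    state = primary_node(state)
    if route_after_primary(state) == "fallback":
        state = fallback_node(state)
    return state


def broken_model(messages):
    """Simulated primary model that is down."""
    raise RuntimeError("service unavailable")


def backup_model(messages):
    """Simulated secondary model."""
    return "backup reply"


final_state = run_graph(
    {"messages": ["hi"], "error": None},
    make_primary_node(broken_model),
    make_fallback_node(backup_model),
)
```

Keeping the failure in state rather than in an exception is what lets the graph, not the node, decide where to go next.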
3. Handling Tool Failures
When a tool fails (e.g., a specific Python library is missing or a website is blocking you), don't return an empty string. Return an instruction for the LLM.
Tool Failure Return:
"Error: The Wikipedia API is currently unavailable. Please use the Google Search tool instead."
By sending this back to the "Brain," the agent can pivot its strategy rather than crashing.
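One way to get this behaviour is to wrap each tool so that an exception becomes an instruction string. The sketch below is a minimal illustration; `safe_tool`, `wikipedia_lookup`, and the tool names passed in are hypothetical and not part of any specific agent library.

```python
def wikipedia_lookup(query):
    """Hypothetical tool that is currently failing."""
    raise ConnectionError("HTTP 503 from upstream")


def safe_tool(tool, name, alternative):
    """Wrap a tool so that failures come back as an instruction for the LLM."""
    def wrapped(*args, **kwargs):
        try:
            return tool(*args, **kwargs)
        except Exception as exc:
            # Tell the model what failed and what it can try next
            return (
                f"Error: The {name} tool failed ({type(exc).__name__}: {exc}). "
                f"Please use the {alternative} tool instead."
            )
    return wrapped


safe_wikipedia = safe_tool(wikipedia_lookup, "Wikipedia", "Google Search")
observation = safe_wikipedia("retry pattern")
```

The returned string goes back to the model as the tool's observation, so the next reasoning step sees both the failure and a suggested alternative.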
4. Visualizing the "Safe Path"
graph LR
    START --> Main[Main Task]
    Main --> Check{Valid?}
    Check -- Yes --> Final[End]
    Check -- No --> Correction[Self-Correction Node]
    Correction --> Check
    Check -- Fail x3 --> Fallback[Simple Backup Answer]
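The loop in this diagram can be sketched as a bounded validate-and-correct cycle. The function `run_safe_path` and the `generate`, `validate`, and `correct` callables are hypothetical placeholders for your own nodes; the limit of three failed checks is taken from the diagram.

```python
MAX_ATTEMPTS = 3  # matches "Fail x3" in the diagram


def run_safe_path(generate, validate, correct, backup_answer):
    """Return a validated draft, or the backup answer after three failed checks."""
    draft = generate()
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if validate(draft):
            return draft
        if attempt < MAX_ATTEMPTS:
            draft = correct(draft)  # self-correction, then loop back to the check
    return backup_answer


def generate():
    """Simulated main task that produces a slightly malformed answer."""
    return "42?"


def validate(draft):
    """Accept only purely numeric answers."""
    return draft.isdigit()


def correct(draft):
    """Simulated self-correction step."""
    return draft.rstrip("?")


answer = run_safe_path(generate, validate, correct, "I could not produce a valid answer.")
```

The explicit attempt counter is what prevents the Correction and Check nodes from cycling forever when the model cannot fix its own output.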
5. Implementation Strategy: The "Wrapper" Node
Create a node that handles the try/except logic for you:
def robust_agent_node(state: State):
    try:
        # Try the expensive, smart model
        return {"messages": [gpt_llm.invoke(state["messages"])]}
    except Exception:
        # Fall back to the cheaper, local model (Ollama)
        print("--- GPT-4 Failed. Falling back to local Llama3 ---")
        return {"messages": [local_llm.invoke(state["messages"])]}
Key Takeaways
- Failures are inevitable in external AI environments.
- Retry is for transient errors (timeouts, brief rate limits, random glitches).
- Fallback is for persistent errors that retries cannot fix (API outage, sustained rate limiting).
- Feedback to the brain is the best way to handle tool-specific errors.
- Resilience makes your agent look "Professional" rather than "Broken."