Agent Safety: Guardrails and Failure Handling

Build resilient agentic systems that fail gracefully. Master the patterns for output validation, prompt injection protection, and automated error recovery.

Guardrails and Failure Handling

If a Python script fails, it throws an error and stops. If an AI agent fails, it might keep running: hallucinating new tasks, spending your money, and giving the user confidently incorrect information. This is why Guardrails are not optional; they are the definition of production quality.

In this lesson, we will explore how to detect failure in an agent and how to build "Safety Nets" that catch the agent before it causes harm.


1. The Three Layers of Guardrails

To build a secure system, you need defense in depth.

Layer 1: The Input Guardrail (Prevention)

  • PII Filtering: Stop the agent from seeing things like Social Security Numbers.
  • Prompt Injection Detection: Identify when a user tries to say "Ignore your previous instructions."
  • Tools: LlamaGuard or AWS Bedrock Guardrails.
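
Here is a minimal sketch of Layer 1, assuming simple regex and keyword heuristics; in production you would lean on a dedicated classifier like LlamaGuard rather than pattern matching alone.

import re

# Minimal, illustrative input guardrail. The pattern and phrase list are
# assumptions for this sketch; real systems use a trained safety classifier.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
INJECTION_PHRASES = ("ignore your previous instructions", "ignore all previous instructions")

def sanitize_input(user_message: str) -> str:
    """Reject obvious injection attempts and redact SSN-like strings."""
    if any(phrase in user_message.lower() for phrase in INJECTION_PHRASES):
        raise ValueError("Input rejected: possible prompt injection.")
    return SSN_PATTERN.sub("[REDACTED_SSN]", user_message)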

Layer 2: The Logic Guardrail (Validation)

  • Constraint Check: "The user only has a 'Silver' plan. The agent is trying to use a 'Gold' tool."
  • Max Steps: As covered in Lesson 3, stop the agent after N loops.
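
A minimal sketch of Layer 2, assuming a hypothetical tool-to-plan mapping and a step counter maintained by your agent loop:

MAX_STEPS = 10  # hard ceiling on loop iterations (see Lesson 3)

# Hypothetical mapping of tools to the plan tier they require
TOOL_REQUIRED_PLAN = {"basic_lookup": "Silver", "premium_report": "Gold"}
PLAN_RANK = {"Silver": 1, "Gold": 2}

def check_step(step_count: int, tool_name: str, user_plan: str) -> None:
    """Stop runaway loops and block tools the user's plan does not allow."""
    if step_count >= MAX_STEPS:
        raise RuntimeError("Agent exceeded the maximum number of steps.")
    required = TOOL_REQUIRED_PLAN.get(tool_name, "Silver")
    if PLAN_RANK[user_plan] < PLAN_RANK[required]:
        raise PermissionError(f"Tool '{tool_name}' requires the {required} plan.")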

Layer 3: The Output Guardrail (Recovery)

  • Hallucination Check: Using a second LLM to verify if the output is supported by the context.
  • JSON Validation: Ensuring the agent's output is actually valid JSON that your app can parse.
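
One common way to implement the hallucination check is an LLM-as-judge call. The sketch below uses LangChain's ChatOpenAI; the model choice, prompt wording, and SUPPORTED/UNSUPPORTED convention are illustrative assumptions.

from langchain_openai import ChatOpenAI

# A second model acting as a judge; model name and prompt are illustrative.
judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def is_grounded(answer: str, context: str) -> bool:
    """Ask a second LLM whether the draft answer is supported by the context."""
    verdict = judge.invoke(
        "Reply with exactly SUPPORTED or UNSUPPORTED.\n\n"
        f"Context:\n{context}\n\nClaim:\n{answer}"
    )
    return verdict.content.strip().upper().startswith("SUPPORTED")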

2. Handling Failure: The "Repair" Pattern

When a model fails (e.g., suggests a tool that doesn't exist), you have two choices: Crash or Repair.

The Repair Cycle

  1. The Error: Model calls search_db(customer_name="John") but the tool requires customer_id.
  2. The Catch: The code catches the TypeError.
  3. The Feedback: Instead of crashing, you send a new message to the LLM:

    "ERROR: Tool 'search_db' requires 'customer_id'. You provided 'customer_name'. Please correct your input."

  4. The Fix: The LLM realizes the mistake, searches for the ID first, and then calls the tool again.
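
In code, the repair cycle is just a try/except around the tool call that appends the error to the conversation instead of raising it to the user. The search_db stub and message format below are assumptions for the sketch.

def search_db(customer_id: str) -> dict:
    """Hypothetical tool: look up a customer record by ID."""
    return {"customer_id": customer_id, "status": "found"}

def execute_tool_call(arguments: dict, messages: list) -> None:
    """Run the tool; on a bad call, feed the error back so the model can retry."""
    try:
        result = search_db(**arguments)
        messages.append({"role": "tool", "content": str(result)})
    except TypeError as exc:
        # The Repair step: describe the mistake instead of crashing the loop
        messages.append({
            "role": "tool",
            "content": f"ERROR: {exc}. Check the required parameters for "
                       "'search_db' and call it again with corrected input.",
        })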

3. Detecting Hallucinations: Grounding the Agent

A common failure is an agent that claims it performed a task when it didn't.

  • User: "Did you send the email?"
  • Agent: "Yes, I sent it!" (But it never called the send_email tool).

The "Evidence" Pattern

In your system prompt, require the agent to provide Action Evidence.

  • "Every time you claim a task is done, you must cite the Tool Execution ID."

4. Structured Output: The "Pydantic" Guardrail

In production, we never trust raw strings from an LLM. We use Pydantic to enforce a schema.

from pydantic import BaseModel, Field

class FinalAnswer(BaseModel):
    answer: str = Field(description="The final response to the user")
    source_ids: list[str] = Field(description="List of IDs for the documents used")
    confidence: float = Field(ge=0, le=1)

# LangChain's 'with_structured_output' forces the model to follow this
structured_llm = llm.with_structured_output(FinalAnswer)

If the model fails to provide a valid confidence value, the system throws a validation error before the user ever sees the output, allowing you to retry or fall back.
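
If you handle validation yourself instead of relying on with_structured_output, the retry path might look like the sketch below, assuming Pydantic v2 and a hypothetical retry_fn callback that re-prompts the model.

from pydantic import ValidationError

def parse_with_retry(raw_json: str, retry_fn, max_retries: int = 1) -> FinalAnswer:
    """Validate the model's raw JSON; on failure, ask the model to correct it."""
    for attempt in range(max_retries + 1):
        try:
            return FinalAnswer.model_validate_json(raw_json)
        except ValidationError as exc:
            if attempt == max_retries:
                raise  # fall back to a human or a canned response
            # Feed the validation errors back so the model can fix its output
            raw_json = retry_fn(
                f"Your JSON failed validation: {exc.errors()}. "
                "Return corrected JSON only."
            )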


5. Responsible Disclosure: Failing with Style

If an agent is truly stuck, it must admit it.

  • Avoid: "Something went wrong."
  • Better: "I attempted to find that file 3 times, but the server is not responding. I have saved my progress; should I try again in 5 minutes, or would you like to speak to a human?"


6. Security Guardrails: The Sandbox

We will go deep into this in Module 7, but the ultimate guardrail for an agent that writes code is a Container.

  • Even if the agent tries to run rm -rf /, it only deletes files inside its temporary, isolated container.
  • Conclusion: Safety is as much about infrastructure as it is about prompting.
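
As a preview of Module 7, a minimal sandbox wrapper (assuming Docker is installed; a real setup adds read-only filesystems, seccomp profiles, and per-run images) can be as simple as:

import subprocess

def run_in_sandbox(code: str, timeout: int = 30) -> str:
    """Execute agent-generated Python inside a throwaway, isolated container."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",   # no network access
            "--memory", "256m",    # cap memory
            "--cpus", "0.5",       # cap CPU
            "python:3.12-slim",
            "python", "-c", code,
        ],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout if result.returncode == 0 else result.stderr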

Summary and Mental Model

Think of Guardrails as the bumpers on a bowling lane. The AI is the bowling ball. It wants to go down the lane, but it might drift into the gutter (hallucination, infinite loop, security breach). The guardrails don't throw the ball for the agent, but they keep it on the path toward the pins.

The more "Autonomous" your agent, the "Taller" your guardrails must be.


Exercise: Failure Analysis

  1. The Feedback Loop: An agent is trying to search a database but the database password is wrong.
    • Should you tell the Agent the password is wrong? (Security Risk!).
    • What is a safe error message to give the agent so it can "Fail gracefully"?
  2. Pydantic Design: Create a Pydantic schema for an agent that recommends a travel itinerary. What fields would you make "Required" to prevent a vague or useless answer?
  3. The Human-in-the-Loop: At which point in a "Real Estate" agent's flow would you force a human review? (Initial search? Sending the contract? Signing the contract?)
