Testing and Debugging Tools: Mocks, Fakes, and Observability

Master the developer workflow for AI tools. Learn a rigorous methodology for unit testing your functions, mocking external APIs, and debugging the 'Hand-off' between Gemini and your code.

Building a tool for an AI agent is a form of "Double-Ended" programming. You have to ensure the code works for the Machine (the Python runtime) and for the Agent (the Gemini reasoner). If your tool works perfectly but has a confusing docstring, the agent will never use it. If your docstring is perfect but you have a hidden NoneType error in your code, the agent will crash at runtime.

In this lesson, we will explore the Tool Testing Pipeline. We will learn how to unit test tool logic in isolation, how to use "Mocks" to simulate expensive external APIs, and how to use observability tools to debug the "Silent Failures" of the Gemini ADK.


1. The Three Layers of Tool Testing

To ensure your agent is production-ready, you must test on three distinct layers.

Layer 1: Unit Testing (The Code)

Does the Python function work correctly in isolation?

  • Goal: Verify logic, math, and data transformations.
  • Tools: pytest, unittest.

Layer 2: Integration Testing (The Schema)

Does the Gemini ADK successfully convert your function into a valid JSON schema?

  • Goal: Verify that every parameter has a type hint and that a docstring is present, so the generated schema is compliant with the Gemini API (see the sketch at the end of this section).

Layer 3: End-to-End Testing (The Loop)

Does Gemini actually call the tool when prompted with a relevant user query?

  • Goal: Verify that the "Intent Mapping" (Docstring -> Reasoner) is working.
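A Layer 2 check does not need to call the model at all. Here is a minimal sketch, assuming a hypothetical get_weather tool: it verifies the prerequisites the ADK relies on to build the schema, namely that every parameter is type-hinted and a docstring exists.

import inspect
import typing

# Hypothetical tool used only for this test
def get_weather(city: str, unit: str = "celsius") -> str:
    """Returns the current weather for a given city."""
    return f"Weather for {city} in {unit}: sunny"

# Layer 2: verify the raw material the ADK needs to build a schema
def test_tool_is_schema_ready():
    assert get_weather.__doc__, "Tool must have a docstring"
    hints = typing.get_type_hints(get_weather)
    for name in inspect.signature(get_weather).parameters:
        assert name in hints, f"Parameter '{name}' is missing a type hint"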

2. Mocking and Faking External APIs

You don't want to spend real money or hit your production Slack channel every time you run a test. We use Mocks (simulated objects) to return "Fake" but valid data.

Use Case: Faking an External Status API

Instead of hitting the real API, the test below returns a hard-coded JSON object. This keeps your tests Deterministic and Free.

from unittest.mock import patch

import requests

# Our tool to test: calls an external status endpoint and returns the status field
def get_status_tool():
    resp = requests.get("https://api.system.com/status")
    return resp.json()['status']

# Our test
@patch('requests.get')
def test_get_status_tool(mock_get):
    # We define what the 'Fake' API returns
    mock_get.return_value.json.return_value = {"status": "ONLINE"}
    
    result = get_status_tool()
    assert result == "ONLINE"
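
A Mock patches a library in place; a Fake is a stand-in object you hand to the tool explicitly. Here is a minimal sketch of the same test written with a Fake (the client parameter and FakeStatusClient class are illustrative, not part of any SDK):

# The tool accepts its client as a parameter (dependency injection)
def get_status_tool_v2(client):
    return client.fetch_status()["status"]

# An in-memory Fake that never touches the network
class FakeStatusClient:
    def fetch_status(self):
        return {"status": "ONLINE"}

def test_get_status_tool_v2():
    assert get_status_tool_v2(FakeStatusClient()) == "ONLINE"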

3. Debugging "Tool Misses"

A "Tool Miss" is when a user asks a question, you have a tool for it, but Gemini decides not to use it.

Common Causes:

  1. Vague Docstring: The docstring doesn't mention the keyword the user used.
  2. Schema Conflict: Another tool has a similar description, and Gemini is "confused."
  3. Low Temperature: At temperature 0.0, Gemini always takes its single most likely path; if that path is answering from internal knowledge, it never "ventures out" to use the tool.

The Debugging Workflow:

  • Step 1: Print out response.candidates[0].content.parts (a helper for this is sketched after this list).
  • Step 2: Check each part for a function_call. If none is present, Gemini answered from its internal weights instead of calling your tool.
  • Step 3: Re-write the docstring to be more "Aggressive" (e.g., "ALWAYS use this tool if the user mentions money").
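
A small helper for Steps 1 and 2 might look like this sketch. It assumes a response object already returned by the google-generativeai SDK's generate_content call; the inspect_tool_calls name is ours, not part of the SDK.

def inspect_tool_calls(response):
    """Prints whether the model requested a tool or answered from its weights."""
    parts = response.candidates[0].content.parts
    called = False
    for part in parts:
        fn = getattr(part, "function_call", None)
        if fn and fn.name:
            called = True
            print(f"Tool requested: {fn.name} with args {fn.args}")
    if not called:
        print("Tool Miss: the model answered from internal weights.")
        print("Raw parts:", parts)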

The full debugging loop, visualized as a flowchart:

graph TD
    A[Launch Agent Test] --> B[Gemini Reasoner]
    B --> C{Tool Called?}
    C -->|Yes| D[Execute Tool]
    C -->|No| E[Examine 'Logic Trace']
    E --> F[Refine Docstring / Schema]
    F --> A
    D --> G{Output Correct?}
    G -->|No| H[Debug Python Function]
    G -->|Yes| I[Success]

4. Observability: Tracking Tool Performance

In production, you need to know which tools are failing and why. A lightweight wrapper, sketched after this list, can capture these signals on every call.

  • Success vs. Failure Rate: How many times did the tool return an ERROR string?
  • Latency Distribution: Is the search_tool taking 10 seconds?
  • Argument Distributions: What are the most common arguments Gemini is passing? (This helps you spot "Drift" where Gemini starts using the tool for unintended purposes).
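
Here is a minimal sketch of such a wrapper, using only the standard library (the observed decorator, its log format, and the stub search_tool are ours, not part of the ADK):

import functools
import json
import logging
import time

logging.basicConfig(level=logging.INFO)

def observed(tool_fn):
    """Logs latency, arguments, and success/failure for every tool call."""
    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = tool_fn(*args, **kwargs)
        latency_ms = (time.perf_counter() - start) * 1000
        failed = isinstance(result, str) and result.startswith("ERROR")
        logging.info(json.dumps({
            "tool": tool_fn.__name__,
            "latency_ms": round(latency_ms, 1),
            "args": repr(args),
            "kwargs": repr(kwargs),
            "status": "failure" if failed else "success",
        }))
        return result
    return wrapper

@observed
def search_tool(query: str) -> str:
    """Stub search tool used only to demonstrate the wrapper."""
    return f"Results for: {query}"

Every call to search_tool now emits one JSON log line, which is exactly the record you need to compute failure rates, latency distributions, and argument drift.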

5. Validating Tool Output (Type Guarding)

Gemini creates a "Mathematical Plan" based on what your tool says it will return. If your docstring says the tool returns an int but it actually returns a list, the model's "Planning" step will be corrupted for the next turn.

The "Clean Result" Rule:

Tools should always return a Single, Predictable Type (usually a String or a simple JSON object). Avoid returning complex Python objects (Classes, FileHandles) because Gemini cannot "see" inside them.
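
One way to enforce this rule is a small guard that runs on every tool return value before it is handed back to the model. This is a sketch of the idea, not an ADK feature; the clean_result name is ours.

import json

def clean_result(value) -> str:
    """Coerces a tool's return value into a single predictable type: a JSON-safe string."""
    try:
        # Plain strings pass through; dicts/lists/numbers become JSON text.
        if isinstance(value, str):
            return value
        return json.dumps(value)
    except TypeError:
        # Complex Python objects (classes, file handles) are not model-readable.
        return f"ERROR: tool returned a non-serializable {type(value).__name__}"

# Example: a dict result becomes a JSON string the model can read
print(clean_result({"status": "ONLINE", "latency_ms": 42}))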


6. Implementation: A pytest Suite for a Tool

Let's look at how we rigorously test a "Calculate ROI" tool.

import pytest

# Our Tool
def calculate_roi(investment: float, gain: float):
    """Calculates ROI %. Args: investment, gain."""
    if investment == 0:
        return "ERROR: Investment cannot be zero."
    return ((gain - investment) / investment) * 100

# Our Test Suite
def test_calc_roi_success():
    assert calculate_roi(100, 150) == 50.0

def test_calc_roi_failure():
    # Test that we handle errors with a string (instructional error)
    response = calculate_roi(0, 50)
    assert "ERROR" in response

def test_calc_roi_negative():
    assert calculate_roi(100, 50) == -50.0
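
Save these tests in a file such as test_roi_tool.py and run pytest -v; because nothing here touches a network or an LLM, the suite runs in milliseconds and can sit in your CI pipeline.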

7. The "Golden Dataset" for Agents

A Golden Dataset is a collection of 50-100 (User Input, Expected Tool Call) pairs.

  • Every time you change your System Prompt or your Tool Docstrings, you run this dataset (a minimal harness is sketched after this list).
  • If your "Accuracy" drops from 95% to 80%, you know your recent changes have triggered a "Regression."
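
A minimal harness might look like the sketch below. The dataset entries and the get_called_tool helper are illustrative: the helper stands in for whatever code sends the query to Gemini and extracts the function_call name from response.candidates[0].content.parts (as in Section 3).

# (query, expected tool name) pairs -- in practice, 50-100 of these
GOLDEN_DATASET = [
    ("What's the ROI on a $100 investment that returned $150?", "calculate_roi"),
    ("Is the payment system online right now?", "get_status_tool"),
    ("Write me a haiku about autumn.", None),  # no tool expected
]

def evaluate(get_called_tool):
    """Runs the golden dataset and reports tool-selection accuracy."""
    hits = 0
    for query, expected in GOLDEN_DATASET:
        actual = get_called_tool(query)  # the tool name Gemini picked, or None
        if actual == expected:
            hits += 1
        else:
            print(f"MISS: {query!r} -> expected {expected}, got {actual}")
    accuracy = hits / len(GOLDEN_DATASET) * 100
    print(f"Tool-selection accuracy: {accuracy:.0f}%")
    return accuracy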

8. Summary and Exercises

Testing is the difference between a "Toy" and a Tool.

  • Unit Tests prove the Python logic is sound.
  • Mocks prevent expensive and non-deterministic API calls during testing.
  • Observability identifies performance bottlenecks in production.
  • Golden Datasets protect your agent against regressions.

Exercises

  1. Mocking Challenge: Write a test for a send_tweet tool. Use a mock to ensure that the "Twitter API" is never actually reached, but the function returns SUCCESS.
  2. Docstring Debugging: Take a tool with a very vague name like data_manager. Write a user query that causes a "Tool Miss." Now, rename it and re-write the docstring to ensure Gemini always picks it.
  3. Trace Analysis: Use the google-generativeai SDK to output the FunctionCall and FunctionResponse objects for a 3-turn interactive session. Can you follow the data flow perfectly?

In the next module, we move into the operational side of agents: Agentic RAG and Knowledge Bases.
