Monitoring and Observability: Tracking Tokens, Costs, and Success

Build confidence in your production agents. Learn to implement comprehensive monitoring for token usage, operational costs, and agent success rates using Google Cloud and custom observability patterns.

Once your Gemini ADK agent is deployed, it becomes a black box: you cannot see what it is "thinking" in real time unless you build the infrastructure to peer inside. And because agents are non-deterministic and consume tokens aggressively, an unmonitored agent can quickly exceed your budget or start exhibiting "logical drift" without you ever knowing.

In this lesson, we will explore the Three Pillars of Agent Observability: Efficiency Monitoring (Tokens/Cost), Operational Monitoring (Errors/Latency), and Behavioral Monitoring (Success Rates/Safety).


1. Pillar 1: Efficiency Monitoring (Tokens & Cost)

In an agentic loop, costs grow non-linearly: every turn re-sends the entire previous conversation history as part of the prompt, so a 20-turn session can consume many times the tokens of 20 independent calls.

What to Track:

  • Total Token Count: Sum of input + output tokens per session.
  • Cache Hit Ratio: How much money are you saving with Prompt Caching?
  • Cost per Task: How much does it cost to solve a "Customer Ticket"?

Implementation Strategy:

Every Gemini API response includes a usage_metadata block. You should extract this block and send it to your time-series database (like InfluxDB or BigQuery).

{
  "prompt_token_count": 1050,
  "candidates_token_count": 150,
  "total_token_count": 1200
}
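
Once these counts are being captured, a small aggregator can turn them into the metrics listed above. Here is a minimal sketch; the per-1K-token rates are placeholder values, and the cached_content_token_count field should be verified against your SDK version before you rely on it.

from collections import defaultdict

# Placeholder rates per 1K tokens -- substitute your model's actual pricing.
INPUT_RATE_PER_1K = 0.0001
OUTPUT_RATE_PER_1K = 0.0004

session_totals = defaultdict(lambda: {"input": 0, "output": 0, "cached": 0})

def record_usage(session_id, usage_metadata):
    """Accumulate token counts from a single Gemini response."""
    totals = session_totals[session_id]
    totals["input"] += usage_metadata.prompt_token_count
    totals["output"] += usage_metadata.candidates_token_count
    # Only populated when prompt caching is in use; treat missing as zero.
    totals["cached"] += getattr(usage_metadata, "cached_content_token_count", 0) or 0

def session_report(session_id):
    """Compute cost per task and cache hit ratio for one session."""
    t = session_totals[session_id]
    cost = (t["input"] / 1000) * INPUT_RATE_PER_1K + (t["output"] / 1000) * OUTPUT_RATE_PER_1K
    cache_ratio = t["cached"] / t["input"] if t["input"] else 0.0
    return {"cost_per_task": round(cost, 6), "cache_hit_ratio": round(cache_ratio, 3)}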

2. Pillar 2: Operational Monitoring (Errors & Latency)

You need to know when the "Nervous System" of your agent is failing.

Key Metrics:

  • Tool Error Rate: How often are your Python functions failing?
  • TTFT (Time to First Token): Is your agent starting its response fast enough?
  • Recursion Depth: Is the agent stuck in a loop, calling the same tool 10 times in a row? This is a classic sign of "Agentic Failure" (a simple detector is sketched below).
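
Tool error rate and TTFT fall out of ordinary request logging, but catching a loop needs a little state of its own. Below is a minimal sketch of a repeated-call detector; the class name and the threshold of three repeats are illustrative choices, not part of the ADK.

from collections import deque

class LoopDetector:
    """Flags a possible agentic loop when identical tool calls repeat back to back."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.recent_calls = deque(maxlen=max_repeats)

    def check(self, tool_name, tool_args):
        # A call's "signature" is its name plus its arguments.
        signature = (tool_name, str(sorted(tool_args.items())))
        self.recent_calls.append(signature)
        # True once the last `max_repeats` calls are all identical.
        return (len(self.recent_calls) == self.max_repeats
                and len(set(self.recent_calls)) == 1)

# In your agent loop:
# detector = LoopDetector(max_repeats=3)
# if detector.check(tool_name, tool_args):
#     logging.warning("Possible agentic loop detected; consider aborting the session")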

3. Pillar 3: Behavioral Monitoring (Success & Safety)

This is the hardest type of monitoring. How do you know if the agent actually helped the user?

A. The "Evaluator Agent" Pattern

One of the best ways to monitor an agent is to use another agent (The Critic). At the end of every session, you send the log to a "Critic Agent" and ask: "On a scale of 1-10, how well did the Worker solve the user's problem? Did they violate any safety rules?"
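
As a sketch of what the Critic could look like with the google-generativeai SDK (the model name, prompt wording, and JSON fields are illustrative choices, and it assumes genai.configure() has already been called):

import json
import google.generativeai as genai

CRITIC_INSTRUCTION = (
    "You are a quality auditor. Read the conversation log you are given and return JSON "
    "with 'score' (1-10: how well the worker agent solved the user's problem) and "
    "'safety_violation' (true/false)."
)

# Assumes genai.configure(api_key=...) has already been called elsewhere.
critic = genai.GenerativeModel("gemini-1.5-flash", system_instruction=CRITIC_INSTRUCTION)

def grade_session(session_log: str) -> dict:
    """Send a finished session transcript to the Critic and parse its JSON verdict."""
    response = critic.generate_content(
        session_log,
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)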

B. User Feedback (Implicit vs. Explicit)

  • Explicit: Did the user click a "Thumbs Up" button?
  • Implicit: Did the user continue the conversation, or did they stop and ask the same question again?

The diagram below shows how these monitoring signals flow from a single agent session into dashboards and alerting:

graph TD
    A[Agent Session] --> B[Usage Metadata]
    A --> C[Tool Execution Logs]
    A --> D[Final Response]
    
    B --> E[Cost Dashboard]
    C --> F[Error Alerting]
    D --> G[Evaluator Agent]
    
    G -->|Grade| H[Quality Dashboard]
    
    style G fill:#4285F4,color:#fff
    style F fill:#EA4335,color:#fff

4. Using Google Cloud Monitoring and Trace

If you are deploying on Google Cloud, you can integrate with Cloud Trace (an instrumentation sketch follows the bullets below).

  • Each turn in an agentic loop is recorded as a "Span."
  • You can see exactly how much time was spent in the Model vs. the Database vs. the External API.
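
One common way to produce those spans is OpenTelemetry with the Cloud Trace exporter (the opentelemetry-exporter-gcp-trace package). The snippet below is a sketch rather than an ADK feature; the span names and attribute keys are arbitrary choices.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter

# One-time setup: ship finished spans to Cloud Trace in batches.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(CloudTraceSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

def run_turn(chat, user_input):
    """Wrap one agent turn in spans so model time shows up separately in the trace."""
    with tracer.start_as_current_span("agent_turn") as turn_span:
        with tracer.start_as_current_span("model_call"):
            response = chat.send_message(user_input)
        turn_span.set_attribute("tokens.total", response.usage_metadata.total_token_count)
        return response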

5. Implementation: A Logging Middleware

Let's build a simple Python wrapper that logs every turn to a central file for later analysis.

import json
import logging
import time

# Set up a file-based logger; swap in Cloud Logging or BigQuery for production.
logging.basicConfig(filename='agent_observability.log', level=logging.INFO)

def log_agent_turn(session_id, user_input, model_response):
    # Every Gemini response carries a usage_metadata block with token counts.
    metadata = model_response.usage_metadata

    log_entry = {
        "timestamp": time.time(),
        "session": session_id,
        "input": user_input,
        "input_tokens": metadata.prompt_token_count,
        "output_tokens": metadata.candidates_token_count,
        # Flat example rate; real pricing differs for input vs. output tokens.
        "total_cost": (metadata.total_token_count / 1000) * 0.001,
        # block_reason is falsy when the prompt was not blocked by safety filters.
        "success": not model_response.prompt_feedback.block_reason,
    }

    # One JSON line per turn keeps the file easy to load into BigQuery later.
    logging.info(json.dumps(log_entry))

# In your main loop:
# response = chat.send_message(user_input)
# log_agent_turn("user_123", user_input, response)

6. Managing the "Trace Bloat"

In high-volume systems, logging every token of every turn can create gigabytes of data.

  • Sample your traces: Log 100% of errors, but only 5% of successful turns for analysis.
  • Redact PII: Ensure your logging infrastructure doesn't accidentally store the user's credit card number that the agent processed. Both techniques are sketched below.
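
A minimal sketch of both ideas, assuming a 5% sample rate and a couple of crude regex patterns (real PII detection should use a dedicated service such as Cloud DLP):

import random
import re

TRACE_SAMPLE_RATE = 0.05  # keep 5% of successful turns; errors are always kept

# Crude illustrative patterns -- extend or replace for your own PII categories.
PII_PATTERNS = [
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[REDACTED_CARD]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
]

def should_log(turn_failed: bool) -> bool:
    """Log 100% of failed turns, but only a random sample of successful ones."""
    return turn_failed or random.random() < TRACE_SAMPLE_RATE

def redact(text: str) -> str:
    """Scrub obvious PII before the log line leaves the process."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text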

7. Defining "Alerting" Thresholds

Don't just watch the dashboards; set alerts (a minimal threshold check is sketched after this list).

  • Alert: If an agent call costs more than $1.00.
  • Alert: If more than 3 consecutive turns have the same tool call.
  • Alert: If the safety filter returns "BLOCKED" more than 5 times in an hour.
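
A minimal threshold check along these lines might look like the following; the constants mirror the rules above, and the delivery mechanism (Slack, PagerDuty, Cloud Monitoring alerting policies) is up to you:

# Thresholds mirroring the alerting rules above.
MAX_COST_PER_CALL = 1.00
MAX_REPEATED_TOOL_CALLS = 3
MAX_BLOCKS_PER_HOUR = 5

def check_alerts(call_cost, repeated_tool_calls, blocks_last_hour):
    """Return a list of alert messages; wire the result to your paging or chat system."""
    alerts = []
    if call_cost > MAX_COST_PER_CALL:
        alerts.append(f"Cost alert: single call cost ${call_cost:.2f}")
    if repeated_tool_calls > MAX_REPEATED_TOOL_CALLS:
        alerts.append(f"Loop alert: same tool called {repeated_tool_calls} times in a row")
    if blocks_last_hour > MAX_BLOCKS_PER_HOUR:
        alerts.append(f"Safety alert: {blocks_last_hour} BLOCKED responses in the last hour")
    return alerts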

8. Summary and Exercises

Monitoring is the Feedback Loop of your business.

  • Efficiency Monitoring keeps the project profitable.
  • Operational Monitoring keeps the project reliable.
  • Behavioral Monitoring (Evaluator Agents) keeps the project intelligent.
  • Alerting prevents catastrophic budget or safety failures.

Exercises

  1. Metric Selection: You are building an agent for a medical clinic. What is the single most important "Success Metric" you would monitor? What is the most important "Safety Metric"?
  2. Dashboard Design: Sketch a dashboard showing 4 key charts for an agent. What would you put on those charts (e.g., Latency over time, Cost per user, etc.)?
  3. Evaluator Prompt: Write a system instruction for an "Evaluator Agent" that reviews chat logs between a customer and a support bot. What criteria should it use to give a "Score" of 1 to 5?

In the next module, we move from measurement to deployment: Deploying Agents to Production (AWS & K8s).
