
Latency Optimization: Building High-Speed AI Agents
Master the race against time. Learn techniques for optimizing Time-to-First-Token (TTFT), implementing parallel execution, and selecting the right Gemini models to build agents that respond in milliseconds.
In the world of AI agents, Latency is the Killer of UX. If a user asks a question and has to wait 15 seconds for a response, they will perceive the system as "broken" or "slow," regardless of how intelligent the answer is. When building with the Gemini ADK, you must optimize for two metrics: Time to First Token (TTFT), or how fast the model starts talking, and Total Execution Time, or how fast the entire task finishes.
In this lesson, we will explore the engineering techniques for shaving seconds off your agentic loops, including model tiering, parallel execution, and the psychology of streaming.
1. Understanding the Latency Pipeline
To optimize speed, you must first understand where the time is spent.
- Network Overhead: Data traveling from your server to Google's API.
- Pre-fill (Prompt Processing): The model reading your input and context.
- Inference (Generation): The model calculating and outputting tokens.
- Tool Execution: Time spent running your Python code or waiting for an external API.
graph LR
A[Request Sent] -->|Net| B[Gemini Read Input - Pre-fill]
B -->|Fast Flash| C[Gemini Generate - Inference]
B -->|Slow Pro| C
C -->|Tool Call| D[External API / Local Code]
D -->|Observation| B
C -->|Final Text| E[User Display]
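To see where your own requests spend time, add timestamps around a streamed call: the gap before the first chunk approximates network overhead plus pre-fill, and the rest is generation. A minimal sketch, assuming the google-generativeai SDK; the prompt and API key are placeholders:

import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

start = time.perf_counter()
response = model.generate_content("Summarize today's traffic report.", stream=True)

first_token_at = None
for chunk in response:
    if first_token_at is None:
        # TTFT: network overhead + pre-fill + the first decode step
        first_token_at = time.perf_counter()
end = time.perf_counter()

print(f"TTFT:  {first_token_at - start:.2f}s")
print(f"Total: {end - start:.2f}s")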
2. Model Tiering: Flash vs. Pro
The most effective latency optimization is choosing Gemini 1.5 Flash whenever possible.
- Gemini Pro: High reasoning, high latency (can take 2-5 seconds for TTFT). Best for "Final Synthesis."
- Gemini Flash: Lower reasoning, ultra-low latency (often sub-second TTFT). Best for "Routing" and "Small Tool Calls."
The Pro Strategy: Use Flash for the "Inner Loops" (searching, parsing, error checking) and only use Pro for the "Outer Final Response."
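A minimal sketch of this tiering, assuming the google-generativeai SDK; the routing prompt and the "simple"/"complex" labels are illustrative, and genai.configure(...) is assumed to have been called already:

import google.generativeai as genai

# Fast, cheap model for inner-loop work (routing, parsing, error checking).
flash = genai.GenerativeModel("gemini-1.5-flash")
# Slower, stronger model reserved for the final user-facing synthesis.
pro = genai.GenerativeModel("gemini-1.5-pro")

def answer(question: str) -> str:
    # Inner loop: classify the request with Flash (sub-second TTFT).
    route = flash.generate_content(
        f"Classify this request as 'simple' or 'complex'. Reply with one word.\n\n{question}"
    ).text.strip().lower()

    # Outer step: only pay Pro's latency when the synthesis really needs it.
    model = pro if "complex" in route else flash
    return model.generate_content(question).text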
3. Parallelism: The "Fan-Out" Pattern
If your agent needs to call three tools (e.g., Get Weather, Get Traffic, Get News), don't do them one by one.
Sequential (Slow):
- Call Tool 1 (2s) -> Call Tool 2 (2s) -> Call Tool 3 (2s) = 6 seconds.
Parallel (Fast):
- Call Tools 1, 2, and 3 simultaneously via asyncio = 2 seconds.
ADK Implementation: Gemini 1.5 supports Parallel Function Calling, meaning it can emit multiple tool requests in a single turn. Your Python code should be prepared to receive several calls in one response and execute them concurrently.
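A sketch of how you might collect those calls from a single turn before fanning them out, assuming the response layout exposed by the google-generativeai SDK; handle_parallel_tools is the wrapper built in Section 7:

def extract_tool_calls(response):
    # Gemini 1.5 can return several function_call parts in one candidate.
    return [part.function_call for part in response.parts if part.function_call]

# tool_calls = extract_tool_calls(response)
# results = await handle_parallel_tools(tool_calls)  # fan-out, see Section 7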
4. Prompt Optimization: Less is More
Every token you send to the model adds to the Pre-fill latency.
- Be Precise: Avoid redundant text in your system instruction.
- Structure with Markdown: Use clear headers (#, ##). Models parse structured text faster than rambling paragraphs.
- Limit History: Don't send the entire 50-turn history if only the last 5 turns are relevant.
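History trimming can be as simple as a sliding window over the turn list. A minimal sketch, assuming each turn is stored as a dict in the SDK's role/parts shape and that five turns is enough context:

MAX_TURNS = 5  # keep only the most recent exchanges

def trim_history(history: list[dict]) -> list[dict]:
    # Each item is one turn, e.g. {"role": "user", "parts": ["..."]}.
    # Dropping old turns directly shrinks the pre-fill the model must read.
    return history[-MAX_TURNS:]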
5. The UX of Speed: Streaming
As we learned in Module 10, Streaming doesn't change the Total Time, but it changes the Perceived Time.
- Non-Streaming: User waits 5s -> Full paragraph appears. (Feels slow).
- Streaming: User waits 0.5s -> Words start appearing one by one. (Feels instant).
Always use stream=True for any agent intended for human interaction.
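A minimal streaming sketch with the google-generativeai SDK; the console print stands in for whatever UI you push tokens to, and genai.configure(...) is assumed to have been called:

import google.generativeai as genai

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Explain TTFT in two sentences.", stream=True)

for chunk in response:
    # Display each chunk as soon as it arrives instead of
    # waiting for the full paragraph.
    print(chunk.text, end="", flush=True)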
6. Optimization for Tools and API Connectors
If your tool calls a slow external API (like a 10-second DB query), the agent will hang.
- Timeouts: Set a hard timeout (e.g., 3s) for every tool. If the limit is exceeded, return an error to Gemini so it can try a different approach.
- Async/Await: Use asyncio to prevent a slow tool from blocking your server's main thread.
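Both ideas fit in a small wrapper around any tool coroutine. A sketch using asyncio.wait_for; the 3-second budget and the error payload shape are illustrative:

import asyncio

async def call_tool_with_timeout(coro, timeout_s: float = 3.0):
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        # Return a structured error so Gemini can reason about a fallback
        # instead of the whole agent loop hanging.
        return {"error": f"Tool timed out after {timeout_s}s"}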
7. Implementation: A Parallel Execution Wrapper
Let's look at how we can handle multiple tool calls concurrently in Python.
import asyncio
import google.generativeai as genai

# A slow mock tool
async def slow_fetch(item: str):
    await asyncio.sleep(2)  # Simulate network lag
    return f"Data for {item}"

async def handle_parallel_tools(tool_calls):
    # 'tool_calls' is a list of function calls emitted by Gemini
    tasks = []
    for call in tool_calls:
        # Wrap each call in a coroutine to run concurrently
        tasks.append(slow_fetch(call.args['item']))
    # Run all tasks concurrently
    results = await asyncio.gather(*tasks)
    return results

# This approach reduces wait time from 2s * N to a flat 2s.
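To see the effect, you can drive the wrapper with a few mock calls; SimpleNamespace stands in for the SDK's function-call objects here, which is an assumption about their shape:

import time
from types import SimpleNamespace

mock_calls = [SimpleNamespace(args={"item": name}) for name in ("weather", "traffic", "news")]

start = time.perf_counter()
results = asyncio.run(handle_parallel_tools(mock_calls))
print(results)
print(f"Elapsed: {time.perf_counter() - start:.1f}s")  # ~2s for three calls, not ~6s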
8. Summary and Exercises
Latency optimization is an Engineering Discipline, not a prompt trick.
- Flash is the model for speed.
- Parallelism collapses sequential wait times into a single wait.
- Streaming optimizes for human psychology.
- Caching (Module 14.1) eliminates the pre-fill bottleneck for long-context prompts.
Exercises
- Latency Audit: Time an agentic session. How much time is spent in "Pre-fill" vs "Generation"? (Hint: Use timestamps before and after the API call).
- Parallel Challenge: Rewrite a "Travel Trip" agent that searches for Hotels and Flights sequentially. Change it to use Parallel Function Calling. Measure the speed improvement.
- Prompt Pruning: Take a 2,000-word system instruction. Try to cut it down to 500 words without losing the agent's "Identity." Does the response time improve?
In the next lesson, we will look at Monitoring and Observability, learning how to track these latency metrics at scale.