Prompt Caching: Reducing Latency and Cost in Long-Context Agents

Master the most powerful optimization tool in the Gemini ecosystem. Learn how to use Context Caching to reuse massive datasets across multiple agent turns, drastically reducing both latency and operational costs.

Gemini 1.5 Pro's 2-million token context window is a developer's dream, but it can also be a financial nightmare. If you send a 1-million token codebase to the model in every turn of a conversation, you are paying for those 1 million tokens over and over again. Furthermore, the model has to "re-read" those tokens on every turn, adding 60+ seconds of latency to every interaction.

The solution is Prompt Caching (also known as Context Caching). This feature allows the model to "pre-process" and store a large block of text so that it doesn't have to be re-processed for subsequent calls. In this lesson, we will explore the economics of caching, learn how to implement it using the Gemini ADK, and see how it transforms the performance of long-context agents.


1. What is Prompt Caching?

When you send a request to an LLM, the model must "attend" to every token in the prompt. This "Pre-fill" phase is computationally expensive. Prompt Caching takes the state of the model after it has read the first $N$ tokens and stores it in high-speed memory on Google's servers.

The Benefits:

  1. Lower Latency: Follow-up queries that use the cache respond much faster (often by 50-90%).
  2. Lower Cost: You pay a one-time "Cache Write" fee and a small "Storage" fee, but subsequent "Cache Hits" are significantly cheaper than the standard input token price.

2. When to Use Caching

Prompt Caching is ideal for Large, Static Datasets that you need to access repeatedly.

Core Use Cases:

  • Codebase Agents: Caching 10,000 files of source code while you ask the agent to find bugs or refactor functions.
  • Document Analysts: Caching 500 pages of legal contracts while you ask a series of specific compliance questions.
  • Long Chat Sessions: Caching the first 50 turns of a complex, month-long planning project so the agent never loses context.
  • Multimodal Archives: Caching a 1-hour video so you can ask multiple questions about specific events without re-uploading the whole file.

3. The Economics of Caching

Google Cloud (and AI Studio) uses a "TTL" (Time To Live) model for caches.

  • Write Cost: The initial cost to ingest the data into the cache.
  • Storage Cost: A fee per hour while the data remains in the cache.
  • Hit Cost: The discounted rate you pay when a prompt successfully matches the cached tokens.

The Golden Threshold: Typically, if you plan to ask more than 4-5 questions about the same large context (>32k tokens), caching becomes more cost-effective than sending raw tokens every time.
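
To see why, here is a rough break-even calculation. The prices below are placeholders, not real Gemini rates; substitute the current per-million-token prices from the Google Cloud pricing page before drawing conclusions.

# Rough break-even sketch. All prices are PLACEHOLDERS, not real Gemini rates.
INPUT_PRICE = 1.00             # $ per 1M input tokens (placeholder)
CACHE_WRITE_PRICE = 1.00       # $ per 1M tokens written to the cache (placeholder)
CACHE_HIT_PRICE = 0.25         # $ per 1M cached tokens read on a hit (placeholder)
STORAGE_PRICE_PER_HOUR = 0.10  # $ per 1M cached tokens stored per hour (placeholder)

context_m = 1.0   # a 1M-token context
turns = 5         # number of questions asked about it
hours = 1         # how long the cache stays alive

without_cache = turns * context_m * INPUT_PRICE
with_cache = (context_m * CACHE_WRITE_PRICE
              + turns * context_m * CACHE_HIT_PRICE
              + hours * context_m * STORAGE_PRICE_PER_HOUR)

print(f"No cache:   ${without_cache:.2f}")
print(f"With cache: ${with_cache:.2f}")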

graph TD
    A[Start: Initial Turn] --> B[Process Full Context]
    B --> C[Write to Cache]
    C --> D[Generate Response]
    D --> E[Wait for Next Turn]
    E --> F[New User Prompt]
    F --> G{Cache Hit?}
    G -->|Yes| H[Retrieve Cached Context - FAST & CHEAP]
    G -->|No| I[Re-process Whole Context - SLOW & EXPENSIVE]
    H --> J[Generate Response]
    I --> J

    style H fill:#34A853,color:#fff
    style I fill:#EA4335,color:#fff

4. Implementation: Persistent Caching with the Python SDK

Let's look at how we create and use a cache for a massive technical manual.

import datetime
import os

import google.generativeai as genai

# Authenticate (assumes GOOGLE_API_KEY is set in the environment)
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# 1. Prepare the large context
with open("industrial_manual_v10.txt", "r") as f:
    manual_data = f.read()

# 2. Create the cache with a TTL (Time-To-Live), e.g. 2 hours
cache = genai.caching.CachedContent.create(
    model='models/gemini-1.5-pro-002',
    display_name='operations_manual',
    system_instruction='You are a factory operations assistant. Use the manual to answer questions.',
    contents=[manual_data],
    ttl=datetime.timedelta(hours=2),
)

# 3. Use the Cache in an Agent
# Note: We bind the model TO THE CACHE
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# First Prompt (Uses Cache - Fast!)
response1 = model.generate_content("What is the shutdown procedure for Boiler 4?")
print(response1.text)

# Second Prompt (Still Uses Cache - Fast!)
response2 = model.generate_content("What are the safety requirements for Boiler 4?")
print(response2.text)

# 4. Cleanup (You can delete manually or let TTL expire)
# cache.delete()
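
If a session runs long, you can also inspect or extend an existing cache. A minimal sketch, assuming the caching helpers in the google-generativeai SDK (verify the method names against your installed version):

# List existing caches and when they expire
for c in genai.caching.CachedContent.list():
    print(c.display_name, c.expire_time)

# Extend the TTL so the cache outlives a longer-than-expected session
cache.update(ttl=datetime.timedelta(hours=4))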

5. Identifying Cache Hits

How do you know if your cache is working? The API response includes a usage_metadata field.

  • cached_content_token_count: Shows how many tokens were retrieved from the cache.
  • If this number is greater than 0, your cache is active and saving you money.
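
For example, using response2 from the implementation above (field names as exposed by the google-generativeai SDK):

# Check whether the cache was actually used for this call
usage = response2.usage_metadata
print("Cached tokens:", usage.cached_content_token_count)  # > 0 means a cache hit
print("Prompt tokens:", usage.prompt_token_count)          # total input tokens billed
print("Output tokens:", usage.candidates_token_count)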

6. Limits and Constraints

  1. Minimum Size: Caching only works for prompts larger than a certain threshold (currently 32,768 tokens). For small prompts, the overhead of caching isn't worth the gain.
  2. Model Specificity: A cache created for gemini-1.5-pro cannot be used with gemini-1.5-flash.
  3. Static Prefix: The cached tokens must be at the very beginning of the prompt. If you change even one character at the start of the manual, the cache will "miss" and you'll pay full price.

7. Strategy: The "Rolling Chat Cache"

For very long conversations, you can implement a Rolling Cache, as sketched in the code after the steps below.

  1. Conversation reaches 50,000 tokens.
  2. Your app takes the first 45,000 tokens and creates a cache.
  3. Future turns use this cache.
  4. As the conversation grows, you "refresh" the cache by creating a new one that includes the recent turns.
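
A minimal sketch of this pattern, assuming a history list of conversation strings and the hypothetical threshold constant shown here:

import datetime
import google.generativeai as genai

CACHE_THRESHOLD_TOKENS = 50_000  # hypothetical trigger point

def maybe_refresh_cache(model_name, history, current_cache=None):
    # Count the tokens in the conversation so far
    total = genai.GenerativeModel(model_name).count_tokens(history).total_tokens
    if total < CACHE_THRESHOLD_TOKENS:
        return current_cache  # not big enough to be worth caching yet

    # Drop the stale cache (stops storage billing), then freeze the stable
    # prefix of the conversation into a fresh one. In practice you would trim
    # `history` so only the older, unchanging turns are cached.
    if current_cache is not None:
        current_cache.delete()
    return genai.caching.CachedContent.create(
        model=model_name,
        display_name='rolling_chat_cache',
        contents=history,
        ttl=datetime.timedelta(hours=1),
    )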

8. Summary and Exercises

Prompt Caching is the Efficiency Engine of the Gemini ADK.

  • It solves the latency and cost problems of long-context prompts.
  • It is ideal for Static Datasets (codebases, manuals, long chats).
  • It requires a minimum context of roughly 32k tokens.
  • It provides a significant discount on input token pricing.

Exercises

  1. Cache Scenario: You have 100 separate 10-page PDFs. You want to build a "Library Agent." Should you create 100 separate caches, or one giant cache with all PDFs? (Hint: Think about the 32k minimum and the 2M maximum context).
  2. Cost Comparison: Calculate the cost of 5 turns with a 1,000,000 token prompt WITHOUT caching. Now calculate it with caching (initial write + 5 hits + storage). Use current Google Cloud pricing. What percentage do you save?
  3. Cache Freshness: You have a cached codebase. You change one line of code in the middle of a file. Does the cache still hit? If not, how do you handle "Incremental Updates" to a cache?

In the next lesson, we will look at Latency Optimization, exploring how to make our agents feel instantaneous.
