How Prompt Caching Works: The New ROI Layer

Discover one of the most powerful cost-saving features in modern LLM APIs. Learn how prompt caching reduces latency and cuts the cost of repeated tokens by up to 90%.

Until recently, every request to an LLM was treated as "Fresh." If you sent a 10,000-word document and asked 10 questions about it, the provider would process those 10,000 words 10 times, and you would pay for them 10 times.

Prompt Caching changes everything. It allows the model provider to store the "pre-computed" state of your prompt on their GPUs. When you send the same prefix again, the provider reuses that stored state instead of recomputing it.

In this lesson, we explore the mechanics of caching, the "90% Discount" rule, and why caching is one of the biggest breakthroughs in token efficiency since the rise of RAG.


1. The Physics of Caching: K/V Cache

To understand prompt caching, we must look at how LLMs "Think." When a model processes text, it builds a set of intermediate attention states called the K/V Cache (Key/Value Cache): a numerical representation of everything the model has seen so far.

Historically, this K/V cache was discarded after every API call. With Prompt Caching (available from Anthropic, DeepSeek, and OpenAI), that K/V cache is kept alive for a short window and reused across requests.

graph TD
    A[Request 1: 5k tokens] --> B[Compute K/V Cache]
    B --> C[Store in GPU Memory]
    C --> D[Result]
    
    E[Request 2: SAME 5k tokens + New Msg] --> F{Cache Hit?}
    F -- Yes --> G[Instant Recall - Discount 90%]
    F -- No --> H[Re-compute - Full Price]
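
To make the mechanism concrete, here is a toy sketch of prefix-based caching in plain Python. This is not how providers implement it (real caches hold attention states in GPU memory, not strings in a dictionary), but the bookkeeping is the same: key the stored state by the prompt prefix, reuse it on an exact match, and recompute on a miss.

Python Concept: A Toy Prefix Cache

import hashlib

# prefix hash -> stand-in for the stored K/V state
kv_cache: dict[str, str] = {}

def process(prompt: str, cacheable_prefix: str) -> str:
    key = hashlib.sha256(cacheable_prefix.encode()).hexdigest()
    if key in kv_cache:
        # Cache hit: reuse the stored prefix state, compute only the new suffix.
        print("HIT: prefix reused at a steep discount")
    else:
        # Cache miss: compute the full prompt, then store the prefix state.
        kv_cache[key] = "precomputed K/V state"
        print("MISS: full price, prefix written to the cache")
    return "model response"

doc = "[5,000-token document]"
process(doc + "\n\nQuestion 1?", cacheable_prefix=doc)  # MISS (cache write)
process(doc + "\n\nQuestion 2?", cacheable_prefix=doc)  # HIT (discounted)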

2. The "Cachable Prefix" Concept

Caching usually works on a Prefix basis. The provider matches your prompt from the very first token forward, and only an unbroken run of tokens that is identical to a previous request can be reused.

Cacheable Pattern: [System Prompt (Cached)] + [Large Bio (Cached)] + [Latest Question (New)]

Non-Cacheable Pattern: [System Prompt] + [Timestamp (Changes every time)] + [Large Bio]

If you put a piece of dynamic data (like a timestamp or a unique ID) at the beginning of your prompt, you break the cache for everything that follows it.
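
In practice this is a message-ordering problem. The sketch below uses a hypothetical build_messages helper (not tied to any specific SDK) to show the cache-friendly layout: static system prompt and reference document first, anything that changes per request last.

Python Concept: Keeping the Prefix Stable

from datetime import datetime, timezone

SYSTEM_PROMPT = "You are an expert analyst. [LONG INSTRUCTION SET]"  # static, cacheable
LARGE_BIO = "[10,000-token reference document]"                      # static, cacheable

def build_messages(question: str) -> list[dict]:
    """Static content first, dynamic content last, so the cacheable
    prefix is byte-for-byte identical on every request."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": LARGE_BIO},
        # Anything that changes per request (timestamps, IDs, the question)
        # goes at the end so it never invalidates the prefix above it.
        {"role": "user", "content": f"Current time: {datetime.now(timezone.utc).isoformat()}\n\n{question}"},
    ]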


3. The Economics: Paying for Hits and Misses

Most providers use a two-tier pricing model for cached tokens:

  1. Cache Write: You pay the standard input price (some providers add a small premium) to "Insert" the tokens into the cache.
  2. Cache Hit: Every token read from the cache in subsequent turns is billed at roughly 10% of the normal input price (about a 90% discount).

Outcome: If you have a long system prompt (2,000 tokens) reused across 10 turns of conversation, your bill for that prompt drops from 20,000 token-equivalents (2,000 × 10) to roughly 3,800: one full-price write plus nine discounted reads, as the quick calculation below shows.
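
A back-of-the-envelope check of that math, assuming cache reads are billed at roughly 10% of the normal input price and the initial write at the standard rate (actual multipliers vary slightly by provider):

Python Concept: Cached vs. Uncached Billing

SYSTEM_TOKENS = 2_000
TURNS = 10

# Without caching: the full system prompt is billed on every turn.
no_cache = SYSTEM_TOKENS * TURNS

# With caching: one full-price write, then discounted reads on later turns.
with_cache = SYSTEM_TOKENS + SYSTEM_TOKENS * 0.10 * (TURNS - 1)

print(no_cache)    # 20000
print(with_cache)  # 3800.0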


4. Latency: The Sibling Benefit

Prompt caching doesn't just save money; it is the ultimate Latency Killer. Processing 10,000 tokens of input takes time (computing the K/V cache). A "Cache Hit" skips that computation for the cached prefix entirely, so only the new tokens still need to be processed.

Performance Gain (illustrative numbers):

  • No Cache: TTFT (time to first token) = 1.5 seconds.
  • Cache Hit: TTFT = 0.2 seconds.

For real-time agents, this speed difference is the difference between "Helpful" and "Frustrating."
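
If you want to verify this yourself, TTFT is easy to measure around any streaming response. The helper below is a generic sketch: fake_stream simply simulates a model that waits before emitting its first token, and in a real test you would swap in your actual streaming client.

Python Concept: Measuring TTFT

import time
from typing import Iterable, Iterator

def measure_ttft(token_stream: Iterable[str]) -> float:
    """Seconds from starting to iterate until the first token arrives."""
    start = time.perf_counter()
    for _ in token_stream:
        return time.perf_counter() - start
    raise ValueError("stream produced no tokens")

def fake_stream(prefill_delay_s: float) -> Iterator[str]:
    time.sleep(prefill_delay_s)  # stands in for prompt processing (prefill)
    yield "Hello"
    yield " world"

print(f"No cache  TTFT: {measure_ttft(fake_stream(1.5)):.2f}s")
print(f"Cache hit TTFT: {measure_ttft(fake_stream(0.2)):.2f}s")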


5. Implementation: Marking the Cache Point

While specific SDKs vary (which we will cover in the next lesson), the general principle is telling the provider which part of the prompt to "Watch."

Python Concept: Identifying a Cache Point

request_body = {
    "messages": [
        {
            "role": "system",
            "content": "You are an expert. [LONG INSTRUCTION SET]",
            # Hint to the provider to cache everything up to and including this block.
            "cache_control": {"type": "ephemeral"},
        },
        {
            "role": "user",
            "content": "Hello!",
        },
    ]
}
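
On providers that support this style of marker (Anthropic's cache_control is the best-known example), everything up to and including the marked block becomes the cacheable prefix; anything after it is processed fresh on every call. The exact field name and placement vary by provider, which is exactly what the next lesson covers.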

6. When NOT to Use Caching

Caching is not a silver bullet.

  • Short Prompts: If your prompt is only 100 tokens, the setup overhead of caching might not be worth the complexity.
  • High Variability: If you never send the same data twice, your "Hit Rate" will be 0%.
  • Privacy Sensitivity: If you are in a highly regulated industry (FinTech), you must ensure your provider's "Ephemeral Cache" meets your security standards.

7. Summary and Key Takeaways

  1. Prefix is King: Only the start of your prompt can be cached. Keep static data at the top.
  2. 90% Discounts: Caching is one of the most aggressive cost-cutting tools available.
  3. TTFT Reduction: Fast hits mean fast responses.
  4. Middleware Design: Your backend must be designed to group similar requests to maximize hits.

In the next lesson, Caching Strategies for Anthropic/Claude, we look at how to implement this specifically for one of the most popular frontier models.


Exercise: The Cache Break Test

  1. Predict which of these prompts will result in a cache hit if sent sequentially:
    • A: [System V1] + [User: Hello] then [System V1] + [User: Bye]
    • B: [User: Hello] + [System V1] then [User: Bye] + [System V1]
    • C: [System V1] + [Time: 10:00] + [Data] then [System V1] + [Time: 10:01] + [Data]
  • Hint: Only A produces a clean hit. B fails because the prefix changed. C still matches on [System V1], but the changed 'Time' value breaks the cache for the 'Data' block behind it.

Congratulations on completing Module 5 Lesson 1! You are now entering the world of high-performance AI.
