Claude Prompt Caching: Slashing Costs by 90%

Master the implementation of Anthropic's prompt caching. Learn how to use 'Cache Breakpoints', manage cost tiers, and build high-performance Claude applications.

Anthropic was a pioneer in commercializing prompt caching for frontier models. For developers building on Claude (Haiku, Sonnet, Opus), caching is the single most effective way to scale a production app without blowing up the budget.

In this lesson, we dive into Anthropic's specific API implementation, learn how to place "Cache Breakpoints," and measure your "Cache Hit Rate" using Python.


1. The 1,024 Token Minimum

Anthropic's caching only kicks in for large blocks. At the time of writing, the cacheable prefix must be at least 1,024 tokens for the feature to activate (2,048 for the Haiku models).

  • If your system prompt is only 200 tokens, it won't be cached.
  • This encourages you to fold more context into your stable prompt prefix, as long as it is genuinely static (a quick check is sketched below).
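
Before enabling caching, it is worth verifying that your prefix actually clears the floor. A minimal sketch using the SDK's token-counting endpoint, assuming a recent version of the anthropic Python library (the placeholder message is required by the endpoint, not part of your cached prefix):

import anthropic

client = anthropic.Anthropic()

# Count the tokens of the would-be cached prefix (system blocks).
count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20240620",
    system="You are a legal assistant for the following document. [... PDF text ...]",
    messages=[{"role": "user", "content": "placeholder"}],
)

MIN_CACHEABLE = 1024  # 2,048 for Haiku models, at the time of writing
if count.input_tokens >= MIN_CACHEABLE:
    print(f"{count.input_tokens} tokens -- large enough to cache")
else:
    print(f"{count.input_tokens} tokens -- below the caching floor")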

2. Using "Cache Breakpoints"

Unlike some providers that automatically cache the whole prefix, Anthropic lets you set up to four breakpoints. This gives you granular control over where the cached prefix ends.

Scenario: A RAG app with a legal document.

  1. Breakpoint 1: System Instructions (Static).
  2. Breakpoint 2: The 5,000-token Legal Document (Static).
  3. End of Message: User query (Dynamic).

The flow as a Mermaid diagram:

graph LR
    S[System Message] -->|Breakpoint 1| D[Legal PDF Text]
    D -->|Breakpoint 2| U[User Question]

Python Implementation: Sending Cache Hints

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {
          "type": "text", 
          "text": "You are a legal assistant for the following document.",
          "cache_control": {"type": "ephemeral"} # Breakpoint 1
        },
        {
          "type": "text", 
          "text": "[... 5000 tokens of PDF text ...]",
          "cache_control": {"type": "ephemeral"} # Breakpoint 2
        }
    ],
    messages=[{"role": "user", "content": "When does this contract expire?"}]
)
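
A note on versions: prompt caching originally shipped behind a beta header (anthropic-beta: prompt-caching-2024-07-31), but it is now generally available, so recent SDK versions accept cache_control without any extra headers. The "ephemeral" cache entry has a short time-to-live (about five minutes, refreshed on every hit), so steady traffic keeps the prefix warm.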

3. The Price Tiers for Claude Caching

Metric                      Cost Factor
Standard Input              100%
Cache Write (first time)    125%
Cache Hit (subsequent)      10%

Wait, 125%? Yes. You pay a 25% premium the first time you write tokens into the cache. This means that if you only call your prompt once, caching actually increases your cost.

The Sweet Spot: you must call the same prompt at least twice (one cache write, then one cache hit) to start saving money. If you reuse it 100 times, your effective cost per query drops by nearly 90%.
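
A quick back-of-the-envelope helper makes the break-even math concrete. The multipliers come from the table above; the $3 per million token price is an assumption for illustration:

def effective_cost(prefix_tokens: int, calls: int, price_per_mtok: float = 3.00):
    """Compare uncached vs. cached cost (in dollars) for a repeated prompt prefix."""
    uncached = prefix_tokens * calls * price_per_mtok / 1_000_000
    # One write at 125%, then (calls - 1) hits at 10%.
    cached = prefix_tokens * (1.25 + 0.10 * (calls - 1)) * price_per_mtok / 1_000_000
    return uncached, cached

for n in (1, 2, 10, 100):
    raw, opt = effective_cost(5_000, n)
    print(f"{n:>3} calls: uncached ${raw:.4f} | cached ${opt:.4f}")

At 1 call, caching loses money (125% vs. 100%); by 2 calls it already wins; by 100 calls the cached prefix costs roughly a tenth of the uncached one.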


4. Measuring Success: The Usage Response

To build an ROI dashboard (Module 3.5), you must extract the usage metrics from the Anthropic response object.

# Assuming 'response' is from client.messages.create
usage = response.usage

print(f"Input Tokens: {usage.input_tokens}")
print(f"Output Tokens: {usage.output_tokens}")

# The specific caching fields:
cache_creation = getattr(usage, "cache_creation_input_tokens", 0)
cache_read = getattr(usage, "cache_read_input_tokens", 0)

print(f"Newly Cached: {cache_creation}")
print(f"Tokens Saved (Hits): {cache_read}")

5. Architectural Strategy: The "Golden Dataset"

If you have a multi-tenant app, avoid putting user-specific data inside your cached block unless that user is extremely active.

  • Good strategy: cache the industry-standard regulations (shared by all users).
  • Bad strategy: cache individual account stats (they change for every user, so the hit rate across the global pool is low).
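
In code, this strategy is simply about ordering and where the breakpoint lands: shared content goes before the breakpoint, per-user content after it. A sketch under those assumptions (SHARED_REGULATIONS and the helper name are illustrative):

def build_request(user_stats: str, question: str) -> dict:
    return {
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SHARED_REGULATIONS,  # identical for every tenant -> high hit rate
                "cache_control": {"type": "ephemeral"},  # breakpoint ends the cached prefix
            },
            {
                "type": "text",
                # Per-user data stays OUTSIDE the cached block.
                "text": f"Account context: {user_stats}",
            },
        ],
        "messages": [{"role": "user", "content": question}],
    }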


6. Throughput Benefits on Sonnet 3.5

Using cached prompts on Anthropic also significantly reduces latency. Because Claude doesn't have to re-process the 5,000 tokens of the PDF, the "Time to First Token" (TTFT) can drop from around 3 seconds to under 400 ms.
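
You can verify the TTFT improvement yourself with the SDK's streaming interface. A minimal sketch (the timing logic is ours; the request mirrors the cached example above, and you would run it twice to compare the cold write against the warm hit):

import time
import anthropic

client = anthropic.Anthropic()

start = time.perf_counter()
with client.messages.stream(
    model="claude-3-5-sonnet-20240620",
    max_tokens=256,
    system=[{
        "type": "text",
        "text": "[... 5000 tokens of PDF text ...]",
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "When does this contract expire?"}],
) as stream:
    for text in stream.text_stream:
        # First chunk received: the second (cached) run should be much faster.
        print(f"TTFT: {time.perf_counter() - start:.3f}s")
        break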


7. Summary and Key Takeaways

  1. 1,024 Token Floor: Use caching for large documents and complex system prompts.
  2. The 25% Premium: Caching is for repetitive queries. Don't use it for one-offs.
  3. 4 Breakpoints: Be strategic about where you stop the cache (Identity -> Context -> History).
  4. Metrics Tracking: Always log cache_read_input_tokens to calculate your real-world savings.

In the next lesson, Caching on AWS Bedrock, we look at how to achieve these same savings in the Amazon ecosystem.


Exercise: The Breakpoint Plan

  1. Look at your most complex agent. It currently sends:
    • 200 tokens (Rules)
    • 4000 tokens (Knowledge Base)
    • 500 tokens (Last 2 Messages)
    • 100 tokens (Current Query)
  2. Where would you place your 4 breakpoints to maximize the hit rate?
  3. Calculate the cost of 5 messages if input tokens cost $3 per million.
  • (Hint: Without caching, you pay for the 4,200 tokens of Rules/KB 5 times. With caching, you pay for them at 1.25x once, then at 0.1x 4 times. A worked sketch follows.)
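
For reference, here is the hint's arithmetic worked through, focusing only on the 4,200-token Rules + Knowledge Base prefix (the dynamic history and query are paid at full rate either way):

PRICE = 3.00 / 1_000_000  # $ per input token
PREFIX = 4_200            # Rules (200) + Knowledge Base (4000)

uncached = 5 * PREFIX * PRICE                # paid in full 5 times
cached = PREFIX * (1.25 + 4 * 0.10) * PRICE  # 1 cache write + 4 cache hits

print(f"Uncached: ${uncached:.4f}")  # $0.0630
print(f"Cached:   ${cached:.4f}")    # $0.0208 -- roughly a 67% saving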

Congratulations on completing Module 5 Lesson 2! You are now a Claude Efficiency expert.
