Handling Long Contexts

Claude 3.5 Sonnet supports a context window of up to 200,000 tokens. To put that in perspective, that is roughly 150,000 words, or about 500 pages of text. While this opens up massive RAG possibilities, it also presents new challenges.

When to Use Long Context RAG

Legal Discovery: Loading 50 contracts into a single prompt for cross-comparison.
Deep Code Analysis: Ingesting an entire repository to find a bug.
Historical Summarization: Analyzing a decade of quarterly reports in one go.

Operational Challenges

1. Latency

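A 100k-token prompt can take on the order of 20-30 seconds before the first output token arrives (Time to First Token), with the exact figure depending on model and load. That is usually too slow for "Interactive Chat" but acceptable for "Analytical Reports" that run in the background.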

2. Context Caching (Critical)

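Anthropic's Prompt Caching lets you cache a long, unchanging prompt prefix (for example, the first 100k tokens) so that follow-up requests reuse it instead of reprocessing it. Cached tokens are not free: writing the cache costs somewhat more than normal input, and each later read is billed at a small fraction of the normal input price. The cache is short-lived (about five minutes by default, refreshed on each hit), so it pays off for bursts of queries over the same context.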

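Setup: Mark the end of the reusable context block with a cache_control flag; everything up to and including that block is cached.
Benefit: Per Anthropic, up to 90% lower cost and up to 85% lower latency for repeated queries over long prompts.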
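
To make the cost benefit concrete, here is a minimal sketch of the arithmetic. The per-million-token prices are assumptions based on the published Claude 3.5 Sonnet list prices at the time of writing (base input, cache write at 1.25x, cache read at 0.1x); the function name and parameters are illustrative, and the model assumes every follow-up query arrives while the cache is still alive.

```python
# Assumed prices in USD per million input tokens; verify against current pricing.
BASE_INPUT = 3.00
CACHE_WRITE = 3.75  # assumed 1.25x base, paid once when the cache is written
CACHE_READ = 0.30   # assumed 0.10x base, paid on every later cache hit


def input_cost(context_tokens: int, question_tokens: int,
               queries: int, cached: bool) -> float:
    """Total input cost in USD for `queries` (>= 1) requests sharing one context."""
    per_token = 1 / 1_000_000
    if not cached:
        # Every request pays full price for the whole prompt.
        return queries * (context_tokens + question_tokens) * BASE_INPUT * per_token
    # First request writes the cache; the rest read it.
    first = context_tokens * CACHE_WRITE + question_tokens * BASE_INPUT
    later = (queries - 1) * (context_tokens * CACHE_READ
                             + question_tokens * BASE_INPUT)
    return (first + later) * per_token


if __name__ == "__main__":
    for cached in (False, True):
        cost = input_cost(100_000, 200, queries=20, cached=cached)
        print(f"cached={cached}: ${cost:.3f}")
```

Under these assumed prices, 20 questions over a 100k-token context cost about $6.01 in input tokens without caching and about $0.96 with it. The saving approaches the 90% ceiling only as the number of cache hits grows, because the first request pays the write premium.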

3. Context Window Saturation

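If you fill the window to around 90% capacity, the model's recall and reasoning accuracy may dip. As a rule of thumb, target 50-70% occupancy for mission-critical tasks, and remember that the response also has to fit in the window.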
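
A quick budget check along these lines can run before each request. This is a sketch under stated assumptions: the 4-characters-per-token ratio is a rough heuristic for English text, not a tokenizer, and the function names and the 4,000-token output reservation are hypothetical choices for illustration.

```python
CONTEXT_WINDOW = 200_000  # tokens, the window size quoted in this lesson
CHARS_PER_TOKEN = 4       # rough heuristic for English prose (assumption)


def estimate_tokens(text: str) -> int:
    """Crude estimate; use a real token counter when accuracy matters."""
    return len(text) // CHARS_PER_TOKEN + 1


def occupancy(prompt_tokens: int, reserved_output_tokens: int = 4_000) -> float:
    """Fraction of the window taken by the prompt plus space held for the reply."""
    return (prompt_tokens + reserved_output_tokens) / CONTEXT_WINDOW


def within_budget(prompt_tokens: int, target: float = 0.70) -> bool:
    """True if the request stays at or below the target occupancy."""
    return occupancy(prompt_tokens) <= target


if __name__ == "__main__":
    prompt = "lorem ipsum " * 50_000  # stand-in for a large document set
    tokens = estimate_tokens(prompt)
    print(tokens, f"{occupancy(tokens):.0%}", within_budget(tokens))
```

If the check fails, the options discussed below (summaries instead of full documents, or chunked generation) bring the prompt back under the target.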

Best Practices for Long Prompts

Structure Clearly: Use headers, page numbers, and ID tags so the model can point to where an answer came from.
Summarize Chunks: If you have 500 documents, don't send all of them. Send the top 50 in full and summaries of the rest.
Chunked Generation: If you need to summarize 200k tokens, don't ask for one summary. Use a "sliding window" or "hierarchical" summarization approach.
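
The "hierarchical" approach can be sketched as follows. This is an illustration, not a definitive implementation: `summarize` is a hypothetical callable that you supply (for example, a wrapper around a model call), chunks are split naively by character count, and nothing here checks that a group of summaries fits in the context window.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_chars: int) -> List[str]:
    """Naive fixed-size split; a real pipeline would split on document boundaries."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def hierarchical_summary(text: str,
                         summarize: Callable[[str], str],
                         chunk_chars: int = 40_000,
                         group_size: int = 5) -> str:
    """Summarize each chunk, then summarize groups of summaries until one remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each round shrinks the list")
    # First level: one summary per chunk.
    level = [summarize(chunk) for chunk in split_into_chunks(text, chunk_chars)]
    # Higher levels: merge summaries in groups until a single summary is left.
    while len(level) > 1:
        level = [summarize("\n\n".join(level[i:i + group_size]))
                 for i in range(0, len(level), group_size)]
    return level[0] if level else ""


if __name__ == "__main__":
    # Stand-in summarizer that just truncates, so the sketch runs offline.
    fake_summarize = lambda s: s[:60]
    print(hierarchical_summary("Quarterly report text. " * 10_000, fake_summarize))
```

Each call sees only one chunk or one small group of summaries, so no single request approaches the window limit, at the price of more requests overall.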

Implementation Example: Mapping large files

{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 1024,
  "system": "Analyze the following documentation repository.",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "File: main.py\n(10,000 lines of code...)",
          "cache_control": {"type": "ephemeral"}
        },
        {
          "type": "text",
          "text": "Where is the auth logic?"
        }
      ]
    }
  ]
}
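
For more than one file, the same payload can be assembled programmatically. The sketch below is one possible way to do it, assuming the Messages API request shape shown above; `build_request`, the `<document>` tag format, and the default model name are illustrative choices, not part of any official API.

```python
from typing import Dict, List


def build_request(files: Dict[str, str], question: str,
                  model: str = "claude-3-5-sonnet-20241022") -> dict:
    """Build a request body with tagged files and one cache breakpoint."""
    blocks: List[dict] = []
    for i, (path, source) in enumerate(files.items(), start=1):
        # ID tags let the model refer back to a specific file.
        blocks.append({
            "type": "text",
            "text": f'<document id="{i}" path="{path}">\n{source}\n</document>',
        })
    if blocks:
        # Breakpoint on the last file: the prefix up to here is cacheable.
        blocks[-1]["cache_control"] = {"type": "ephemeral"}
    # The question stays outside the cached prefix so it can change freely.
    blocks.append({"type": "text", "text": question})
    return {
        "model": model,
        "max_tokens": 1024,
        "system": "Analyze the following documentation repository.",
        "messages": [{"role": "user", "content": blocks}],
    }


if __name__ == "__main__":
    request = build_request(
        {"main.py": "print('hi')", "auth.py": "def login(): ..."},
        "Where is the auth logic?",
    )
    print(request["messages"][0]["content"][-1]["text"])
```

Keeping the files in a stable order matters: the cache matches on an identical prefix, so reordering or editing any file before the breakpoint results in a cache miss. If you use the official Python SDK, the resulting dictionary should be passable as keyword arguments to its message-creation call, but check the SDK documentation for your version.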

Exercises

Calculate the token count of your favorite book. Would it fit in Claude's window?
How does "Prompt Caching" change the ROI (Return on Investment) of a RAG system?
What is a "Map-Reduce" strategy for summarization?