Context Window Optimization

Strategies for fitting the most relevant information into the LLM's limited context window without losing meaning.

While modern LLMs like Claude 3.5 Sonnet have massive context windows (200k+ tokens), efficiency and accuracy are still paramount. Flooding the model with irrelevant context ("context pollution") increases cost, latency, and the risk of hallucination.

The "Needle in a Haystack" Problem

Research shows that retrieval accuracy is highest for facts placed at the beginning and end of the context, and degrades for facts buried in the middle. This is known as the "Lost in the Middle" phenomenon, and it worsens as the context grows.

Optimization Strategies

1. Token Counting

Count your tokens before sending a request so you stay under the model's limit and avoid silent truncation.

import tiktoken

# cl100k_base is an OpenAI encoding; it approximates but does not exactly
# match Claude's tokenizer, so leave some headroom in your token budget.
enc = tiktoken.get_encoding("cl100k_base")
num_tokens = len(enc.encode(context))

2. Information Compression

Instead of sending the full document, send only the most relevant snippets found during retrieval.
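A minimal sketch of this idea: greedily pack the highest-scoring retrieved snippets into a fixed token budget instead of sending whole documents. The `pack_snippets` helper and its whitespace-based token estimate are illustrative assumptions; in production, count tokens with a real tokenizer.

```python
def pack_snippets(snippets, budget=1000):
    """snippets: list of (score, text) pairs; returns texts that fit the budget."""
    picked, used = [], 0
    for score, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = len(text.split())  # rough token proxy; swap in a real tokenizer
        if used + cost > budget:
            continue  # skip snippets that would overflow the budget
        picked.append(text)
        used += cost
    return picked

snippets = [
    (0.91, "Claude supports a 200k-token context window."),
    (0.35, "Unrelated marketing copy about our product."),
    (0.78, "Lost-in-the-middle: retrieval accuracy drops for mid-context facts."),
]
# With a tight budget, the low-scoring marketing snippet is dropped.
print(pack_snippets(snippets, budget=15))
```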

3. Rank-Based Pruning

Include only documents whose re-ranker score exceeds a threshold (e.g., > 0.4). Quality over quantity.
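In code, this is a one-line filter. The document shape and the 0.4 cutoff below are illustrative; tune the threshold against your own evaluation set.

```python
def prune_by_score(docs, threshold=0.4):
    """Keep only documents whose re-ranker score clears the threshold."""
    return [d for d in docs if d["score"] > threshold]

docs = [
    {"id": "a", "score": 0.82},
    {"id": "b", "score": 0.41},
    {"id": "c", "score": 0.12},
]
print(prune_by_score(docs))  # only ids "a" and "b" survive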
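In code, this is a one-line filter. The document shape and the 0.4 cutoff below are illustrative; tune the threshold against your own evaluation set.

```python
def prune_by_score(docs, threshold=0.4):
    """Keep only documents whose re-ranker score clears the threshold."""
    return [d for d in docs if d["score"] > threshold]

docs = [
    {"id": "a", "score": 0.82},
    {"id": "b", "score": 0.41},
    {"id": "c", "score": 0.12},
]
print(prune_by_score(docs))  # only ids "a" and "b" survive
```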

4. Semantic Filtering

If you have multiple chunks that say the same thing (redundancy), keep only the highest-ranked one to save space.
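A dependency-free sketch of redundancy removal: walk chunks from best-ranked to worst and keep each one only if it is not too similar to something already kept. Jaccard word overlap stands in here for embedding cosine similarity; in production, compare embeddings instead.

```python
import re

def jaccard(a, b):
    # Word-set overlap as a cheap stand-in for semantic similarity.
    sa = set(re.findall(r"\w+", a.lower()))
    sb = set(re.findall(r"\w+", b.lower()))
    return len(sa & sb) / len(sa | sb)

def dedupe(chunks, threshold=0.8):
    """chunks: (score, text) pairs; keeps the highest-ranked copy of duplicates."""
    kept = []
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if all(jaccard(text, k) < threshold for _, k in kept):
            kept.append((score, text))
    return kept

chunks = [
    (0.9, "The context window is 200k tokens."),
    (0.7, "The context window is 200k tokens!"),  # near-duplicate, lower rank
    (0.6, "Place key facts at the start or end."),
]
print(dedupe(chunks))  # the lower-ranked duplicate is dropped
```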

Handling Long Contexts in Claude

Claude is exceptionally good at long contexts, but for production RAG:

  • Place the most critical information near the top or the bottom of the context.
  • Use XML-style tags to separate documents, as Claude is specifically trained to recognize them.
<documents>
  <document index="1">...</document>
  <document index="2">...</document>
</documents>
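Both recommendations can be combined in one formatting step: wrap each snippet in the XML structure above, and reorder so the two best-ranked snippets sit at the top and bottom of the context. The `format_documents` helper and its specific reordering scheme are assumptions for illustration, not an official API.

```python
from xml.sax.saxutils import escape

def format_documents(texts):
    """texts: snippets ordered best-first. Puts #1 first and #2 last so the
    strongest evidence avoids the lost-in-the-middle zone."""
    if len(texts) > 2:
        ordered = [texts[0]] + texts[2:] + [texts[1]]
    else:
        ordered = texts
    docs = "\n".join(
        f'  <document index="{i}">{escape(t)}</document>'
        for i, t in enumerate(ordered, start=1)
    )
    return f"<documents>\n{docs}\n</documents>"

print(format_documents(["best", "second", "third", "fourth"]))
```

Escaping the snippet text matters: retrieved documents may themselves contain angle brackets that would otherwise break the tag structure.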

Exercises

  1. Calculate the cost of sending 50,000 tokens to Claude Sonnet vs. 5,000 tokens.
  2. If your Re-ranker returns 20 documents, but the context window is full after document 8, what should you do?
  3. Why does "Summarizing" a chunk sometimes work better than "Truncating" it?
