Context Window Constraints

Understand the hard and soft limits of LLM context windows and how they impact RAG quality.

Every LLM has a "Context Window": the maximum number of tokens it can process in a single request. In RAG, this window is shared among the system prompt, the conversation history, the user query, and the retrieved documents.

Hard Limits vs. Effective Limits

Hard Limit

Claude 3.5 Sonnet, for example, has a 200,000-token limit. Token 200,001 is either silently dropped or, more commonly, the API rejects the over-length request outright.
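
It is worth a pre-flight check before sending anything. A minimal sketch in Python, assuming a rough heuristic of ~4 characters per token (real tokenizers vary, so use your provider's token counter in production) and remembering that the window typically must also hold the model's output:

```python
HARD_LIMIT = 200_000  # e.g., Claude 3.5 Sonnet's context window


def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English prose."""
    return len(text) // 4


def fits_in_window(prompt: str, max_output_tokens: int = 4_096) -> bool:
    """Reserve headroom for the model's response, not just the input."""
    return estimate_tokens(prompt) + max_output_tokens <= HARD_LIMIT
```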

Effective Limit (Soft Limit)

As the context grows, the model may become:

  • Slower: Time to First Token (TTFT) increases.
  • Dumber: It can miss details buried in the middle of long documents (the "Lost in the Middle" effect).
  • More Expensive: Costs scale linearly with token count.

Token Allocation in RAG

A healthy RAG prompt usually looks like this:

  • System Prompt: 500 - 1,000 tokens (Rules, Tone, Format).
  • Retrieved Context: 2,000 - 10,000 tokens (Document Chunks).
  • History: 1,000 - 5,000 tokens (Past User Interactions).
  • Current Query: 100 - 500 tokens.
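
In code, that budget can be made explicit so over-allocation fails fast. A minimal sketch with illustrative numbers (the field sizes are assumptions, not prescriptions):

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Illustrative per-section budgets; tune for your model and use case."""
    system_prompt: int = 1_000
    retrieved_context: int = 10_000
    history: int = 5_000
    current_query: int = 500

    def total(self) -> int:
        return (self.system_prompt + self.retrieved_context
                + self.history + self.current_query)


budget = TokenBudget()
assert budget.total() <= 200_000  # stays well inside the hard limit
print(f"Planned prompt size: {budget.total():,} tokens")  # 16,500 tokens
```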

Strategies for Over-Capacity

When your retrieved docs exceed your target window:

  1. Truncation: Throw away the lowest-ranked documents (see the sketch after this list).
  2. Summarization: Use a cheap model to condense the docs.
  3. Map-Reduce: Process chunks in batches and then combine the summaries.
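
Truncation is the simplest of the three. A minimal sketch, assuming the retriever returns documents already sorted best-first and reusing the rough ~4-characters-per-token heuristic from above:

```python
def truncate_to_budget(docs: list[str], budget_tokens: int) -> list[str]:
    """Keep the highest-ranked docs that fit the budget; drop the rest.

    Assumes `docs` is sorted by descending relevance score.
    """
    kept, used = [], 0
    for doc in docs:
        cost = len(doc) // 4  # same rough chars-per-token heuristic as earlier
        if used + cost > budget_tokens:
            break  # every remaining doc is lower-ranked, so stop here
        kept.append(doc)
        used += cost
    return kept
```

Stopping at the first over-budget document keeps ordering simple; a greedy variant could keep scanning for smaller, lower-ranked docs that still fit.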

The Cost of Scale

Token Count    Prompt Cost (approx.)    Latency (approx.)
1k             $0.003                   < 1s
10k            $0.03                    2s
100k           $0.30                    10s+
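
Those figures fall out of simple per-token arithmetic. A sketch assuming an illustrative input price of $3 per million tokens (check your provider's current pricing):

```python
PRICE_PER_MTOK = 3.00  # assumed input price, USD per million tokens


def prompt_cost(tokens: int) -> float:
    """Input cost scales linearly with prompt size."""
    return tokens / 1_000_000 * PRICE_PER_MTOK


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> ${prompt_cost(n):.3f}")
# Prints $0.003, $0.030, $0.300, matching the table above
```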

Exercises

  1. If you are building a tool for a customer with a limited budget, what is your hard cap for context tokens?
  2. Why is "Prompt Caching" (available in the Anthropic API) a game-changer for long-context RAG?
  3. What happens to the user experience if the RAG system takes 15 seconds to respond because the context is too large?
