
Avoiding Context Pollution
Techniques for ensuring only relevant, high-quality data enters your generation prompt.
Context Pollution occurs when irrelevant, contradictory, or low-quality information is included in the prompt, leading the LLM to provide incorrect or confusing answers.
Sources of Pollution
- Poor Retrieval: The vector DB returned a document that "sounds" similar but is about a different topic.
- Boilerplate: Advertisements, navigation menus, or disclaimer text from a website.
- Outdated Info: Old versions of a document that contradict the new version.
- Formatting Noise: Uncleaned HTML tags, OCR gibberish, or malformed JSON.
Strategies for Prevention
1. The Similarity Threshold
Never include a document in your context unless its similarity score is above a safe threshold (e.g., > 0.70).
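A minimal sketch of this gate, assuming the retriever hands back (text, score) pairs with cosine similarity in [0, 1]; the 0.70 cutoff and the shape of the results are illustrative, not a specific library's API.

```python
# Hypothetical retrieval results: (chunk_text, cosine_similarity) pairs.
SIMILARITY_THRESHOLD = 0.70  # tune per embedding model and corpus

def filter_by_score(results: list[tuple[str, float]]) -> list[str]:
    """Keep only chunks whose similarity score clears the threshold."""
    return [text for text, score in results if score > SIMILARITY_THRESHOLD]

results = [
    ("Refund policy: customers may return items within 30 days.", 0.83),
    ("Our 2022 holiday gift guide.", 0.41),  # below threshold, dropped
]
context_chunks = filter_by_score(results)
```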
2. Contextual De-noising
Before building the prompt, run a regex or a simple classifier to strip out URLs, email addresses, or common navigation items.
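A rough sketch of the regex approach; the patterns below are placeholders and would need to be extended with the boilerplate phrases that actually appear in your corpus.

```python
import re

# Illustrative patterns -- adapt to the sites and documents you ingest.
URL_PATTERN = re.compile(r"https?://\S+")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
NAV_PATTERN = re.compile(
    r"^(Home|About Us|Contact|Subscribe to our newsletter)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def denoise(chunk: str) -> str:
    """Strip URLs, email addresses, and common navigation lines from a chunk."""
    for pattern in (URL_PATTERN, EMAIL_PATTERN, NAV_PATTERN):
        chunk = pattern.sub("", chunk)
    # Collapse the blank lines left behind by the removals.
    return re.sub(r"\n{2,}", "\n", chunk).strip()
```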
3. Diversity Sampling
If you retrieve five documents that are 98% identical, keep only one: including five copies of the same text "pollutes" the window with redundant tokens.
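One way to do this is a greedy pass that keeps a chunk only if it is not too similar to anything already kept; the 0.95 cutoff and the plain cosine comparison below are illustrative defaults, not fixed values.

```python
import numpy as np

def deduplicate(chunks: list[str], embeddings: np.ndarray,
                max_similarity: float = 0.95) -> list[str]:
    """Greedily keep chunks that are not near-duplicates of already-kept ones."""
    kept: list[int] = []
    # Normalize rows so a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    for i in range(len(chunks)):
        if all(normed[i] @ normed[j] < max_similarity for j in kept):
            kept.append(i)
    return [chunks[i] for i in kept]
```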
4. Semantic Filtering (Post-Retrieval)
Use a smaller model (like Haiku) to quickly scan retrieved chunks:
"Is this document relevant to the user query? Reply with Yes or No."
Impact on Output Quality
Polluted context leads to "distraction" failures, where the model latches onto a minor detail in a noisy document instead of the main fact in the high-quality one.
| Context State | Answer Accuracy | Hallucination Risk |
|---|---|---|
| Clean / Precise | High | Low |
| Redundant | Medium | Low |
| Noisy / Irrelevant | Low | High |
Exercises
- Why might an LLM focus on an advertisement embedded in a retrieved web page?
- How would you handle a situation where two retrieved documents directly contradict each other?
- What is the benefit of adding an instruction to your system prompt telling the model to ignore irrelevant information?