
Avoiding Context Pollution
Techniques for ensuring only relevant, high-quality data enters your generation prompt.
Context Pollution occurs when irrelevant, contradictory, or low-quality information is included in the prompt, leading the LLM to provide incorrect or confusing answers.
Sources of Pollution
- Poor Retrieval: The vector DB returned a document that "sounds" similar but is about a different topic.
- Boilerplate: Advertisements, navigation menus, or disclaimer text from a website.
- Outdated Info: Old versions of a document that contradict the new version.
- Formatting Noise: Uncleaned HTML tags, OCR gibberish, or malformed JSON.
Strategies for Prevention
1. The Similarity Threshold
Never include a document in your context unless its similarity score is above a safe threshold (e.g., > 0.70).
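A minimal sketch of this gate, assuming the retriever hands back (text, score) pairs with cosine similarity in [0, 1]; the 0.70 cutoff and the shape of the results are illustrative, not a specific library's API.

```python
# Hypothetical retrieval results: (chunk_text, cosine_similarity) pairs.
SIMILARITY_THRESHOLD = 0.70  # tune per embedding model and corpus

def filter_by_score(results: list[tuple[str, float]]) -> list[str]:
    """Keep only chunks whose similarity score clears the threshold."""
    return [text for text, score in results if score > SIMILARITY_THRESHOLD]

results = [
    ("Refund policy: customers may return items within 30 days.", 0.83),
    ("Our 2022 holiday gift guide.", 0.41),  # below threshold, dropped
]
context_chunks = filter_by_score(results)
```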
2. Contextual De-noising
Before building the prompt, run a regex or a simple classifier to strip out URLs, email addresses, or common navigation items.
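A rough sketch of the regex approach; the patterns below are placeholders and would need to be extended with the boilerplate phrases that actually appear in your corpus.

```python
import re

# Illustrative patterns -- adapt to the sites and documents you ingest.
URL_PATTERN = re.compile(r"https?://\S+")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
NAV_PATTERN = re.compile(
    r"^(Home|About Us|Contact|Subscribe to our newsletter)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def denoise(chunk: str) -> str:
    """Strip URLs, email addresses, and common navigation lines from a chunk."""
    for pattern in (URL_PATTERN, EMAIL_PATTERN, NAV_PATTERN):
        chunk = pattern.sub("", chunk)
    # Collapse the blank lines left behind by the removals.
    return re.sub(r"\n{2,}", "\n", chunk).strip()
```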
3. Diversity Sampling
If you retrieve five documents that are 98% identical, keep only one: including five copies of the same text "pollutes" the window with redundant tokens.
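One way to do this is a greedy pass that keeps a chunk only if it is not too similar to anything already kept; the 0.95 cutoff and the plain cosine comparison below are illustrative defaults, not fixed values.

```python
import numpy as np

def deduplicate(chunks: list[str], embeddings: np.ndarray,
                max_similarity: float = 0.95) -> list[str]:
    """Greedily keep chunks that are not near-duplicates of already-kept ones."""
    kept: list[int] = []
    # Normalize rows so a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    for i in range(len(chunks)):
        if all(normed[i] @ normed[j] < max_similarity for j in kept):
            kept.append(i)
    return [chunks[i] for i in kept]
```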
4. Semantic Filtering (Post-Retrieval)
Use a smaller model (like Haiku) to quickly scan retrieved chunks:
"Is this document relevant to the user query? Reply with Yes or No."
Impact on Output Quality
Polluted context leads to "distraction" failures, where the model latches onto a minor detail in a noisy document instead of the main fact in the high-quality one.
| Context State | Answer Accuracy | Hallucination Risk |
|---|---|---|
| Clean / Precise | High | Low |
| Redundant | Medium | Low |
| Noisy / Irrelevant | Low | High |
Exercises
- Why might an LLM focus on an advertisement embedded in a retrieved web page?
- How would you handle a situation where two retrieved documents directly contradict each other?
- What is the benefit of adding an instruction to your system prompt telling the model to ignore irrelevant information?