Avoiding Context Pollution

Techniques for ensuring only relevant, high-quality data enters your generation prompt.

Context Pollution occurs when irrelevant, contradictory, or low-quality information is included in the prompt, leading the LLM to provide incorrect or confusing answers.

Sources of Pollution

  1. Poor Retrieval: The vector DB returned a document that "sounds" similar but is about a different topic.
  2. Boilerplate: Advertisements, navigation menus, or disclaimer text from a website.
  3. Outdated Info: Old versions of a document that contradict the new version.
  4. Formatting Noise: Uncleaned HTML tags, OCR gibberish, or malformed JSON.
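
For the outdated-info case in particular, one possible guard is to drop every chunk that comes from a superseded version of its document. The sketch below assumes each chunk carries a `doc_id` and a numeric `version` in its metadata; both field names are hypothetical and would need to match whatever your ingestion pipeline records.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str   # hypothetical metadata field: which document this came from
    version: int  # hypothetical metadata field: higher means newer
    text: str


def drop_stale_versions(chunks: list[Chunk]) -> list[Chunk]:
    """Keep only chunks that belong to the newest version of each document."""
    newest: dict[str, int] = {}
    for chunk in chunks:
        newest[chunk.doc_id] = max(newest.get(chunk.doc_id, chunk.version), chunk.version)
    # Preserve the original retrieval order among the surviving chunks.
    return [c for c in chunks if c.version == newest[c.doc_id]]


retrieved = [
    Chunk("refund-policy", 1, "Refunds are issued within 30 days."),
    Chunk("refund-policy", 2, "Refunds are issued within 14 days."),
    Chunk("faq", 1, "Support is available on weekdays."),
]
fresh = drop_stale_versions(retrieved)
print([c.text for c in fresh])
```

This only helps when versions are tracked at ingestion time; without that metadata, contradictions between old and new text have to be caught some other way.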

Strategies for Prevention

1. The Similarity Threshold

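Never include a document in your context unless its similarity score is above a safe threshold (e.g., cosine similarity > 0.70). The right cutoff depends on your embedding model and distance metric, so tune it on your own data rather than treating 0.70 as universal.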
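
As a minimal sketch of this filter, assume the retriever returns each chunk with a score where higher means more similar. The `Hit` structure and the `filter_by_threshold` helper are illustrative names, not part of any particular vector DB client.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A retrieved chunk plus its score (hypothetical structure)."""
    text: str
    score: float  # assumed: cosine similarity, higher is more similar


def filter_by_threshold(hits: list[Hit], threshold: float = 0.70) -> list[Hit]:
    """Keep only hits scoring strictly above the threshold, best first."""
    kept = [h for h in hits if h.score > threshold]
    return sorted(kept, key=lambda h: h.score, reverse=True)


hits = [
    Hit("Refund policy for monthly plans", 0.74),
    Hit("Office holiday schedule", 0.55),
    Hit("Refund policy for annual plans", 0.82),
]
relevant = filter_by_threshold(hits)
print([h.text for h in relevant])
```

If nothing clears the threshold, the function returns an empty list, which is usually a better signal to the application than padding the prompt with weak matches.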

2. Contextual De-noising

Before building the prompt, run a regex or a simple classifier to strip out URLs, email addresses, or common navigation items.
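
A regex pass along these lines might look like the following sketch. The patterns are deliberately simple, and the `NAV_LINES` set is a hypothetical example; in practice you would build it from the sites you actually ingest.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Hypothetical navigation labels; a line is dropped only if it matches exactly.
NAV_LINES = {"home", "about", "contact", "login", "sign up", "privacy policy"}


def denoise(text: str) -> str:
    """Strip URLs, email addresses, and navigation-only lines from a chunk."""
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)
    cleaned = []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()  # collapse leftover whitespace
        if line and line.lower() not in NAV_LINES:
            cleaned.append(line)
    return "\n".join(cleaned)


raw = (
    "Home\n"
    "About\n"
    "Refunds are issued within 14 days.\n"
    "More at https://example.com/refunds\n"
    "Contact: support@example.com"
)
clean = denoise(raw)
print(clean)
```

Note that the pass leaves stubs such as "More at" behind. Regex cleaning is coarse, which is one reason to pair it with a classifier or a later relevance check.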

3. Diversity Sampling

If you retrieve 5 documents that are 98% identical, only keep one. Including 5 copies of the same text "pollutes" the window with redundant tokens.
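
One way to sketch this, assuming chunks arrive ranked best-first, is a greedy pass that keeps a chunk only if it is not a near-copy of one already kept. The example uses `difflib.SequenceMatcher` from the standard library as a stand-in for the similarity measure; a real pipeline would more likely compare embeddings or use a hashing scheme such as MinHash.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(chunks: list[str], max_similarity: float = 0.98) -> list[str]:
    """Greedily keep chunks that are not near-copies of an earlier kept chunk."""
    kept: list[str] = []
    for chunk in chunks:
        is_duplicate = any(
            # autojunk=False avoids difflib's heuristic for long sequences.
            SequenceMatcher(None, chunk, seen, autojunk=False).ratio() >= max_similarity
            for seen in kept
        )
        if not is_duplicate:
            kept.append(chunk)
    return kept


original = "Refunds are issued within 14 days of purchase for annual plans, minus any usage fees."
chunks = [
    original,
    original + "\n",   # same text with a trailing newline
    original,          # exact copy
    "Shipping takes 3 to 5 business days.",
]
unique = drop_near_duplicates(chunks)
print(len(unique))
```

The pairwise comparison is quadratic in the number of chunks, which is fine for the handful of documents a retriever typically returns.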

4. Semantic Filtering (Post-Retrieval)

Use a smaller, cheaper model (like Haiku) to quickly scan each retrieved chunk, passing it the user query and the chunk along with a question such as:

"Is this document relevant to the user query? Reply with Yes or No."

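The relevance check above can be sketched as a filter that takes the judge as a plain callable, so it stays independent of any specific model API. `ask_model` and `fake_judge` are hypothetical: in a real system `ask_model` would wrap a call to your small model, while `fake_judge` is only a keyword stub so that the example runs offline.

```python
from typing import Callable

PROMPT_TEMPLATE = (
    "User query: {query}\n\n"
    "Document:\n{chunk}\n\n"
    "Is this document relevant to the user query? Reply with Yes or No."
)


def filter_relevant(
    query: str, chunks: list[str], ask_model: Callable[[str], str]
) -> list[str]:
    """Keep chunks for which the judge's reply starts with 'yes'."""
    kept = []
    for chunk in chunks:
        reply = ask_model(PROMPT_TEMPLATE.format(query=query, chunk=chunk))
        if reply.strip().lower().startswith("yes"):
            kept.append(chunk)
    return kept


def fake_judge(prompt: str) -> str:
    """Offline stand-in for a model call: 'Yes' if the document mentions refunds."""
    document = prompt.split("Document:\n", 1)[1]
    return "Yes" if "refund" in document.lower() else "No"


chunks = [
    "Refunds are issued within 14 days.",
    "Our office is closed on public holidays.",
]
relevant_chunks = filter_relevant("How do I get my money back?", chunks, fake_judge)
print(relevant_chunks)
```

Treating anything other than a clear "yes" as a rejection keeps the filter conservative when the judge returns an unexpected reply.
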
Impact on Output Quality

Polluted context can trigger the "distraction case," in which the model fixates on a minor detail in a noisy document instead of the main fact in a high-quality one.

Context State        | Accurate Answers | Hallucination Risk
Clean / Precise      | High             | Low
Redundant            | Medium           | Low
Noisy / Irrelevant   | Low              | High

Exercises

  1. Why might an LLM focus on an "Ad" in a retrieved web page context?
  2. How would you handle a situation where two retrieved documents directly contradict each other?
  3. What is the benefit of relying on the model's instruction following, for example a system prompt that tells it to ignore irrelevant info?
