Debugging Poor Answers

A systematic guide to diagnosing and fixing low-quality RAG outputs.

When a RAG system gives a bad answer, don't just "change the prompt." Follow this systematic debugging flow to identify the actual root cause.

The Diagnostic Framework

| Symptom | Probable Cause | Fix |
| --- | --- | --- |
| "I don't know" | Retrieval failed to find the doc. | Check chunking or the search k-value. |
| Hallucination | Context found, but the model ignored it. | Increase prompt strictness / verify grounding. |
| Mixed-up facts | Documents in the context contradict each other. | Re-rank by recency or authority. |
| Cut-off answer | Output token limit reached. | Increase max_tokens. |
| Gibberish output | Context window is too "noisy" (junk data). | Improve the cleaning/conditioning step. |
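
Most of these diagnoses start with seeing exactly what the model was given. The sketch below logs the retrieved chunks next to the final answer for a single query; `retrieve` and `generate` are hypothetical stand-ins for your own retrieval and generation calls. An empty or irrelevant retrieval points to the first row of the table, while a full context paired with a wrong answer points to a grounding problem.

```python
# Minimal debug trace for one query. `retrieve` and `generate` are placeholders;
# each chunk is assumed to look like {"id": ..., "text": ..., "score": ...}.
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("rag-debug")

def answer_with_trace(query, retrieve, generate, k=5):
    chunks = retrieve(query, k=k)
    answer = generate(query, [c["text"] for c in chunks])
    log.info(json.dumps({
        "query": query,
        "retrieved": [{"id": c["id"], "score": c["score"]} for c in chunks],
        "context_chars": sum(len(c["text"]) for c in chunks),
        "answer": answer,
    }, indent=2))
    return answer, chunks
```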

The "Golden Question" Set

Create a list of 20-50 "test queries" for which you already know the correct answer. Whenever you make a change (a new chunk size, a different model), re-run every question, as in the sketch after this list.

  • If 5 answers improved but 10 got worse, the change was a net regression.
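
A minimal sketch of such a regression run, assuming a hypothetical `rag_answer(question)` function, a `golden_set.json` file, and a simple keyword-containment check as the pass criterion (swap in an LLM grader or exact match if that fits your answers better):

```python
# golden_set.json: [{"question": "...", "must_contain": ["term1", "term2"]}, ...]
import json

def run_golden_set(rag_answer, path="golden_set.json"):
    """Answer every golden question and record pass/fail per item."""
    with open(path) as f:
        cases = json.load(f)
    results = []
    for case in cases:
        answer = rag_answer(case["question"])
        ok = all(term.lower() in answer.lower() for term in case["must_contain"])
        results.append({"question": case["question"], "passed": ok})
    print(f'{sum(r["passed"] for r in results)}/{len(results)} passed')
    return results

def diff_runs(before, after):
    """Compare two runs (before/after a change) to spot regressions."""
    for b, a in zip(before, after):
        if b["passed"] and not a["passed"]:
            print("REGRESSION:", b["question"])
        elif not b["passed"] and a["passed"]:
            print("improved:  ", b["question"])
```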

Visualizing Attention (Heatmaps)

Some debugging tools let you see which tokens in the context the model focused on while generating the answer.

  • If the model focused on the header instead of the data, your formatting is the problem. The sketch below shows how to pull these weights from an open-weight model.
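
With an open-weight model served through Hugging Face `transformers`, you can inspect the raw attention weights yourself; hosted APIs generally do not expose them. A minimal sketch, using `gpt2` purely as a placeholder model and averaging attention from the final position over all layers and heads:

```python
# Rank the prompt tokens by how much attention the model's final position
# pays to them. Model name and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# "eager" attention is required to return per-head attention weights.
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")

prompt = "Context: Q3 revenue was $4.2M.\nQuestion: What was Q3 revenue?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

attn = torch.stack(out.attentions)               # [layers, batch, heads, seq, seq]
scores = attn[:, 0, :, -1, :].mean(dim=(0, 1))   # attention from the last position
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for tok, score in sorted(zip(tokens, scores.tolist()), key=lambda x: -x[1])[:10]:
    print(f"{score:.3f}  {tok}")
```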

Using a "Critic" Model

Run a model (like Claude 3 Opus) alongside your production model. Ask the Critic:

"Why is the production model's answer inferior to the reference answer? Is it missing a specific chunk of data?"

Exercises

  1. Intentionally provide wrong information in your context. Does the model correct it using its training data, or does it follow the bad context?
  2. What is the impact of spelling errors in the user query on your retrieval accuracy? (See the sketch after this list.)
  3. How do you debug a "visual" RAG system where the model misreads a chart?
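
For exercise 2, a quick way to get a number is to perturb each test query with a typo and compare retrieval hit rates. A minimal sketch, assuming a hypothetical `retrieve(query, k)` that returns a ranked list of chunk ids and a known relevant chunk id per query:

```python
# Measure how much a single typo hurts retrieval.
import random

def add_typo(text, rng=random.Random(0)):
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def typo_hit_rate(queries, relevant_ids, retrieve, k=5):
    clean_hits = noisy_hits = 0
    for query, rel_id in zip(queries, relevant_ids):
        clean_hits += rel_id in retrieve(query, k=k)
        noisy_hits += rel_id in retrieve(add_typo(query), k=k)
    n = len(queries)
    print(f"clean queries : {clean_hits}/{n} hit the relevant chunk")
    print(f"typo'd queries: {noisy_hits}/{n} hit the relevant chunk")
```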
