
Debugging Poor Answers
A systematic guide to diagnosing and fixing low-quality RAG outputs.
When a RAG system gives a bad answer, don't just "change the prompt." Follow this systematic debugging flow to identify the actual root cause.
The Diagnostic Framework
| Symptom | Probable Cause | Fix |
|---|---|---|
| "I don't know" | Retrieval failed to find the doc. | Check chunking or search k-value. |
| Hallucination | Context found, but model ignored it. | Increase prompt strictness / verify grounding. |
| Mixed-up Facts | Documents in context contradict each other. | Re-rank by recency or authority. |
| Cut-off Answer | Output token limit reached. | Increase max_tokens. |
| Gibberish Output | Context window is too "noisy" (junk data). | Improve cleaning/conditioning step. |
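In practice this triage can be a first-pass check over whatever your pipeline logs. The sketch below is only an illustration: the `RAGTrace` fields and the keyword/overlap checks are assumptions about your logging, not a standard interface.

```python
# Minimal triage sketch following the table above. The trace fields
# (question, retrieved_chunks, answer, finish_reason) are assumptions
# about what your pipeline records, not a standard schema.
from dataclasses import dataclass


@dataclass
class RAGTrace:
    question: str
    retrieved_chunks: list[str]
    answer: str
    finish_reason: str = "stop"  # "length" means the output token limit was hit


def triage(trace: RAGTrace) -> str:
    """Map an observed bad answer onto the table's probable causes."""
    if not trace.retrieved_chunks or "i don't know" in trace.answer.lower():
        return "Retrieval problem: check chunking and the retriever's top-k."
    if trace.finish_reason == "length":
        return "Cut-off answer: increase max_tokens."
    # Crude grounding check: does the answer share any terms with the context?
    answer_terms = set(trace.answer.lower().split())
    grounded = any(answer_terms & set(chunk.lower().split())
                   for chunk in trace.retrieved_chunks)
    if not grounded:
        return "Possible hallucination: tighten the prompt and verify grounding."
    return "Context quality problem: look for contradictions or noisy chunks."
```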
The "Golden Question" Set
Create a list of 20-50 "test queries" for which you already know the perfect answer. Whenever you make a change (a new chunk size, a different model), re-run every question in the set.
- If 5 answers improved but 10 got worse, the change was a net regression; a minimal harness sketch follows below.
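The harness can be a couple of functions. The sketch below assumes a JSON golden set plus two callables you supply, `answer_fn` (your RAG pipeline) and `score_fn` (any 0-1 quality metric such as an LLM judge or exact match); all three are assumptions, not a fixed interface.

```python
# Minimal regression harness for a golden-question set.
import json


def run_golden_set(golden_path, answer_fn, score_fn):
    """Score every golden question and return {question: score}."""
    with open(golden_path) as f:
        golden = json.load(f)  # e.g. [{"question": ..., "reference": ...}, ...]
    return {
        item["question"]: score_fn(answer_fn(item["question"]), item["reference"])
        for item in golden
    }


def compare(before, after):
    """Report how many questions improved or regressed after a change."""
    improved = sum(1 for q in before if after[q] > before[q])
    regressed = sum(1 for q in before if after[q] < before[q])
    print(f"improved: {improved}, regressed: {regressed}")
    if regressed > improved:
        print("Net regression: revert the change.")
```

Running `compare` on the scores from before and after a change gives you the "5 improved, 10 regressed" verdict automatically.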
Visualizing Attention (Heatmaps)
Some debugging tools let you see which tokens in the context the model focused on when generating the answer.
- If the model focused on the "header" instead of the "data," your formatting is the problem. A rough inspection sketch follows below.
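Hosted APIs generally do not expose attention weights, so this kind of inspection usually needs a local open-weight model. The sketch below assumes the Hugging Face transformers library and uses "gpt2" only as a stand-in; it prints the context tokens the final position attends to most.

```python
# Rough attention-inspection sketch; "gpt2" is a stand-in for whatever
# open-weight model you can run locally.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)

prompt = (
    "Context:\n| Region | Revenue |\n| EMEA | 4.2M |\n\n"
    "Question: What is the EMEA revenue?\nAnswer:"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Attention of the last position over the whole prompt, averaged across
# the heads of the final layer: (batch, heads, seq, seq) -> (seq,)
weights = outputs.attentions[-1][0, :, -1, :].mean(dim=0)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, w in sorted(zip(tokens, weights.tolist()), key=lambda x: -x[1])[:10]:
    print(f"{w:.3f}  {tok}")
```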
Using a "Critic" Model
Run a model (like Claude 3 Opus) alongside your production model. Ask the Critic:
"Why is the production model's answer inferior to the reference answer? Is it missing a specific chunk of data?"
Exercises
- Intentionally plant wrong information in your context. Does the model correct it using its training data, or does it follow the wrong information?
- What is the impact of spelling errors in the user query on your retrieval accuracy?
- How do you debug a visual (multimodal) RAG system where the model misreads a chart?