
Debugging Poor Answers
A systematic guide to diagnosing and fixing low-quality RAG outputs.
When a RAG system gives a bad answer, don't just "change the prompt." Follow this systematic debugging flow to identify the actual root cause.
The Diagnostic Framework
| Symptom | Probable Cause | Fix |
|---|---|---|
| "I don't know" | Retrieval failed to find the doc. | Check chunking or search k-value. |
| Hallucination | Context found, but model ignored it. | Increase prompt strictness / verify grounding. |
| Mixed-up Facts | Documents in context contradict each other. | Re-rank by recency or authority. |
| Cut-off Answer | Output token limit reached. | Increase max_tokens. |
| Gibberish Output | Context window is too "noisy" (junk data). | Improve cleaning/conditioning step. |
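In practice this triage can be a first-pass check over whatever your pipeline logs. The sketch below is only an illustration: the `RAGTrace` fields and the keyword/overlap checks are assumptions about your logging, not a standard interface.

```python
# Minimal triage sketch following the table above. The trace fields
# (question, retrieved_chunks, answer, finish_reason) are assumptions
# about what your pipeline records, not a standard schema.
from dataclasses import dataclass


@dataclass
class RAGTrace:
    question: str
    retrieved_chunks: list[str]
    answer: str
    finish_reason: str = "stop"  # "length" means the output token limit was hit


def triage(trace: RAGTrace) -> str:
    """Map an observed bad answer onto the table's probable causes."""
    if not trace.retrieved_chunks or "i don't know" in trace.answer.lower():
        return "Retrieval problem: check chunking and the retriever's top-k."
    if trace.finish_reason == "length":
        return "Cut-off answer: increase max_tokens."
    # Crude grounding check: does the answer share any terms with the context?
    answer_terms = set(trace.answer.lower().split())
    grounded = any(answer_terms & set(chunk.lower().split())
                   for chunk in trace.retrieved_chunks)
    if not grounded:
        return "Possible hallucination: tighten the prompt and verify grounding."
    return "Context quality problem: look for contradictions or noisy chunks."
```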
The "Golden Question" Set
Create a list of 20-50 "test queries" for which you already know the perfect answer. Whenever you make a change (a new chunk size, a different model), re-run every question in the set.
- If 5 answers improved but 10 got worse, the change was a net regression; a minimal harness sketch follows below.
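The harness can be a couple of functions. The sketch below assumes a JSON golden set plus two callables you supply, `answer_fn` (your RAG pipeline) and `score_fn` (any 0-1 quality metric such as an LLM judge or exact match); all three are assumptions, not a fixed interface.

```python
# Minimal regression harness for a golden-question set.
import json


def run_golden_set(golden_path, answer_fn, score_fn):
    """Score every golden question and return {question: score}."""
    with open(golden_path) as f:
        golden = json.load(f)  # e.g. [{"question": ..., "reference": ...}, ...]
    return {
        item["question"]: score_fn(answer_fn(item["question"]), item["reference"])
        for item in golden
    }


def compare(before, after):
    """Report how many questions improved or regressed after a change."""
    improved = sum(1 for q in before if after[q] > before[q])
    regressed = sum(1 for q in before if after[q] < before[q])
    print(f"improved: {improved}, regressed: {regressed}")
    if regressed > improved:
        print("Net regression: revert the change.")
```

Running `compare` on the scores from before and after a change gives you the "5 improved, 10 regressed" verdict automatically.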
Visualizing Attention (Heatmaps)
Some debugging tools let you see which tokens in the context the model focused on when generating the answer.
- If the model focused on the "header" instead of the "data," your formatting is the problem. A rough inspection sketch follows below.
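Hosted APIs generally do not expose attention weights, so this kind of inspection usually needs a local open-weight model. The sketch below assumes the Hugging Face transformers library and uses "gpt2" only as a stand-in; it prints the context tokens the final position attends to most.

```python
# Rough attention-inspection sketch; "gpt2" is a stand-in for whatever
# open-weight model you can run locally.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)

prompt = (
    "Context:\n| Region | Revenue |\n| EMEA | 4.2M |\n\n"
    "Question: What is the EMEA revenue?\nAnswer:"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Attention of the last position over the whole prompt, averaged across
# the heads of the final layer: (batch, heads, seq, seq) -> (seq,)
weights = outputs.attentions[-1][0, :, -1, :].mean(dim=0)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, w in sorted(zip(tokens, weights.tolist()), key=lambda x: -x[1])[:10]:
    print(f"{w:.3f}  {tok}")
```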
Using a "Critic" Model
Run a model (like Claude 3 Opus) alongside your production model. Ask the Critic:
"Why is the production model's answer inferior to the reference answer? Is it missing a specific chunk of data?"
Exercises
- Intentionally plant wrong information in your context. Does the model correct it using its training data, or does it follow the wrong information?
- What is the impact of spelling errors in the user query on your retrieval accuracy?
- How do you debug a visual (multimodal) RAG system where the model misreads a chart?