
Continuous Improvement (A/B Testing)
Learn how to iteratively improve your RAG system using systematic testing and evaluation frameworks.
A RAG system is never "finished." Documents change, models improve, and user needs evolve. Continuous improvement is the process of systematically testing new ideas against your current baseline.
The A/B Test Pipeline
- Baseline (A): Your current RAG setup (e.g. 500-token chunks, OpenAI embeddings).
- Challenger (B): Your new idea (e.g. 800-token chunks, Cohere re-ranker).
- The Experiment: Send 10% of traffic to (B); see the routing sketch after this list.
- The Metrics: Compare Answer Accuracy, Latency, and User Satisfaction.
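As a concrete starting point, here is a minimal sketch of how the 10% split might be routed. The function name, variant labels, and hash-based bucketing are illustrative assumptions, not a prescribed implementation; the key property is that each user lands in the same bucket on every request.

```python
import hashlib

def assign_variant(user_id: str, challenger_share: float = 0.10) -> str:
    """Deterministically route a user to baseline (A) or challenger (B).

    Hashing the user ID keeps each user in the same bucket across requests,
    so their experience stays consistent for the whole experiment.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000  # map the hash onto [0, 1)
    return "B" if bucket < challenger_share else "A"

if __name__ == "__main__":
    # Sanity check: roughly 10% of users should land in the challenger bucket.
    assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
    print("share routed to B:", assignments.count("B") / len(assignments))
```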
Automated Evaluation (RAGAS / DeepEval)
You don't need a human to grade every answer. Use an Evaluator LLM.
- Input: Query, Context, and Answer.
- Output: A score from 0-1 based on defined metrics (Faithfulness, Relevance).
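Below is a minimal LLM-as-judge sketch of this idea. It rolls its own prompt rather than using the actual RAGAS or DeepEval APIs, and `call_llm` is a hypothetical stand-in for whatever completion client you use; the prompt wording and metric names simply mirror the faithfulness and relevance metrics above.

```python
import json

JUDGE_PROMPT = """You are grading a RAG answer.
Query: {query}
Retrieved context: {context}
Answer: {answer}

Score each metric from 0 to 1 and reply with JSON only:
{{"faithfulness": <is the answer supported by the context?>,
  "relevance": <does the answer address the query?>}}"""

def evaluate_answer(query: str, context: str, answer: str, call_llm) -> dict:
    """Score one (query, context, answer) triple with an evaluator LLM.

    `call_llm` is any callable that takes a prompt string and returns the
    model's text completion.
    """
    raw = call_llm(JUDGE_PROMPT.format(query=query, context=context, answer=answer))
    scores = json.loads(raw)  # assumes the judge model replies with valid JSON
    return {"faithfulness": float(scores["faithfulness"]),
            "relevance": float(scores["relevance"])}
```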
Iteration Cycles
Weekly
- Update the index with new documents.
- Review "Thumbs Down" logs.
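One way to make the weekly review concrete: the sketch below tallies thumbs-down events per query to surface recurring failures. It assumes feedback is logged as JSON lines with `query` and `rating` fields, which is an illustrative format, not a required schema.

```python
import json
from collections import Counter
from pathlib import Path

def top_failing_queries(log_path: str, top_n: int = 10) -> list[tuple[str, int]]:
    """Count thumbs-down events per query to surface recurring failures.

    Assumes one JSON object per line with at least `query` and `rating`
    fields, e.g. {"query": "...", "rating": "down"}.
    """
    counts = Counter()
    for line in Path(log_path).read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("rating") == "down":
            counts[event["query"].strip().lower()] += 1
    return counts.most_common(top_n)
```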
Monthly
- Benchmark new embedding models (check the MTEB leaderboard).
- Adjust chunking strategies based on user feedback.
Quarterly
- Full system re-evaluation. Is the metadata schema still sufficient?
Experiment Log Example
| Version | Date | Change | Result |
|---|---|---|---|
| v1.0 | Jan 1 | Initial Launch | Baseline |
| v1.1 | Jan 15 | Added Re-ranker | +12% Accuracy |
| v1.2 | Feb 1 | Switched to Markdown | -5% Noise |
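Keeping the same log in a machine-readable form makes it easier to diff and plot over time. The sketch below mirrors the table's columns; the dataclass, field names, and CSV output are illustrative choices, not a required format.

```python
from dataclasses import dataclass, asdict
import csv

@dataclass
class Experiment:
    version: str
    date: str
    change: str
    result: str

LOG = [
    Experiment("v1.0", "Jan 1", "Initial Launch", "Baseline"),
    Experiment("v1.1", "Jan 15", "Added Re-ranker", "+12% Accuracy"),
    Experiment("v1.2", "Feb 1", "Switched to Markdown", "-5% Noise"),
]

def write_log(path: str = "experiment_log.csv") -> None:
    """Persist the experiment log as CSV so it can be versioned and plotted."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["version", "date", "change", "result"])
        writer.writeheader()
        writer.writerows(asdict(e) for e in LOG)
```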
Exercises
- Why should you change only one variable at a time during an A/B test?
- How many queries do you need to run before a result is statistically significant?
- What is a regression test?