
Continuous Improvement (A/B Testing)
Learn how to iteratively improve your RAG system using systematic testing and evaluation frameworks.
A RAG system is never "finished." Documents change, models improve, and user needs evolve. Continuous improvement is the process of systematically testing new ideas against your current baseline.
The A/B Test Pipeline
- Baseline (A): Your current RAG setup (e.g. 500-token chunks, OpenAI embeddings).
- Challenger (B): Your new idea (e.g. 800-token chunks, Cohere re-ranker).
- The Experiment: Send 10% of traffic to (B); see the routing sketch after this list.
- The Metrics: Compare Answer Accuracy, Latency, and User Satisfaction.
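As a concrete starting point, here is a minimal sketch of how the 10% split might be routed. The function name, variant labels, and hash-based bucketing are illustrative assumptions, not a prescribed implementation; the key property is that each user lands in the same bucket on every request.

```python
import hashlib

def assign_variant(user_id: str, challenger_share: float = 0.10) -> str:
    """Deterministically route a user to baseline (A) or challenger (B).

    Hashing the user ID keeps each user in the same bucket across requests,
    so their experience stays consistent for the whole experiment.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000  # map the hash onto [0, 1)
    return "B" if bucket < challenger_share else "A"

if __name__ == "__main__":
    # Sanity check: roughly 10% of users should land in the challenger bucket.
    assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
    print("share routed to B:", assignments.count("B") / len(assignments))
```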
Automated Evaluation (RAGAS / DeepEval)
You don't need a human to grade every answer. Use an Evaluator LLM.
- Input: Query, Context, and Answer.
- Output: A score from 0-1 based on defined metrics (Faithfulness, Relevance).
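Below is a minimal LLM-as-judge sketch of this idea. It rolls its own prompt rather than using the actual RAGAS or DeepEval APIs, and `call_llm` is a hypothetical stand-in for whatever completion client you use; the prompt wording and metric names simply mirror the faithfulness and relevance metrics above.

```python
import json

JUDGE_PROMPT = """You are grading a RAG answer.
Query: {query}
Retrieved context: {context}
Answer: {answer}

Score each metric from 0 to 1 and reply with JSON only:
{{"faithfulness": <is the answer supported by the context?>,
  "relevance": <does the answer address the query?>}}"""

def evaluate_answer(query: str, context: str, answer: str, call_llm) -> dict:
    """Score one (query, context, answer) triple with an evaluator LLM.

    `call_llm` is any callable that takes a prompt string and returns the
    model's text completion.
    """
    raw = call_llm(JUDGE_PROMPT.format(query=query, context=context, answer=answer))
    scores = json.loads(raw)  # assumes the judge model replies with valid JSON
    return {"faithfulness": float(scores["faithfulness"]),
            "relevance": float(scores["relevance"])}
```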
Iteration Cycles
Weekly
- Update the index with new documents.
- Review "Thumbs Down" logs.
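One way to make the weekly review concrete: the sketch below tallies thumbs-down events per query to surface recurring failures. It assumes feedback is logged as JSON lines with `query` and `rating` fields, which is an illustrative format, not a required schema.

```python
import json
from collections import Counter
from pathlib import Path

def top_failing_queries(log_path: str, top_n: int = 10) -> list[tuple[str, int]]:
    """Count thumbs-down events per query to surface recurring failures.

    Assumes one JSON object per line with at least `query` and `rating`
    fields, e.g. {"query": "...", "rating": "down"}.
    """
    counts = Counter()
    for line in Path(log_path).read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("rating") == "down":
            counts[event["query"].strip().lower()] += 1
    return counts.most_common(top_n)
```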
Monthly
- Benchmark new embedding models (check the MTEB leaderboard).
- Adjust chunking strategies based on user feedback.
Quarterly
- Full system re-evaluation. Is the metadata schema still sufficient?
Experiment Log Example
| Version | Date | Change | Result |
|---|---|---|---|
| v1.0 | Jan 1 | Initial Launch | Baseline |
| v1.1 | Jan 15 | Added Re-ranker | +12% Accuracy |
| v1.2 | Feb 1 | Switched to Markdown | -5% Noise |
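Keeping the same log in a machine-readable form makes it easier to diff and plot over time. The sketch below mirrors the table's columns; the dataclass, field names, and CSV output are illustrative choices, not a required format.

```python
from dataclasses import dataclass, asdict
import csv

@dataclass
class Experiment:
    version: str
    date: str
    change: str
    result: str

LOG = [
    Experiment("v1.0", "Jan 1", "Initial Launch", "Baseline"),
    Experiment("v1.1", "Jan 15", "Added Re-ranker", "+12% Accuracy"),
    Experiment("v1.2", "Feb 1", "Switched to Markdown", "-5% Noise"),
]

def write_log(path: str = "experiment_log.csv") -> None:
    """Persist the experiment log as CSV so it can be versioned and plotted."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["version", "date", "change", "result"])
        writer.writeheader()
        writer.writerows(asdict(e) for e in LOG)
```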
Exercises
- Why should you change only one variable at a time during an A/B test?
- How many queries do you need to run before a result is statistically significant?
- What is a regression test?