Continuous Improvement (A/B Testing)

Learn how to iteratively improve your RAG system using systematic testing and evaluation frameworks.

A RAG system is never "finished." Documents change, models improve, and user needs evolve. Continuous improvement is the process of systematically testing new ideas against your current baseline.

The A/B Test Pipeline

  1. Baseline (A): Your current RAG setup (e.g., 500-token chunks, OpenAI embeddings).
  2. Challenger (B): Your new idea (e.g., 800-token chunks, a Cohere re-ranker).
  3. The Experiment: Send 10% of traffic to (B); a deterministic split is sketched below.
  4. The Metrics: Compare Answer Accuracy, Latency, and User Satisfaction across both arms.
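
A minimal sketch of the traffic split in step 3, assuming each request carries a stable user ID. The two pipeline functions are hypothetical placeholders for setups (A) and (B); hashing the user ID keeps each user in the same arm for the whole experiment:

```python
import hashlib

CHALLENGER_TRAFFIC = 0.10  # fraction of traffic routed to the challenger (B)

def answer_with_baseline(query: str) -> str:
    return "baseline answer"    # placeholder for setup A (e.g., 500-token chunks)

def answer_with_challenger(query: str) -> str:
    return "challenger answer"  # placeholder for setup B (e.g., 800-token chunks + re-ranker)

def assign_variant(user_id: str) -> str:
    """Deterministically bucket a user into arm 'A' (baseline) or 'B' (challenger)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash to a value in [0, 1]
    return "B" if bucket < CHALLENGER_TRAFFIC else "A"

def answer(user_id: str, query: str) -> str:
    """Route a request to arm A or B and return the generated answer."""
    if assign_variant(user_id) == "B":
        return answer_with_challenger(query)
    return answer_with_baseline(query)

print(answer("user-42", "How do I reset my password?"))
```

Logging the chosen variant alongside each request lets you compare the step 4 metrics per arm.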

Automated Evaluation (RAGAS / DeepEval)

You don't need a human to grade every answer. Use an Evaluator LLM (a minimal judge sketch follows the list below).

  • Input: Query, Context, and Answer.
  • Output: A score from 0-1 based on defined metrics (Faithfulness, Relevance).
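
A minimal LLM-as-judge sketch, not the actual RAGAS or DeepEval API: `call_llm` is a hypothetical stand-in for whichever model you use as the judge, and the prompt wording is illustrative.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your evaluator-model client; replace with a real call."""
    return '{"faithfulness": 0.9, "relevance": 0.8}'

JUDGE_PROMPT = """You are grading a RAG answer.
Query: {query}
Retrieved context: {context}
Answer: {answer}

Score the answer from 0 to 1 on two metrics:
- faithfulness: is every claim in the answer supported by the context?
- relevance: does the answer actually address the query?

Respond with JSON only, e.g. {{"faithfulness": 0.9, "relevance": 0.8}}."""

def evaluate_answer(query: str, context: str, answer: str) -> dict:
    """Grade one (Query, Context, Answer) triple and return the metric scores."""
    prompt = JUDGE_PROMPT.format(query=query, context=context, answer=answer)
    return json.loads(call_llm(prompt))

print(evaluate_answer("What is our refund window?",
                      "Refunds are accepted within 30 days of purchase.",
                      "You can request a refund within 30 days."))
```

Frameworks like RAGAS and DeepEval package this pattern into standard metrics, so in practice you would feed them a dataset of these triples rather than maintain judge prompts and parsing yourself.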

Iteration Cycles

Weekly

  • Update the index with new documents.
  • Review "Thumbs Down" logs.

Monthly

  • Benchmark new models (check the MTEB leaderboard; a quick retrieval check is sketched below).
  • Adjust chunking strategies based on user feedback.
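
Leaderboard scores are a useful filter, but it's worth checking candidates on your own queries. A minimal sketch, assuming a small gold set that maps each query to its relevant document ID plus the IDs each model retrieves (the document IDs below are made up):

```python
def recall_at_k(retrieved_ids: list[list[str]], relevant_ids: list[str], k: int = 5) -> float:
    """Fraction of queries whose relevant document appears in the top-k results."""
    hits = sum(1 for ret, rel in zip(retrieved_ids, relevant_ids) if rel in ret[:k])
    return hits / len(relevant_ids)

# Hypothetical results for two embedding models over the same 3-query gold set.
gold = ["doc_7", "doc_2", "doc_9"]
current_model = [["doc_7", "doc_1"], ["doc_4", "doc_2"], ["doc_3", "doc_5"]]
candidate_model = [["doc_7", "doc_3"], ["doc_2", "doc_8"], ["doc_9", "doc_1"]]

print(recall_at_k(current_model, gold, k=2))    # 0.67
print(recall_at_k(candidate_model, gold, k=2))  # 1.0
```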

Quarterly

  • Full system re-evaluation. Is the metadata schema still sufficient?

Experiment Log Example

Version   Date     Change                 Result
v1.0      Jan 1    Initial Launch         Baseline
v1.1      Jan 15   Added Re-ranker        +12% Accuracy
v1.2      Feb 1    Switched to Markdown   -5% Noise
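
A lightweight way to keep this log next to your code is a plain CSV. A sketch, assuming a hypothetical `experiment_log.csv` file in the project root; the field names mirror the table above:

```python
import csv
import os
from dataclasses import dataclass, asdict

@dataclass
class ExperimentEntry:
    version: str
    date: str
    change: str
    result: str

def log_experiment(entry: ExperimentEntry, path: str = "experiment_log.csv") -> None:
    """Append one experiment record to the CSV log (path is a hypothetical default)."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["version", "date", "change", "result"])
        if write_header:
            writer.writeheader()
        writer.writerow(asdict(entry))

log_experiment(ExperimentEntry("v1.1", "Jan 15", "Added Re-ranker", "+12% Accuracy"))
```

Committing the log with each change keeps results reviewable next to the code that produced them.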

Exercises

  1. Why should you change only one variable at a time during an A/B test?
  2. How many queries do you need to run before a result is "Statistically Significant"?
  3. What is a "Regression Test"?
