Building a Golden Set: Automated Test Data Generation

Learn how to build high-quality evaluation datasets. Explore synthetic data generation to create 'Golden Question-Answer' pairs from your raw documents.

To run a framework like RAGAS (Lesson 2), you need a Golden Set: a list of real questions and their "Ground Truth" answers. But if you have 10,000 documents, writing 500 high-quality questions by hand could take weeks of expensive human labor.

In this lesson, we explore Synthetic Data Generation. We will learn how to use an LLM to "Read" your documents and generate a variety of questions—simple, complex, and adversarial—that form the benchmark for your vector database.


1. What makes a "Golden" Test Case?

A high-quality test case has three parts (a minimal example row follows the list):

  1. The Question: A realistic user query.
  2. The Context: The specific paragraph(s) where the answer lives.
  3. The Ground Truth: The "Perfect" answer according to your internal documentation.
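For concreteness, here is a minimal sketch of a single row as a Python dict. The field names are an assumption chosen to mirror what frameworks like RAGAS export, not a fixed schema:

# A minimal, illustrative Golden Set row.
golden_row = {
    "question": "When did Company X begin operations?",           # realistic user query
    "contexts": [
        "Company X started in 2005 as a two-person consultancy."
    ],                                                             # paragraph(s) where the answer lives
    "ground_truth": "Company X was founded in 2005.",              # the 'perfect' answer
}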

2. Automated Testset Generation (Synthetic Data)

If you use RAGAS or LlamaIndex, you can automate the creation of these sets. The process works like this (a hand-rolled sketch of the same loop follows the diagram):

  1. Document Sampling: The framework picks a random chunk from your vector database.
  2. Fact Extraction: An LLM identifies the key facts in that chunk.
  3. Question Formulation: The LLM writes a question that can only be answered by those specific facts.
  4. Answer Generation: The LLM writes the "Ground Truth" answer based on the chunk.

graph TD
    D[Raw Document] --> C[Chunk A]
    C --> F[Fact: 'Company X started in 2005']
    F --> Q[Question: 'When did Company X begin?']
    F --> A[Answer: '2005']
    Q & A --> G[Golden Set Row]
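To make these four steps concrete, the sketch below shows what a hand-rolled version of the loop could look like. The `llm` helper, the prompt wording, and the chunk sampling are illustrative assumptions; frameworks like RAGAS wrap this loop (plus a critic step) for you.

import random

def generate_golden_row(chunks: list[str], llm) -> dict:
    """Sketch of the sample -> extract -> ask -> answer loop.
    `llm(prompt)` is an assumed helper that returns a string completion."""
    chunk = random.choice(chunks)                                        # 1. Document sampling
    facts = llm(f"List the key facts stated in this passage:\n{chunk}")  # 2. Fact extraction
    question = llm(
        "Write one question that can only be answered using these facts:\n" + facts
    )                                                                    # 3. Question formulation
    answer = llm(
        f"Answer the question using only the passage.\nPassage: {chunk}\nQuestion: {question}"
    )                                                                    # 4. Answer generation
    return {"question": question, "contexts": [chunk], "ground_truth": answer}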

3. Increasing Difficulty: Evolution

A good test set shouldn't contain only easy questions. RAGAS uses a technique called Evolution to create harder test cases (a minimal prompt sketch follows the list):

  • Reasoning Evolution: Takes a simple question and makes it require two steps of logic.
  • Multi-Context Evolution: Takes two unrelated chunks and asks a question that requires connecting them.
  • Conditional Evolution: Adds a constraint (e.g., "Answer for employees in California only").
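As an illustration, a Reasoning Evolution can be as simple as a rewrite prompt applied to an existing row. The prompt wording and the `llm` helper below are assumptions, not the exact RAGAS internals:

def evolve_reasoning(question: str, context: str, llm) -> str:
    """Sketch: rewrite a simple question so answering it needs two steps of logic.
    `llm(prompt)` is an assumed string-in/string-out helper."""
    prompt = (
        "Rewrite the question so that answering it requires combining at least two "
        "pieces of information from the context, rather than a single lookup.\n"
        f"Context: {context}\n"
        f"Original question: {question}\n"
        "Rewritten question:"
    )
    return llm(prompt)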

4. Python Implementation: Generating a Test Set

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# 1. Initialize models
generator_llm = ChatOpenAI(model="gpt-4o")
critic_llm = ChatOpenAI(model="gpt-4o") # A 'critic' ensures the questions are good
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# 2. Generate
# documents = your list of LangChain document objects
testset = generator.generate_with_langchain_docs(
    documents, 
    test_size=10, 
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25}
)

# 3. Save to CSV for evaluation
testset.to_pandas().to_csv("my_golden_set.csv", index=False)
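
Once saved, the Golden Set can be replayed against your pipeline whenever something changes. A minimal sketch, assuming a placeholder `answer_with_rag` function standing in for your own retrieval-plus-generation step and the column names exported by the generator above:

import pandas as pd

def answer_with_rag(question: str) -> str:
    """Placeholder for your own retrieval + generation pipeline."""
    raise NotImplementedError

golden = pd.read_csv("my_golden_set.csv")

# Replay every golden question through the pipeline to collect answers.
golden["answer"] = golden["question"].apply(answer_with_rag)
golden.to_csv("pipeline_answers_chunk500.csv", index=False)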

5. Cleaning the Golden Set

Synthetic data is roughly 90% accurate, but never 100%. The Human-in-the-loop Rule: after generating 100 questions, an engineer or subject-matter expert should spend an hour "Auditing" the set (a small pandas helper for this is sketched after the list):

  • Delete any nonsensical questions.
  • Fix any minor errors in the "Ground Truth" answers.
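A small pandas helper keeps this audit lightweight. The `keep` column below is our own convention for the reviewer to fill in, not something RAGAS produces:

import pandas as pd

golden = pd.read_csv("my_golden_set.csv")

# Hand this file to the reviewer; they add a 'keep' column (True/False)
# and correct any ground_truth text in place.
golden.to_csv("golden_set_for_audit.csv", index=False)

# After the audit, keep only the approved rows.
audited = pd.read_csv("golden_set_for_audit.csv")
audited[audited["keep"] == True].to_csv("golden_set_audited.csv", index=False)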

Once audited, this dataset becomes your Source of Truth for the rest of your system's life.


6. Real-World Value: Tracking Regressions

If you change your Chunk Size (Module 10) from 500 words to 1000 words:

  1. Run your RAG pipeline against your Golden Set.
  2. Compare the old results to the new ones (a small comparison sketch follows this list).
  3. If the "Reasoning" questions fail more often with 1000-word chunks, you have Found a Regression.

Summary and Key Takeaways

A Golden Set is the "Unit Test" for your AI.

  1. Synthetic Generation saves hundreds of hours of manual work.
  2. Evolution ensures your test set covers both easy and complex queries.
  3. The Critic/Judge Pattern: Use one LLM to create and another to verify.
  4. Human Audit is the final step to ensure your "Gold" is actually gold.

In the next lesson, we will look at Continuous monitoring and observability, learning how to track these same metrics for Live Users in the real world.


Exercise: Testset Strategy

You are building a RAG bot for a "Recipe Website."

  1. Use the "Evolution" concept to turn a simple question ("How do I bake a cake?") into a Reasoning Evolution question.
  2. Why is "Multi-context Evolution" (connecting two facts) specifically important for the Vector Recall portion of your evaluation?
  3. If your synthetic test generator keeps creating "Nonsense" questions, would you change the LLM Model or the Chunk Size first?

Congratulations on completing Module 11 Lesson 3! Your benchmark is ready.
