Project: Evaluating Your AI Brain with RAGAS

Put your evaluation skills into practice. Build a test script that measures the quality of your RAG bot and produces a professional performance report.

Project: High-Stakes Evaluation

You have built the bot (Module 10). Now, you must prove it is safe for users. Welcome to the Module 11 Capstone Project.

In this exercise, we will take the Document Q&A bot you built in the last module and run it through a rigorous RAGAS Evaluation Pipeline. We will generate a synthetic test set, run our bot against it, and analyze the resulting scores to find our "Weakest Link."


1. Project Objectives

  • Generate: Create 10 "Golden Question" pairs from a PDF (Lesson 3).
  • Execute: Run your RAG bot to answer those 10 questions and capture the retrieved context (see the sketch just after this list).
  • Score: Use the RAGAS library to calculate Faithfulness and Answer Relevancy.
  • Analyze: Identify if your failures are in Retrieval or Generation.
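
Objective 2 is where the real work happens: RAGAS needs not just your bot's answers but also the exact context chunks it retrieved for each question. Below is a minimal collection-loop sketch. The rag_bot object and its query() method are hypothetical stand-ins for whatever interface your Module 10 bot exposes; adapt the attribute names to your own code.

# Hypothetical collection loop -- rag_bot.query() is a stand-in for YOUR bot's interface.
golden_questions = ["What is the company's refund period?", "Who is the CEO of Acme Inc?"]
ground_truths = ["30 days", "Jane Doe"]

records = {"question": [], "answer": [], "contexts": [], "ground_truth": []}

for question, truth in zip(golden_questions, ground_truths):
    response = rag_bot.query(question)                    # assumed to return the answer and the retrieved chunks
    records["question"].append(question)
    records["answer"].append(response.answer)             # the generated answer (string)
    records["contexts"].append(response.source_chunks)    # the retrieved chunks (list of strings)
    records["ground_truth"].append(truth)

The resulting records dictionary has exactly the shape the evaluation script below expects.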

2. Setting Up the Environment

You will need the ragas library itself, the Hugging Face datasets library (RAGAS evaluates a Dataset object), and pandas for the report table. Note that the RAGAS API shifts between versions; the script below uses the question/answer/contexts/ground_truth column format.

pip install ragas datasets pandas

3. The Evaluation Script (evaluate_bot.py)

This script acts as the "Teacher" grading your RAG bot's homework.

import os
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision

# 1. SETUP ENVIRONMENT
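# By default, RAGAS calls an OpenAI model as the "Judge" LLM, which is why this key is required.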
os.environ["OPENAI_API_KEY"] = "your_key"

# 2. PREPARE THE DATASET
# In a real project, you would gather these from your bot's logs
# or generate them synthetically as shown in Lesson 3.
data = {
    "question": [
        "What is the company's refund period?",
        "Who is the CEO of Acme Inc?"
    ],
    "answer": [
        "The refund period is 30 days.",
        "The CEO is Jane Doe."
    ],
    "contexts": [
        ["Our refund policy allows returns within 30 days of purchase."],
        ["Founded by John Smith, Acme Inc is currently led by CEO Jane Doe."]
    ],
    "ground_truth": [
        "30 days",
        "Jane Doe"
    ]
}

dataset = Dataset.from_dict(data)

# 3. RUN THE RAGAS EVALUATION
def run_benchmark(test_dataset):
    print("Beginning Evaluation (calling Judge LLM)...")
    
    result = evaluate(
        test_dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_recall,
            context_precision
        ]
    )
    
    return result

# 4. EXECUTE & DISPLAY
if __name__ == "__main__":
    evaluation_results = run_benchmark(dataset)
    
    # Export to a nice table
    df = evaluation_results.to_pandas()
    print("\n--- PERFORMANCE SUMMARY ---\n")
    print(df[['question', 'faithfulness', 'answer_relevancy']])
    
    print("\nGlobal Mean Scores:")
    print(evaluation_results)
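
If you want a record of each run (handy for the iteration loop in Section 5), append a few lines to the end of the __main__ block to write the per-question table to disk. A minimal sketch using plain pandas; the timestamped filename is just a suggestion:

    # Persist this run so you can compare scores across experiments.
    from datetime import datetime
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    df.to_csv(f"ragas_report_{timestamp}.csv", index=False)
    print(f"Saved full report to ragas_report_{timestamp}.csv")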

4. How to Read Your Report

If your report shows these scores:

  • Faithfulness: 1.0
  • Answer Relevancy: 0.5

Analysis: Your bot is telling the truth (it's faithful to the context), but it is answering the wrong question or providing irrelevant information. You likely need to tune your System Prompt to be more direct.

If your report shows:

  • Faithfulness: 0.6
  • Context Recall: 1.0

Analysis: You are finding the right documents (Perfect Recall), but the bot is making things up anyway (Low Faithfulness). This is a hallucination problem. You need to use a stronger LLM (GPT-4 vs GPT-3.5) or a stricter prompt.
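
You can automate this triage rather than eyeballing the table. Here is a short sketch that flags the weakest rows in the df produced by evaluate_bot.py; the 0.7 threshold is an arbitrary starting point, not a RAGAS standard:

# Flag questions where either metric falls below a chosen threshold.
THRESHOLD = 0.7
weak_rows = df[(df["faithfulness"] < THRESHOLD) | (df["answer_relevancy"] < THRESHOLD)]

for _, row in weak_rows.iterrows():
    if row["faithfulness"] < THRESHOLD:
        print(f"Possible hallucination (generation problem): {row['question']}")
    if row["answer_relevancy"] < THRESHOLD:
        print(f"Off-topic answer (prompt problem): {row['question']}")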


5. Improving Your Scores: The Iteration Loop

Once you have your RAGAS baseline, pick one thing to change:

  1. Change your Chunk Size from 1000 to 500.
  2. Re-run evaluate_bot.py.
  3. Did the scores go up or down?

This is the way of the Senior AI Engineer. You don't guess; you measure.
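
For example, if your Module 10 bot chunks documents with LangChain, step 1 above is usually a one-line change. A sketch assuming RecursiveCharacterTextSplitter (the import path varies slightly between LangChain versions, and documents stands for your loaded PDF pages):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Before: chunk_size=1000. After: chunk_size=500 -- rebuild the index, then re-run evaluate_bot.py.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # hypothesis to test: smaller chunks may improve Context Precision
    chunk_overlap=50,   # keep some overlap so sentences aren't cut mid-thought
)
chunks = splitter.split_documents(documents)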


6. Real-World Dashboard Simulation

For your final task, imagine this data is being pushed to a dashboard (like Grafana). Create a textual summary of your bot's health: "Our Support Bot currently has a 92% Faithfulness rating but suffers from a 70% Context Precision. We recommend tuning the vector index configuration (for example, the HNSW search parameters) to improve retrieval accuracy."
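
You don't have to write that summary by hand. A small sketch that builds the health blurb from the per-question dataframe, assuming the metric columns are named after the metrics as in the script above (the wording of the recommendation is yours to tailor):

# Turn the per-question scores into a one-paragraph health summary.
faithfulness_pct = round(df["faithfulness"].mean() * 100)
precision_pct = round(df["context_precision"].mean() * 100)

summary = (
    f"Our Support Bot currently has a {faithfulness_pct}% Faithfulness rating "
    f"but suffers from a {precision_pct}% Context Precision. "
    "We recommend tuning the retriever configuration to improve search accuracy."
)
print(summary)

This is the kind of string you would push to Slack, a status page, or a Grafana annotation.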


Summary and Module 11 Wrap-up

You have completed the full lifecycle of vector database engineering!

  • You can Store and Search vectors (Modules 3-6).
  • You can build Hybrid and Multi-Modal systems (Modules 7-9).
  • You can build RAG agents that cite their sources (Module 10).
  • You can Evaluate and Monitor your systems for production (Module 11).

The Journey Continues

You are now part of the small percentage of engineers who understand the Infrastructure behind the AI revolution.

Upcoming Modules (12-20) will dive into specialized topics like Agentic Workflows, Local Deployment (Ollama), and Enterprise Security (RBAC). But the core skills you've learned here—dimensionality, similarity math, indexing, and RAG—will be the foundation for everything you build next.


Final Project Exercise: The Benchmark Challenge

  1. Add a "Nonsense" question to your evaluation dataset (e.g., "What is the color of a Tuesday?").
  2. Run RAGAS.
  3. How does the "Answer Relevancy" score change?
  4. How would you update your System Prompt to handle nonsense questions more gracefully while maintaining a high Faithfulness score? (One possible starting point is sketched below.)
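
One possible answer to question 4 is a "graceful refusal" clause in the System Prompt. The sketch below is a starting point, not the only correct wording; the goal is to keep Faithfulness high by forbidding answers that go beyond the retrieved context:

SYSTEM_PROMPT = """You are a document Q&A assistant.
Answer ONLY using the provided context.
If the context does not contain the answer, or the question cannot be answered
from the documents, reply exactly: "I can't answer that from the documents I have."
Never invent information."""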

Congratulations! You have completed the core "Vector Databases" certification prep. Your brain is now multi-dimensional.
