
The Metrics of Graph Retrieval: Measuring Success
Master the KPIs of Graph RAG. Learn how to calculate Recall, Precision, and Faithfulness, and why these metrics differ when you are measuring connections vs. simple semantic similarity.
You’ve built the graph. You’ve refined the retrieval. But how do you know whether it is actually better than a simple vector search? In the AI world, "vibes" aren't enough. You need metrics. Evaluating a Graph RAG system is harder than evaluating a standard RAG pipeline because you have to measure the accuracy of the path, not just the accuracy of the fact.
In this lesson, we will explore the three "Golden Metrics" for RAG: Recall, Precision, and Faithfulness. We will see how each is adapted for graphs, where Precision means retrieving the right nodes, and Faithfulness means ensuring the LLM didn't invent a relationship that isn't in the graph.
1. Recall: Did We Find the Evidence?
Definition: The percentage of relevant facts retrieved from the graph.
In a knowledge graph, recall is binary at the level of a single fact: if the relationship (A)-[:OWNS]->(B) exists and the query requires it, did the generated Cypher query find it? Aggregated over all required facts, it becomes the percentage above.
- Low Recall: Usually caused by a poor Cypher generator or a disconnected graph (Islands).
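A minimal sketch of how fact-level recall can be computed, assuming you have a hand-labelled set of "gold" triplets for each test question (the function and variable names here are illustrative, not from any specific library):

def fact_recall(retrieved_triplets, gold_triplets):
    # Both arguments are sets of (subject, relation, object) tuples.
    # gold_triplets is the hand-labelled answer key for one test question.
    if not gold_triplets:
        return 1.0  # nothing was required, so nothing could be missed
    return len(retrieved_triplets & gold_triplets) / len(gold_triplets)

# Example: the question needs two facts, but the Cypher query only found one
gold = {("Acme", "OWNS", "WidgetCo"), ("WidgetCo", "LOCATED_IN", "Berlin")}
retrieved = {("Acme", "OWNS", "WidgetCo")}
print(fact_recall(retrieved, gold))  # 0.5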
2. Precision: Is the Context "Noisy"?
Definition: The percentage of retrieved facts that were actually useful for the answer.
If you retrieve 100 neighbors but the LLM only uses 5 to answer the question, your Precision is 5%. This is a problem! It means you are wasting money on tokens and risking "Lost-in-the-Middle" effects in the LLM.
The Graph Fix: Use PageRank-based ranking (Module 11) to prune the retrieved subgraph before it hits the prompt.
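A matching sketch for precision; how you decide which retrieved facts the answer actually "used" (a judge LLM, citation matching, or manual labelling) is up to your pipeline and is assumed to have happened already:

def context_precision(retrieved_triplets, used_triplets):
    # used_triplets: the subset of retrieved facts the final answer relied on.
    if not retrieved_triplets:
        return 0.0
    return len(used_triplets & retrieved_triplets) / len(retrieved_triplets)

# 100 facts were stuffed into the prompt, but only 5 were needed
retrieved = {("n", str(i), "m") for i in range(100)}
used = {("n", str(i), "m") for i in range(5)}
print(context_precision(retrieved, used))  # 0.05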
3. Faithfulness (Groundedness): The Anti-Hallucination Metric
Definition: Does the answer only contain facts that exist in the retrieved subgraph?
This is the most important metric for Enterprise AI.
- Fail: The LLM says "The project ends in 2025" but the graph says "2024".
- Graph RAG's Advantage: Because the retrieval is a structured set of triplets, we can programmatically check whether the LLM's claims match the graph's connections, as sketched below.
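As a sketch of that programmatic check, assume a separate extraction pass has already turned the answer's claims back into triplets; groundedness then reduces to set membership:

def unsupported_claims(claimed_triplets, retrieved_triplets):
    # Returns every claim in the answer that does not appear in the subgraph.
    # Turning the free-text answer back into claimed_triplets is the hard part;
    # here we assume that extraction has already been done.
    return [claim for claim in claimed_triplets if claim not in retrieved_triplets]

subgraph = {("Project Atlas", "ENDS_IN", "2024")}
claims = [("Project Atlas", "ENDS_IN", "2025")]
print(unsupported_claims(claims, subgraph))  # flags the 2025 claim as a hallucination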
graph LR
subgraph "The Triangle of RAG Metrics"
R[Recall: Coverage]
P[Precision: Noise]
F[Faithfulness: Truth]
end
R --- P
P --- F
F --- R
style R fill:#4285F4,color:#fff
style P fill:#f4b400,color:#fff
style F fill:#34A853,color:#fff
4. The "Path-Correctness" Metric (Advanced)
Unique to Graph RAG is the Path Metric.
Consider a question whose answer requires the 3-hop chain A -> B -> C:
- Full Credit: The system retrieves all three nodes and explains the chain.
- Partial Credit: The system retrieves A and C but misses the "bridge" node B. (This leads to the AI guessing how they are related; see the sketch below.)
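One simple scoring scheme, sketched under the assumption that you know the gold path for each test question (stricter variants award zero credit whenever a bridge node is missing):

def path_correctness(retrieved_nodes, gold_path):
    # gold_path: the ordered chain of nodes the reasoning requires, e.g. ["A", "B", "C"]
    hits = [node for node in gold_path if node in retrieved_nodes]
    return len(hits) / len(gold_path)  # 1.0 = full credit, fractions = partial credit

print(path_correctness({"A", "C"}, ["A", "B", "C"]))  # ~0.67 -- the bridge "B" was lost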
5. Implementation: A Basic Evaluation Helper in Python
The helper below is a sketch of that "Judge LLM" pattern; judge_llm is assumed to be any callable that sends a prompt to your judge model and returns its text reply.

def evaluate_groundedness(llm_answer, retrieved_triplets, judge_llm):
    # Faithfulness is often scored by a 'Judge LLM'. judge_llm is assumed to be
    # any callable that takes a prompt string and returns the model's response.
    fact_list = "\n".join(f"({s})-[:{r}]->({o})" for s, r, o in retrieved_triplets)
    prompt = f"""
    Check if the Answer is supported by the Fact List.
    Answer: {llm_answer}
    Fact List: {fact_list}
    If the Answer mentions a relationship NOT in the Fact List,
    reply with the single word HALLUCINATION.
    Otherwise reply with a support score between 0 and 1.
    """
    verdict = judge_llm(prompt).strip()
    # Map the judge's reply onto a 0-1 score
    return 0.0 if "HALLUCINATION" in verdict else float(verdict)
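Usage might look like the following, with judge_llm shown as a stubbed placeholder you would replace with your real LLM client:

def judge_llm(prompt):
    # Placeholder: swap in a call to your actual chat-completion client here.
    return "0.9"

triplets = [("Acme", "OWNS", "WidgetCo")]
print(evaluate_groundedness("Acme owns WidgetCo.", triplets, judge_llm))  # 0.9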
6. Summary and Exercises
Metrics turn "AI Experiments" into "Production Engineering."
- Recall tells you if your graph is comprehensive enough.
- Precision tells you if your retrieval is efficient.
- Faithfulness tells you if your LLM is behaving.
- Path-Correctness measures the "Reasoning Integrity" of your multi-hop walks.
Exercises
- Metric Duel: Your bot is 100% accurate but very slow (retrieving 1,000 facts per question). Is your Precision high or low?
- Recall Check: If a user asks "Who is the manager?" and your system retrieves the "Email" but not the "Reporting Line," did you pass the Recall test?
- Visualization: Draw a graph with 3 nodes. What is the Recall if your code only retrieves the first 2 nodes for a "Path" query?
In the next lesson, we will look at automating this with a judge: Using G-Eval for Graph-Grounded Evaluation.