
End-to-End Performance Benchmarking: Latency vs. Wisdom
Measure the speed of thought. Learn how to profile the entire Graph RAG pipeline—from embedding generation to complex graph traversal—to identify the bottlenecks in your AI infrastructure.
In Graph RAG, we have a "Speed Limit." Unlike a simple vector search (which is almost instantaneous), graph retrieval involves multiple stages, each of which consumes time. To deliver a "Snappy" experience for your users, you must understand where the seconds are being spent.
In this lesson, we will learn how to perform an End-to-End Benchmark. We will look at the Latency Breakdown of a typical query and identify the "Big Three" bottlenecks: Embedding Time, Graph Traversal Time, and LLM Synthesis Time. We will see why "Faster is not always better" and how to find the sweet spot between a 2-second answer and a 10-second "Brilliant" answer.
1. The Latency Breakdown
A typical Graph RAG query follows this timeline:
- Embedding (Vectorizing the Query): 200ms - 500ms.
- Index Lookup (Finding the Entrance Node): 50ms - 100ms.
- Graph Traversal (The Multi-Hop Walk): 100ms - 2s (Depends on depth).
- Network Overhead (DB to App): 50ms.
- LLM Thinking (Synthesis): 2s - 8s (Depends on model and context size).
Total: roughly 2.5s - 11s.
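To make the budget concrete, here is a minimal sketch that sums the worst-case figures above and reports the dominant stage. The stage names and millisecond values are illustrative, taken straight from the ranges in the list:
# Illustrative worst-case budget (ms), matching the ranges above.
LATENCY_BUDGET_MS = {
    "embedding": 500,
    "index_lookup": 100,
    "graph_traversal": 2000,
    "network": 50,
    "llm_synthesis": 8000,
}

total_ms = sum(LATENCY_BUDGET_MS.values())                      # 10650 ms
slowest_stage = max(LATENCY_BUDGET_MS, key=LATENCY_BUDGET_MS.get)

print(f"Worst case: {total_ms} ms, dominated by '{slowest_stage}'")
Running this shows that LLM synthesis dominates the worst case, which is exactly where the next section tells you to focus.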
2. Identifying the Bottleneck
- If Graph Traversal is slow (>1s): You likely have a "Fan-out" problem (Module 8) or missing indexes.
- If LLM Thinking is slow (>5s): Your retrieved context is too large ("Information Overload"). You should prune your results more aggressively.
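A minimal pruning sketch, assuming each retrieved fact is a dict carrying a relevance "score" and an estimated "tokens" count (both hypothetical fields, not tied to any particular library):
def prune_facts(facts, max_tokens=2000):
    """Keep only the highest-scoring facts that fit inside a token budget."""
    pruned, used = [], 0
    for fact in sorted(facts, key=lambda f: f["score"], reverse=True):
        if used + fact["tokens"] > max_tokens:
            continue
        pruned.append(fact)
        used += fact["tokens"]
    return pruned
Capping the context at a fixed token budget directly shortens the LLM's "thinking" stage, usually at a negligible cost in answer quality when the pruned facts are low-scoring anyway.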
3. The "Throughput" Metric
Besides individual query speed (Latency), you must measure Throughput: how many concurrent users can your graph handle?
- Graph databases are "Compute Heavy." If 100 users all run a 5-hop Dijkstra path search at the same time, your database CPU will hit 100%.
Solution: Use Read Replicas (Module 7) and Caching. If the same question is asked twice, don't walk the graph again—serve the context from a fast cache like Redis.
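A minimal caching sketch using redis-py; the key scheme, the one-hour TTL, and the query_graph helper are assumptions for illustration:
import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379)

def cached_graph_context(question, ttl_seconds=3600):
    # Hash the normalized question so identical queries share one cache entry.
    key = "graphrag:" + hashlib.sha256(question.lower().strip().encode()).hexdigest()

    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)            # Served from Redis, no graph walk.

    facts = query_graph(question)         # Expensive multi-hop traversal (assumed helper).
    cache.set(key, json.dumps(facts), ex=ttl_seconds)
    return facts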
graph LR
U((User)) -->|Question| E[Embedding Agent]
E -->|100ms| G[Graph Engine]
G -->|500ms| S[LLM Synth]
S -->|3s| A[Answer]
subgraph "Latency Profile"
E
G
S
end
style G fill:#4285F4,color:#fff
style S fill:#34A853,color:#fff
4. Implementation: A Basic Latency Logger in Python
import time

def timed_retrieval(question):
    """Profile each stage of the Graph RAG pipeline with simple wall-clock timers."""
    start = time.time()

    # 1. Embed the question (get_embedding is a placeholder for your embedding client)
    t1 = time.time()
    vector = get_embedding(question)
    print(f"Embedding: {time.time() - t1:.2f}s")

    # 2. Query the graph (query_graph is a placeholder for your graph retrieval call)
    t2 = time.time()
    facts = query_graph(vector)
    print(f"Graph DB: {time.time() - t2:.2f}s")

    # 3. Synthesize the answer (call_llm is a placeholder for your LLM client)
    t3 = time.time()
    answer = call_llm(facts, question)
    print(f"LLM Synthesis: {time.time() - t3:.2f}s")

    print(f"TOTAL: {time.time() - start:.2f}s")
    return answer
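In production you would typically export these timings to a metrics system rather than printing them. One way to keep the logger tidy is a reusable timer context manager; this is a sketch using the same assumed helpers (get_embedding, query_graph, call_llm) as above, not tied to any particular monitoring stack:
import time
from contextlib import contextmanager

@contextmanager
def timed(stage, timings):
    """Record the wall-clock duration of a pipeline stage into the `timings` dict."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

def timed_retrieval_v2(question):
    timings = {}
    with timed("embedding", timings):
        vector = get_embedding(question)
    with timed("graph", timings):
        facts = query_graph(vector)
    with timed("llm", timings):
        answer = call_llm(facts, question)
    print({stage: f"{seconds:.2f}s" for stage, seconds in timings.items()})
    return answer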
5. Summary and Exercises
Benchmarking allows you to make Data-Driven Infrastructure Decisions.
- Profiling identifies which part of the "Tubes" is clogged.
- Aggressive Pruning is the best way to speed up the LLM synthesis.
- Indexing and Caching are the best ways to speed up the Graph engine.
- User Expectations: A "General Summary" should be fast; an "Investigative Audit" can be slow.
Exercises
- Latency Math: If your LLM charges $0.01 per 1,000 tokens, and your Graph Traversal returns 10,000 tokens of context, what is the "Cost-per-Query"? How can you reduce this while maintaining accuracy?
- Bottleneck Search: If the user question is 2 words long ("Who's CEO?") and the answer takes 10 seconds, where is the most likely problem?
- Visualization: Draw a graph representing "Throughput for 10 users." Show how adding a "Read Replica" doubles the capacity.
In the final lesson of Module 12, we will look at how to get better every day: Continuous Improvement: The Feedback Loop.