Recall vs Latency: Tuning Your Vector Database for Performance

Learn how to optimize vector search parameters. Understand the relationship between search speed and retrieval quality, and master tuning parameters like ef, M, and nprobe.

Recall vs Latency: The Great Balancing Act

In the world of traditional databases, "performance" usually means throughput and query speed. In vector databases, performance is a multi-dimensional trade-off: with an approximate index, you generally cannot increase speed without sacrificing accuracy (recall), and you cannot increase accuracy without spending more resources (RAM/CPU).

As a production AI engineer, your job is not to build the "fastest" or the "most accurate" system, but the most cost-effective and user-aligned system.

In this lesson, we will look at how to tune the engine. We will explore the specific parameters of HNSW and IVF and learn how to run "Recall-Latency Experiments."


1. Defining the Axes

Recall (The Quality Axis)

As we defined earlier, Recall is the percentage of the true nearest neighbors that the approximate search actually returns (a quick way to compute it is sketched below).

  • Low Recall (0.60): Fast but misses 40% of the best data.
  • High Recall (0.98): Slow but almost identical to exact search.
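
To make this measurable, recall@k for a single query is just the overlap between the approximate result set and the exact (brute-force) result set. A minimal sketch; the function name and example IDs are ours, purely for illustration:

def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# The approximate search missed two of the ten true neighbors -> recall 0.8
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 99, 98],
                  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))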

Latency (The Speed Axis)

The time it takes to return a result (Search Time).

  • Fast (< 10 ms): Essential for autocomplete or chat.
  • Slow (> 200 ms): Acceptable for batch processing or deep research.

graph LR
    A[Increase Search Effort] --> B[Higher Recall]
    A --> C[Higher Latency]
    D[Decrease Search Effort] --> E[Lower Recall]
    D --> F[Lower Latency]

2. Tuning the HNSW Engine

If you are using Pinecone (S1/P1 pods) or Chroma with HNSW, you have two primary levers to pull:

Lever 1: M (The Number of Connections)

This determines how many bidirectional links each vector has to its neighbors.

  • Default (usually 16 or 32): Good for most use cases.
  • Higher M: Better recall, but significantly more RAM and slower build times.

Lever 2: efConstruction and ef (Search Effort)

  • efConstruction: How hard the database tries to link nodes when building the index. Higher values produce a better-quality graph at the cost of slower indexing.
  • ef (or efSearch): How many nodes the database explores during a query.
    • Increase ef: You find better neighbors (higher recall), but the CPU has to do more work (higher latency). A configuration sketch follows this list.
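
To make these levers concrete, here is how they typically appear when creating a Chroma collection. Treat this as a sketch rather than a definitive recipe: the hnsw:* metadata keys below match older Chroma releases, and newer versions expose the same knobs through a collection configuration object, so check your client version.

import chromadb

client = chromadb.Client()

# A collection tuned for higher recall: a larger M densifies the graph (more RAM),
# and higher ef values increase build-time and query-time search effort.
collection = client.create_collection(
    name="tuned_docs",
    metadata={
        "hnsw:space": "cosine",        # distance metric
        "hnsw:M": 32,                  # links per node (graph density / RAM)
        "hnsw:construction_ef": 200,   # build-time effort (index quality)
        "hnsw:search_ef": 100,         # query-time effort (recall vs. latency)
    },
)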

3. Tuning the IVF Engine

If you are using an IVF (clustering-based) index, for example via FAISS or OpenSearch's Faiss engine, your levers are different:

Lever 1: nlist (Number of Clusters)

  • Higher nlist: More clusters mean smaller groups to search.
  • Impact: Speeds up search, but requires more training data and can lower recall if not balanced with nprobe.

Lever 2: nprobe (The Reach)

This is the most important "at-query-time" parameter. It defines how many of the closest clusters to scan for each query (see the sketch after this list).

  • nprobe = 1: Fast but potentially inaccurate.
  • nprobe = 10: Checks 10 clusters. Slower but much higher recall.
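
If your stack exposes the underlying index directly, both levers map onto plain index parameters. Here is a minimal sketch using FAISS (a common engine behind IVF deployments); the data is synthetic and the numbers are illustrative:

import numpy as np
import faiss

dim = 128
data = np.random.rand(100_000, dim).astype('float32')
query = np.random.rand(1, dim).astype('float32')

# nlist: number of clusters learned at training time (the index structure)
nlist = 1024
quantizer = faiss.IndexFlatL2(dim)                 # assigns vectors to clusters
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.train(data)                                  # IVF must be trained before adding vectors
index.add(data)

# nprobe: number of clusters scanned per query (the reach)
index.nprobe = 1                                   # fast, but may miss neighbors in nearby clusters
_, fast_ids = index.search(query, 10)

index.nprobe = 10                                  # slower, much higher recall
_, accurate_ids = index.search(query, 10)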

4. The Pareto Frontier: How to Pick Your Spot

Engineers visualize this using a Recall-Latency Curve (The Pareto Frontier). You plot your experiments:

  1. Run search with ef=10. Record Recall and Latency.
  2. Run search with ef=50. Record Recall and Latency.
  3. Run search with ef=200. Record Recall and Latency.

You will see a point where increasing ef (effort) gives you almost no extra recall but exploding latency. This is your "Optimization Point."

xychart-beta
    title "Recall vs. Search Effort (ef)"
    x-axis "ef (search effort)" [10, 20, 50, 100, 200, 500]
    y-axis "Recall (%)" 0 --> 100
    line [70, 85, 92, 96, 98, 99.5]
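
Once you have the measurements, "picking your spot" can be automated: choose the cheapest setting whose measured recall still clears your target. A minimal sketch; the numbers below are illustrative placeholders, not benchmarks:

# (ef, recall, latency_ms) triples measured in your own environment
measurements = [
    (10, 0.70, 0.8),
    (50, 0.92, 1.9),
    (100, 0.96, 3.5),
    (200, 0.98, 6.8),
    (500, 0.995, 15.2),
]

def pick_operating_point(measurements, target_recall):
    """Return the lowest-effort setting that still meets the recall target."""
    for ef, recall, latency_ms in sorted(measurements):
        if recall >= target_recall:
            return ef, recall, latency_ms
    return None  # nothing meets the target; raise M/efConstruction or rebuild the index

print(pick_operating_point(measurements, target_recall=0.95))  # -> (100, 0.96, 3.5)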

5. Python Example: Running a Recall Experiment

Let's use a short script to measure recall and latency at several ef values and find the optimal setting.

import time
import numpy as np
# Assuming we use hnswlib, a popular C++ library with Python bindings
import hnswlib 

# 1. Setup
dim = 128
num_elements = 50000
data = np.random.rand(num_elements, dim).astype('float32')

# 2. Build Index
p = hnswlib.Index(space='cosine', dim=dim)
p.init_index(max_elements=num_elements, ef_construction=200, M=16)
p.add_items(data)

# 3. Get 'Ground Truth' using Exact Search (brute force)
k = 10
query = np.random.rand(1, dim).astype('float32')
bf = hnswlib.BFIndex(space='cosine', dim=dim)  # brute-force index = exact results
bf.init_index(max_elements=num_elements)
bf.add_items(data)
ground_truth_indices, _ = bf.knn_query(query, k=k)

# 4. The Experiment
ef_values = [10, 50, 100, 500]

print(f"{'efSearch':<10} | {'Recall':<10} | {'Latency (ms)':<15}")
print("-" * 40)

for ef in ef_values:
    p.set_ef(ef)  # Adjust search effort at query time
    
    start = time.perf_counter()
    labels, distances = p.knn_query(query, k=k)
    end = time.perf_counter()
    
    # Calculate Recall: how many of our approximate results are in the ground truth?
    found_correct = len(set(labels[0]) & set(ground_truth_indices[0]))
    recall = found_correct / k
    latency_ms = (end - start) * 1000
    
    print(f"{ef:<10} | {recall:<10.2f} | {latency_ms:<15.4f}")
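
One caveat about the timing above: a single query is a noisy measurement. In practice, you would time many queries and report percentiles (p50/p95) rather than one number. A small extension of the same experiment, reusing the index p, the dimension dim, and k from the script above:

num_queries = 200
queries = np.random.rand(num_queries, dim).astype('float32')

p.set_ef(100)
latencies_ms = []
for q in queries:
    start = time.perf_counter()
    p.knn_query(q, k=k)
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"p50: {np.percentile(latencies_ms, 50):.3f} ms | p95: {np.percentile(latencies_ms, 95):.3f} ms")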

6. Real-World Decision Framework

How do you choose your target recall?

  • E-commerce Recommendation (target recall 0.80 - 0.90): Low latency is more important than showing the "perfect" product.
  • General Chatbot / RAG (target recall 0.90 - 0.95): You need the right context, but a small delay of 50 ms is okay.
  • Enterprise Legal Search (target recall 0.98+): Missing a document has high business risk; scaling hardware is cheaper than a lawsuit.
  • Anomaly / Fraud Detection (target recall 0.99+): False negatives are unacceptable, so higher search latency is tolerated.

Summary and Key Takeaways

Optimization is not a one-time task; it's an ongoing cycle of measurement.

  1. HNSW tuning relies on M (memory/graph density) and ef (search effort).
  2. IVF tuning relies on nlist (structure) and nprobe (reach).
  3. Always establish Ground Truth: You cannot measure recall if you don't know the "True" answer using exact search.
  4. The Goal is Efficiency: Find the lowest "search effort" that still meets your application's recall requirements.

In the next lesson, we will move beyond vectors and look at Filtering and Metadata Constraints. We will learn how to combine "The math of vectors" with "The logic of SQL."


Exercise: The Latency Budget

You are building an AI agent that must respond in 5 seconds.

  • LLM generation takes 4 seconds.
  • Embedding the query takes 0.5 seconds.
  • Networking overhead takes 0.2 seconds.
  1. What is your Latency Budget for the vector database search?
  2. If your current ef=200 search takes 600ms, and your ef=50 search takes 250ms, which one MUST you choose to stay within the 5-second total limit?
  3. How much Recall are you willing to sacrifice to meet the User Experience goal?

Designing for the total System Latency is the hallmark of a Senior AI Infrastructure Engineer.
