
The Query Engine: Executing the Search Pipeline
Learn how a vector database processes a query. Explore the query lifecycle from embedding to result aggregation, re-ranking, and metadata filtering.
We have the Index (The Map) and the Storage (The Library). Now we need the Query Engine (The Librarian).
The Query Engine is the orchestration layer that takes a user's request and manages the pipeline of activity required to produce a result. In a modern vector database, this isn't a simple lookup; it is a multi-stage execution pipeline that involves parallel processing, filtering, and cross-checking.
In this lesson, we will deconstruct the Query Lifecycle and learn about the "Scoring" mechanisms that define which results eventually make it to your LLM.
1. The Stages of a Vector Query
When you call db.query(vector=q, top_k=5), five distinct things happen inside the query engine:
Stage 1: Parsing and Validation
The engine checks if the query is well-formed. Do the vector dimensions match the index? Is the filter syntax correct?
Stage 2: Filtering (Pre-filtering)
If you have a metadata filter (e.g., user_id = 99), the engine identifies the allowed set of document IDs. In high-performance systems, this uses bitsets or specialized metadata indexes (like Bloom Filters).
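The pre-filtering stage can be sketched as building an allow-list of document IDs before the vector search ever runs. Here is a minimal sketch, assuming integer document IDs and a simple in-memory metadata table (all names are hypothetical — a real engine would use a compressed bitset or metadata index, not a Python set):

```python
# Hypothetical metadata table: doc_id -> metadata fields
metadata = {
    1: {"user_id": 99},
    2: {"user_id": 42},
    3: {"user_id": 99},
}

def build_allowed_ids(metadata, key, value):
    # Stage 2: identify the set of document IDs the filter permits.
    # The vector search (Stage 3) will only score IDs in this set.
    return {doc_id for doc_id, meta in metadata.items() if meta.get(key) == value}

allowed = build_allowed_ids(metadata, "user_id", 99)
print(allowed)  # {1, 3}
```

The key property is ordering: the allow-list is computed before graph traversal, so the index never wastes hops scoring documents the filter would discard anyway.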
Stage 3: The Vector Search
The engine takes the filtered set and navigates the Index Layer. It traverses the HNSW graph or visits the IVF clusters as instructed by parameters like ef or nprobe.
Stage 4: Aggregation and Deduplication
In a sharded database (multiple servers), each server returns its own "Top 5." The Query Engine collects these results, removes any duplicates, and sorts them globally to find the "Global Top 5."
Stage 5: Final Scoring and Re-ranking
Optionally, the query engine can perform a more expensive calculation (like a Cross-Encoder Re-rank) on the final candidates to improve precision.
graph LR
A[Raw Query] --> B[Filter Stage]
B --> C[Vector Search Stage]
C --> D[Aggregation Stage]
D --> E[Re-ranking Stage]
E --> F[Final Result]
2. Multi-Stage Scoring
A common misconception is that vector databases only use "Vector Similarity" to score results. In production, we often use Hybrid Scoring.
Score Fusion (Weighted Averages and RRF)
As discussed in Module 1, the Query Engine can combine scores from:
- Semantic score: (0.92)
- Keyword score: (0.85)
- Freshness score: (Based on date metadata)
The simplest fusion is a weighted average:
FinalScore = (0.7 * Semantic) + (0.2 * Keyword) + (0.1 * Freshness)
Reciprocal Rank Fusion (RRF) goes one step further: instead of averaging raw scores, which may live on incompatible scales, it combines the rank of each document in every result list.
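Reciprocal Rank Fusion scores each document by the reciprocal of its rank in every result list, using a smoothing constant (commonly k = 60). A minimal sketch:

```python
def rrf_fuse(ranked_lists, k=60):
    # ranked_lists: lists of doc IDs, best-first (e.g., one list per signal).
    # Each appearance contributes 1 / (k + rank); higher total = better.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_ranking = ["doc1", "doc2", "doc3"]  # from vector search
keyword_ranking = ["doc3", "doc1", "doc4"]   # from BM25 / keyword search
print(rrf_fuse([semantic_ranking, keyword_ranking]))
# ['doc1', 'doc3', 'doc2', 'doc4']
```

Because RRF only looks at ranks, a 0.92 cosine similarity and an 18.4 BM25 score can be fused without any normalization step.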
3. Parallel Execution and Sharding
A single vector search query can be computationally heavy. To handle this, the Query Engine parallelizes the work at two levels:
- Intra-Query Parallelism: The engine uses multiple CPU cores to search different parts of the same index simultaneously.
- Inter-Query Parallelism: The engine handles hundreds of users simultaneously by queuing requests and distributing them across a cluster of servers (shards).
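Intra-query parallelism can be sketched with a thread pool that searches index segments concurrently, then merges the partial results. The segment layout and scoring below are hypothetical stand-ins for a real ANN search:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical index split into two segments; each holds (doc_id, score) pairs.
segments = [
    [("doc1", 0.91), ("doc2", 0.75)],
    [("doc3", 0.88), ("doc4", 0.60)],
]

def search_segment(segment, top_k=1):
    # Stand-in for an ANN search over one segment of the index
    return sorted(segment, key=lambda x: x[1], reverse=True)[:top_k]

# Each segment is searched on its own worker thread...
with ThreadPoolExecutor() as pool:
    partials = pool.map(search_segment, segments)

# ...and the partial top-k lists are merged into a global ranking.
merged = sorted((hit for part in partials for hit in part),
                key=lambda x: x[1], reverse=True)
print(merged)  # [('doc1', 0.91), ('doc3', 0.88)]
```

This is the same gather-and-merge shape used across shards in Stage 4, just applied to segments of one index on one machine.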
4. Query Re-ranking (Two-Pass Search)
Sometimes the "Approximate" search is too coarse: it trades accuracy for speed and can miss or mis-order relevant results. To fix this, high-end query engines use a "Two-Pass" architecture:
- Pass 1 (Recall): Use a fast, imprecise index (ANN) to find 100 candidates.
- Pass 2 (Precision): Use a slow, highly accurate model (Cross-Encoder) to re-score only those 100 items.
Pass 2 is technically a "re-ranker," but modern search stacks (such as OpenSearch's rerank pipelines or the Cohere Rerank API) integrate this step directly into the query pipeline to simplify developer workflows.
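The two-pass flow above can be sketched as a cheap first pass over the whole corpus, followed by an expensive scorer applied only to the survivors. The scoring functions here are hypothetical placeholders for an ANN index and a cross-encoder:

```python
def two_pass_search(query, corpus, fast_score, slow_score,
                    recall_k=100, top_k=5):
    # Pass 1 (Recall): cheap score over everything, keep recall_k candidates
    candidates = sorted(corpus, key=lambda d: fast_score(query, d),
                        reverse=True)[:recall_k]
    # Pass 2 (Precision): expensive score over only those candidates
    return sorted(candidates, key=lambda d: slow_score(query, d),
                  reverse=True)[:top_k]
```

The point of the structure is cost containment: no matter how large the corpus grows, the expensive scorer only ever sees recall_k documents, so Pass 2's latency stays bounded.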
5. Python Concept: Simulating a Query Aggregator
If you have two shards (Server A and Server B), here is how the query engine's merge logic looks in Python.
# Simulated results from Shard A and Shard B
shard_a_results = [
    {"id": "doc1", "score": 0.95},
    {"id": "doc2", "score": 0.88},
    {"id": "doc3", "score": 0.70},
]
shard_b_results = [
    {"id": "doc4", "score": 0.99},
    {"id": "doc1", "score": 0.95},  # Duplicated doc
    {"id": "doc5", "score": 0.82},
]

def aggregate_results(shards, top_k=3):
    # 1. Gather results from every shard
    all_results = []
    for s in shards:
        all_results.extend(s)

    # 2. Deduplicate (keep the highest score for duplicates)
    unique_results = {}
    for res in all_results:
        doc_id = res["id"]
        if doc_id not in unique_results or res["score"] > unique_results[doc_id]["score"]:
            unique_results[doc_id] = res

    # 3. Sort globally by score, descending
    final_list = sorted(unique_results.values(), key=lambda x: x["score"], reverse=True)

    # 4. Trim to the global Top K
    return final_list[:top_k]

final = aggregate_results([shard_a_results, shard_b_results])
print("Global Top 3 Results:")
for f in final:
    print(f"[{f['score']}] {f['id']}")
6. Query Optimization and Caching
To speed up repeated questions (e.g., "What is the return policy?"), the Query Engine uses two types of caches:
- Vector Cache: keeping the most frequently accessed vectors resident in fast memory so they never have to be fetched from disk (hot vectors naturally end up in the CPU's caches as a side effect).
- Result Cache: storing the final JSON result of a specific query vector. If User B asks a question that produces an identical (or very similar) vector to User A's, the engine returns the cached result without touching the index.
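A result cache keyed on the query vector can be sketched by rounding vector components, so that near-identical query vectors map to the same cache entry. The rounding precision and cache policy below are assumptions for illustration; real engines typically use a similarity threshold rather than rounding:

```python
def make_cache_key(vector, precision=2):
    # Round each component so near-identical query vectors share a key
    return tuple(round(v, precision) for v in vector)

cache = {}

def cached_query(vector, run_search):
    key = make_cache_key(vector)
    if key not in cache:          # cache miss: actually hit the index
        cache[key] = run_search(vector)
    return cache[key]             # cache hit: skip the index entirely

calls = []  # track how often the index is really searched
result_a = cached_query([0.123, 0.456], lambda v: calls.append(v) or ["doc1"])
result_b = cached_query([0.121, 0.458], lambda v: calls.append(v) or ["doc1"])
print(len(calls))  # 1 -- the second, near-identical query was served from cache
```

Note the trade-off: coarser rounding (lower precision) raises the hit rate but also raises the risk of returning cached results for a query that is only superficially similar.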
Summary and Key Takeaways
The Query Engine is the brain that coordinates every stage of search.
- Multi-stage Pipeline: Querying is more than just vector search; it's filtering, aggregation, and scoring.
- Pre-filtering is handled by the Query Engine to ensure accuracy.
- Hybrid Scoring combines semantic, keyword, and metadata signals.
- Re-ranking provides high-precision results by refining the top candidates.
In the next lesson, we will look at Metadata Storage, exploring the specialized databases (like RocksDB or Badger) that vector stores use to manage your non-vector data efficiently.
Exercise: Re-ranking Decision
You are building a "Medical Diagnosis Search" for doctors.
- Semantic search (Pass 1) takes 50ms.
- Re-ranking with a Cross-Encoder (Pass 2) takes 300ms.
- When should you use the re-ranker? (Always? Only if the top match is < 0.85? Only if the doctor clicks "High Precision"?)
- How does the "two-pass" strategy impact your total API latency?
- If you have 1,000 users, how many more servers do you need if everyone uses the re-ranker?