
Advanced RAG: Re-ranking, Context Pruning, and Two-Stage Retrieval
Take your RAG system to production. Learn how to use Cross-Encoders for re-ranking and how to optimize context to save on LLM token costs.
Advanced RAG: The Search for Precision
In the previous lesson, we built a basic RAG pipeline. It works, but it has a problem: Noise. A vector search might find 10 documents that are "similar" to the query, but only 2 of them actually contain the answer. If you send all 10 to the LLM, you are:
- Wasting Money: Paying for unnecessary tokens.
- Confusing the Model: Increasing the risk that the model focuses on the wrong data.
- Increasing Latency: Larger prompts take longer to process.
In this lesson, we explore Advanced RAG. We will learn about Re-ranking, Context Pruning, and Query Expansion: the techniques that separate a hobbyist demo from a production-grade AI agent.
1. The Two-Stage Retrieval Pattern
In production, we don't just "Search and Prompt"; we use two stages:
Stage 1: Retrieval (Recall)
We use a fast, cheap Bi-Encoder (like OpenAI embeddings) to find 20-50 candidate documents from a billion-vector store.
Stage 2: Re-ranking (Precision)
We use a slow, expensive Cross-Encoder (like Cohere Rerank or BGE-Reranker) to evaluate the relationship between the query and each of those 50 candidates. The Cross-Encoder outputs a score for how relevant each document actually is. We then take the Top 5 for the final prompt.
graph TD
Q[Query] -->|Search| VDB[(Vector DB)]
VDB -->|Top 50| R[Re-ranker Model]
R -->|Final Top 5| LLM[Large Language Model]
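To make the Bi-Encoder vs. Cross-Encoder distinction concrete, here is a small sketch using the open-source sentence-transformers library. The model names are examples, not a requirement; any comparable pair of models would show the same pattern.

# Minimal sketch: scoring the same (query, document) pairs with a Bi-Encoder
# vs. a Cross-Encoder, using sentence-transformers (assumed installed).
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "How do I reset my router password?"
docs = [
    "To reset the admin password, hold the reset button for 10 seconds.",
    "Our routers ship in three colors: black, white, and grey.",
]

# Stage 1 style: the Bi-Encoder embeds query and docs independently (fast, cacheable)
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
q_emb = bi_encoder.encode(query, convert_to_tensor=True)
d_emb = bi_encoder.encode(docs, convert_to_tensor=True)
print("Bi-Encoder cosine scores:", util.cos_sim(q_emb, d_emb))

# Stage 2 style: the Cross-Encoder reads query and document together (slower, more precise)
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("Cross-Encoder scores:", cross_encoder.predict([(query, d) for d in docs]))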
2. Context Pruning: Less is More
Sometimes, even a single chunk has "Noise" (headers, footers, irrelevant sentences).
Context Pruning involves:
- Summarization: Using a smaller, cheaper LLM to summarize the retrieved snippets before sending them to the expensive main model.
- Key-Value Extraction: Extracting only specific entities (names, dates, prices) from the snippets.
- Sentence Filtering: Using semantic similarity to delete sentences within a chunk that don't relate to the query.
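As a rough sketch of the Sentence Filtering idea, the snippet below splits a chunk into sentences, embeds them with sentence-transformers (an assumed dependency), and drops any sentence below an arbitrary similarity threshold. The model name and threshold are illustrative, not a prescribed implementation.

# Sketch of sentence-level pruning: keep only sentences similar to the query.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def prune_chunk(query: str, chunk: str, threshold: float = 0.3) -> str:
    # Naive sentence split; a real pipeline would use a proper sentence tokenizer
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    q_emb = model.encode(query, convert_to_tensor=True)
    s_emb = model.encode(sentences, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, s_emb)[0]
    kept = [s for s, score in zip(sentences, scores) if float(score) >= threshold]
    return ". ".join(kept)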
3. Query Expansion and Rewriting
Users are bad at asking questions. If a user asks "Tell me about that thing with the taxes last year," a raw vector search might struggle.
The "Query Transform" Chain:
- Take the user's messy query.
- Send it to an LLM to generate 3 better versions of the query.
- "New tax laws 2024"
- "Corporate tax amendments"
- "Tax filing deadlines"
- Search the vector database for all 3 versions.
- Combine the results.
This is called Multi-Query Retrieval and significantly improves the odds of finding the right facts.
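Below is a minimal sketch of that chain. The OpenAI model name and the generic index.query interface (the same placeholder used in the example in Section 5) are assumptions, not a fixed API.

# Sketch of Multi-Query Retrieval: rewrite the query, search with every variant,
# and de-duplicate the combined results.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def multi_query_search(user_query, index, n_variants=3, n_results=10):
    prompt = (
        f"Rewrite the following question as {n_variants} short, specific search "
        f"queries, one per line:\n{user_query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    variants = [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

    seen, combined = set(), []
    for q in [user_query] + variants:
        for res in index.query(q, n_results=n_results):
            if res["text"] not in seen:  # de-duplicate across the searches
                seen.add(res["text"])
                combined.append(res)
    return combined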
4. HyDE: Hypothetical Document Embeddings
HyDE is a clever trick for RAG:
- Instead of searching with the query, you ask an LLM to write a fake answer to the query.
- You create an embedding of that fake answer.
- You search the database for "Real documents that look like this fake answer."
Why it works: the vector of a question often looks quite different from the vector of its answer, but the vector of a fake answer looks a lot like the vector of the real answer.
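Here is a rough sketch of HyDE in code. The OpenAI model names and the index.search_by_vector helper are assumptions for illustration; swap in whatever embedding model and vector store you actually use.

# Sketch of HyDE: search with the embedding of a *hypothetical* answer
# instead of the raw question.
from openai import OpenAI

client = OpenAI()

def hyde_search(question, index, n_results=10):
    # 1. Ask the LLM to write a plausible answer (it does not need to be true)
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a short passage answering: {question}"}],
    ).choices[0].message.content

    # 2. Embed the fake answer, not the question
    vector = client.embeddings.create(
        model="text-embedding-3-small", input=draft
    ).data[0].embedding

    # 3. Find real documents whose vectors look like the fake answer's vector
    return index.search_by_vector(vector, n_results=n_results)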
5. Python Example: Implementing Cohere Rerank
Let's look at how to integrate a re-ranker into your search logic.
import cohere

co = cohere.Client('YOUR_API_KEY')

def advanced_rag_search(query, index):
    # 1. STAGE 1: Standard Vector Search
    initial_results = index.query(query, n_results=50)  # Get a large list
    docs = [res['text'] for res in initial_results]

    # 2. STAGE 2: Re-rank with Cohere
    # This evaluates query vs document context specifically
    rerank_hits = co.rerank(
        query=query,
        documents=docs,
        top_n=5,
        model='rerank-english-v3.0'
    )

    # 3. Get the high-quality final docs
    final_context = [docs[hit.index] for hit in rerank_hits.results]
    return final_context

# Now you send final_context to your LLM prompt
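Two knobs are worth tuning here: the size of the Stage 1 candidate pool (50) and top_n (5). A bigger pool raises recall but increases re-ranking cost and latency; a smaller top_n keeps the final prompt lean. Note that the index object and its query interface are placeholders for whichever vector store you use; only the Cohere call is a real API.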
6. The "Missing Middle" Problem
LLMs are better at remembering information at the beginning or the end of a prompt; this is often called the "lost in the middle" effect. If the most important fact is buried in Chunk 3 out of 5, the model might overlook it.
Advanced Tip: After re-ranking, organize your chunks so the #1 most relevant chunk is first, and the #2 most relevant is last. This "brackets" the information for the LLM.
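A tiny sketch of that bracketing step, assuming the chunks arrive already sorted by re-rank score (most relevant first):

# Reorder re-ranked chunks so the best chunk opens the context
# and the second-best closes it, with the rest in between.
def bracket_chunks(ranked_chunks):
    if len(ranked_chunks) < 3:
        return ranked_chunks
    best, second, rest = ranked_chunks[0], ranked_chunks[1], ranked_chunks[2:]
    return [best] + rest + [second]

# Example: ["A", "B", "C", "D", "E"] -> ["A", "C", "D", "E", "B"]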
Summary and Key Takeaways
Precision is the difference between an AI that works and an AI that is trusted.
- Re-ranking can significantly boost search accuracy (gains of roughly 10-20% are commonly reported) by applying deeper models to a small candidate set.
- Query Expansion solves the problem of "Vague User Input."
- HyDE uses the LLM's imagination to find actual documents.
- Context Management: Be concise. Don't drown the LLM in "Noise."
In the next lesson, we will look at Building a RAG system with LangChain or LlamaIndex, exploring the frameworks that automate these advanced patterns.
Exercise: Re-ranking Budget
- A Vector Search (Stage 1) costs $0.001.
- A Re-ranker (Stage 2) costs $0.01.
- LLM Tokens for 50 docs cost $0.50.
- LLM Tokens for 5 re-ranked docs cost $0.05.
Calculate the total cost of a query:
- Without Re-ranking (Stage 1 + 50 docs worth of tokens).
- With Re-ranking (Stage 1 + Stage 2 + 5 docs worth of tokens).
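If you want to verify your answer afterwards, here is a quick calculation sketch using the prices from the bullet list above:

# Optional self-check for the exercise.
vector_search = 0.001
reranker = 0.01
tokens_50_docs = 0.50
tokens_5_docs = 0.05

without_rerank = vector_search + tokens_50_docs
with_rerank = vector_search + reranker + tokens_5_docs
print(f"Without re-ranking: ${without_rerank:.3f}")
print(f"With re-ranking:    ${with_rerank:.3f}")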