Advanced RAG: Re-ranking, Context Pruning, and the Two-Stage Retrieval

Take your RAG system to production. Learn how to use Cross-Encoders for re-ranking and how to optimize context to save on LLM token costs.

Advanced RAG: The Search for Precision

In the previous lesson, we built a basic RAG pipeline. It works, but it has a problem: Noise. A vector search might find 10 documents that are "similar" to the query, but only 2 of them actually contain the answer. If you send all 10 to the LLM, you are:

  1. Wasting Money: Paying for unnecessary tokens.
  2. Confusing the Model: Increasing the risk that the model focuses on the wrong data.
  3. Increasing Latency: Larger prompts take longer to process.

In this lesson, we explore Advanced RAG. We will learn about Re-ranking, Context Pruning, and the Query Expansion techniques that separate a hobbyist demo from a production-grade AI agent.


1. The Two-Stage Retrieval Pattern

In production, we don't just "Search and Prompt"; we use two stages:

Stage 1: Retrieval (Recall)

We use a fast, cheap Bi-Encoder (like OpenAI embeddings) to pull 20-50 candidate documents out of a vector store that may hold billions of entries.

Stage 2: Re-ranking (Precision)

We use a slower, more expensive Cross-Encoder (like Cohere Rerank or BGE-Reranker) to evaluate the relationship between the query and each of those candidates. The Cross-Encoder outputs a relevance score for every document, and we keep only the Top 5 for the final prompt (implemented in the Python example below).

graph TD
    Q[Query] -->|Search| VDB[(Vector DB)]
    VDB -->|Top 50| R[Re-ranker Model]
    R -->|Final Top 5| LLM[Large Language Model]

2. Context Pruning: Less is More

Sometimes, even a single chunk has "Noise" (headers, footers, irrelevant sentences).

Context Pruning involves:

  1. Summarization: Using a smaller, cheaper LLM to summarize the retrieved snippets before sending them to the expensive main model.
  2. Key-Value Extraction: Only extracting specific entities (names, dates, prices) from the snippets.
  3. Sentence Filtering: Using semantic similarity to delete sentences within a chunk that don't relate to the query (see the sketch below).
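
A minimal sketch of the sentence-filtering idea (point 3) using the sentence-transformers library. The model name, the similarity threshold, and the naive sentence splitting are illustrative assumptions; adapt them to your own stack.

# Sketch: sentence-level context pruning (assumes the sentence-transformers package).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def prune_chunk(query, chunk, threshold=0.3):
    # Naive sentence split; use a proper sentence tokenizer in production.
    sentences = [s.strip() for s in chunk.split('.') if s.strip()]
    query_emb = model.encode(query, convert_to_tensor=True)
    sent_embs = model.encode(sentences, convert_to_tensor=True)
    # Cosine similarity between the query and every sentence in the chunk.
    scores = util.cos_sim(query_emb, sent_embs)[0]
    # Keep only sentences related to the query; drop the rest.
    kept = [s for s, score in zip(sentences, scores) if float(score) >= threshold]
    return '. '.join(kept) + '.' if kept else ''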

3. Query Expansion and Rewriting

Users are bad at asking questions. If a user asks "Tell me about that thing with the taxes last year," a raw vector search might struggle.

The "Query Transform" Chain:

  1. Take the user's messy query.
  2. Send it to an LLM to generate 3 better versions of the query.
    • "New tax laws 2024"
    • "Corporate tax amendments"
    • "Tax filing deadlines"
  3. Search the vector database for all 3 versions.
  4. Combine the results.

This is called Multi-Query Retrieval and significantly improves the odds of finding the right facts.
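
Here is a minimal sketch of that chain, assuming the official openai Python client and the same index.query() style vector store used later in this lesson; the model name and prompt wording are illustrative.

# Sketch: Multi-Query Retrieval (assumes the openai client and an index with a query() method).
from openai import OpenAI

client = OpenAI()

def multi_query_search(user_query, index, n_variants=3, n_results=10):
    # 1. Ask an LLM to rewrite the messy query into cleaner variants.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rewrite the following search query into {n_variants} short, "
                       f"specific search queries, one per line:\n{user_query}"
        }],
    )
    variants = [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

    # 2. Search the vector store with the original query plus every variant.
    seen, combined = set(), []
    for q in [user_query] + variants:
        for res in index.query(q, n_results=n_results):
            # 3. Combine the results, de-duplicating identical chunks.
            if res['text'] not in seen:
                seen.add(res['text'])
                combined.append(res)
    return combined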


4. HYDE: Hypothetical Document Embeddings

HYDE is a clever trick for RAG:

  1. Instead of searching with the query, you ask an LLM to write a fake answer to the query.
  2. You create an embedding of that fake answer.
  3. You search the database for "Real documents that look like this fake answer."

Why it works: The vector of a question often looks quite different from the vector of its answer, but the vector of a fake answer looks very similar to the vector of the real answer.
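
A minimal sketch of HYDE with the openai client. The model names are illustrative, and the index's search-by-vector method is a placeholder for whatever your vector store exposes.

# Sketch: Hypothetical Document Embeddings (search_by_vector is a placeholder method).
from openai import OpenAI

client = OpenAI()

def hyde_search(query, index, n_results=10):
    # 1. Ask the LLM to write a plausible (fake) answer to the query.
    fake_answer = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short, plausible passage that answers: {query}"}],
    ).choices[0].message.content

    # 2. Embed the fake answer instead of the raw question.
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=fake_answer,
    ).data[0].embedding

    # 3. Search for real documents whose vectors look like the fake answer.
    return index.search_by_vector(embedding, n_results=n_results)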


5. Python Example: Implementing Cohere Re-rank

Let's look at how to integrate a re-ranker into your search logic.

import cohere

co = cohere.Client('YOUR_API_KEY')

def advanced_rag_search(query, index):
    # 1. STAGE 1: Standard vector search against your existing vector store.
    # Deliberately over-fetch so the re-ranker has plenty of candidates to choose from.
    initial_results = index.query(query, n_results=50)
    docs = [res['text'] for res in initial_results]

    # 2. STAGE 2: Re-rank with Cohere's Cross-Encoder.
    # The model reads the query and each document together and scores their relevance.
    rerank_hits = co.rerank(
        query=query,
        documents=docs,
        top_n=5,
        model='rerank-english-v3.0'
    )

    # 3. Keep only the top-scoring documents; hit.index points back into the original docs list.
    final_context = [docs[hit.index] for hit in rerank_hits.results]
    return final_context

# Now you send final_context to your LLM prompt

6. The "Missing Middle" Problem

LLMs are better at using information placed at the beginning or the end of a prompt (the "lost in the middle" effect). If the most important fact is buried in Chunk 3 out of 5, the model might overlook it.

Advanced Tip: After re-ranking, organize your chunks so the #1 most relevant chunk is first, and the #2 most relevant is last. This "brackets" the information for the LLM.
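
A minimal sketch of that bracketing step. It assumes the chunks arrive sorted from most to least relevant, as final_context does after the re-ranking example above.

# Sketch: "bracket" the context so the two strongest chunks sit at the edges of the prompt.
def bracket_chunks(chunks):
    # Assumes chunks are sorted from most to least relevant.
    reordered = list(chunks)
    if len(reordered) >= 2:
        # Move the #2 most relevant chunk to the very end of the context.
        reordered.append(reordered.pop(1))
    return reordered

# Example: ["A", "B", "C", "D", "E"] -> ["A", "C", "D", "E", "B"]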


Summary and Key Takeaways

Precision is the difference between an AI that works and an AI that is trusted.

  1. Re-ranking applies deeper models to a small candidate set and often yields accuracy gains in the 10-20% range.
  2. Query Expansion solves the problem of "Vague User Input."
  3. HYDE uses the LLM's imagination to find actual documents.
  4. Context Management: Be concise. Don't drown the LLM in "Noise."

In the next lesson, we will look at Building a RAG system with LangChain or LlamaIndex, exploring the frameworks that automate these advanced patterns.


Exercise: Re-ranking Budget

  1. A Vector Search (Stage 1) costs $0.001.
  2. A Re-ranker (Stage 2) costs $0.01.
  3. LLM Tokens for 50 docs cost $0.50.
  4. LLM Tokens for 5 re-ranked docs cost $0.05.

Calculate the total cost of a query:

  • Without Re-ranking (Stage 1 + 50 docs worth of tokens).
  • With Re-ranking (Stage 1 + Stage 2 + 5 docs worth of tokens).

Why is re-ranking not just about accuracy, but also about Unit Economics?
