
The Quality Filter: Self-Reranking and Query Expansion
Master the techniques for high-precision retrieval. Learn how to use 'LLM-as-a-Judge' to re-rank search results and how to generate multiple search variations for better coverage.
In a production RAG system, your search tool might return 20 "Chunks" of information. Usually, only 2 or 3 of those are actually useful for answering the user's question. If you send all 20 chunks to the LLM, you are wasting tokens and increasing the risk of "Context Pollution."
In this lesson, we will learn two advanced patterns: Query Expansion (finding more stuff) and Self-Reranking (throwing away the junk).
1. Query Expansion (The Multi-Query Pattern)
A user's query is often "Semantically Thin"—it doesn't have enough specific keywords to trigger a good vector match.
The Expansion Loop
- User asks: "How do I fix the error with the database connection?"
- Expansion Node: An LLM generates 3 variations of the query:
  - "FastAPI PostgreSQL connection timeout fix"
  - "database connection pool exhausted error"
  - "psycopg2.OperationalError: could not connect to server"
- Execution: The agent runs all 3 searches in parallel.
- Result: You now have a much broader net of information (sketched in code below).
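Here is a minimal sketch of this loop in Python. The `llm_complete` and `vector_search` helpers are hypothetical placeholders; swap in your actual LLM client and vector store.

```python
from concurrent.futures import ThreadPoolExecutor

def llm_complete(prompt: str) -> str:
    """Placeholder: call your LLM of choice and return the raw completion text."""
    raise NotImplementedError

def vector_search(query: str, k: int = 10) -> list[str]:
    """Placeholder: query your vector store and return the top-k chunks."""
    raise NotImplementedError

def expand_query(user_query: str, n: int = 3) -> list[str]:
    # Ask the model for n rephrasings that add likely technical keywords.
    prompt = (
        f"Rewrite the following question as {n} specific search queries, "
        "one per line. Add likely error messages, library names, or "
        f"technical keywords:\n\n{user_query}"
    )
    lines = llm_complete(prompt).strip().splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()][:n]

def multi_query_search(user_query: str) -> list[str]:
    queries = expand_query(user_query)
    # Run all searches in parallel; each call returns a list of chunks.
    with ThreadPoolExecutor() as pool:
        results = pool.map(vector_search, queries)
    # Flatten and de-duplicate while preserving order.
    seen: set[str] = set()
    merged: list[str] = []
    for chunks in results:
        for chunk in chunks:
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged
```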
2. Self-Reranking (Filtering for Precision)
Vector databases rank results by "Similarity" (math), but your agent needs "Relevance" (meaning).
The "Reranker" Node
Once the search tool returns 20 results, we pass them through a specialized node (often using a Cross-Encoder model or a cheap LLM like GPT-4o-mini).
The Task: "Look at these 20 snippets. Assign a score from 1-10 on how likely they are to help answer the question: [User Query]. Delete everything with a score below 7."
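A minimal LLM-as-a-Judge sketch of that node, reusing the hypothetical `llm_complete` helper from the expansion example; the threshold of 7 mirrors the task prompt above.

```python
def rerank(question: str, snippets: list[str], threshold: int = 7) -> list[str]:
    """Score each snippet 1-10 for relevance and keep only the strong ones."""
    survivors = []
    for snippet in snippets:
        prompt = (
            "On a scale of 1-10, how likely is this snippet to help answer "
            "the question below? Reply with a single integer.\n\n"
            f"Question: {question}\nSnippet: {snippet}"
        )
        try:
            score = int(llm_complete(prompt).strip())
        except ValueError:
            score = 0  # Unparseable reply: treat the snippet as irrelevant.
        if score >= threshold:
            survivors.append(snippet)
    return survivors
```

A dedicated Cross-Encoder model scores each (query, snippet) pair in a single forward pass, which is usually faster and cheaper than prompting an LLM per snippet; the trade-off is that you cannot express custom instructions in the scoring criteria.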
3. Why Reranking is Mandatory for Accuracy
The model answering the question (The Writer) performs significantly better if it is only given high-density context.
- Without Reranking: 80% noise, 20% signal. The model gets distracted.
- With Reranking: 10% noise, 90% signal. The model is precise and confident.
4. Contextual Compression
A search chunk might be 500 words long, but only the 3rd sentence is relevant. Contextual Compression is the process where a "Compressor Node" reads the 500 words and returns only the 10 most relevant words to the main agent.
Advantage: You can fit 5x more "relevant facts" into the same token window.
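A Compressor Node can be a single extraction prompt. This sketch again assumes the hypothetical `llm_complete` helper:

```python
def compress_chunk(question: str, chunk: str) -> str:
    """Return only the sentences from `chunk` that bear on `question`."""
    prompt = (
        "Copy, verbatim, only the sentences from the passage that help "
        "answer the question. If nothing is relevant, reply NONE.\n\n"
        f"Question: {question}\nPassage: {chunk}"
    )
    extracted = llm_complete(prompt).strip()
    return "" if extracted == "NONE" else extracted
```

Asking for verbatim extraction (rather than a summary) matters: it prevents the compressor from paraphrasing facts into something subtly wrong before the Writer ever sees them.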
5. Implementations: The RAG-Fusion Pattern
RAG-Fusion is the combination of Multi-Query and Reciprocal Rank Fusion (RRF).
- Generate multiple queries.
- Search them all.
- Fuse the rankings with RRF: each document scores the sum of 1 / (k + rank) across every result list it appears in (k is a smoothing constant, conventionally 60).
- If a document appears in the top 3 results for all 3 queries, it is almost certainly the "Golden Information."
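RRF itself is a few lines of pure Python, and this example runs as-is:

```python
def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked result lists into one list, best first.

    Each document earns 1 / (k + rank) per list it appears in, so items
    ranked highly across many queries float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# "doc_a" ranks highly in all three lists, so it wins the fusion.
fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_a", "doc_c", "doc_d"],
    ["doc_b", "doc_a", "doc_e"],
])
print(fused[0][0])  # doc_a
```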
6. Implementation Strategy: LangGraph Flow
graph LR
  Input -->|Query Expansion| Q1[Query 1]
  Input -->|Query Expansion| Q2[Query 2]
  Input -->|Query Expansion| Q3[Query 3]
  Q1 --> Search[Search DB]
  Q2 --> Search
  Q3 --> Search
  Search -->|20 Results| Rerank[Reranker Node]
  Rerank -->|3 Best Results| Final[Writer Agent]
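A skeleton of that graph, assuming LangGraph's StateGraph API and reusing the hypothetical helpers (`expand_query`, `vector_search`, `rerank`, `llm_complete`) sketched above:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class RAGState(TypedDict):
    question: str
    queries: list[str]
    chunks: list[str]
    answer: str

def expand_node(state: RAGState) -> dict:
    return {"queries": expand_query(state["question"])}

def search_node(state: RAGState) -> dict:
    chunks: list[str] = []
    for query in state["queries"]:
        chunks.extend(vector_search(query))
    return {"chunks": chunks}

def rerank_node(state: RAGState) -> dict:
    return {"chunks": rerank(state["question"], state["chunks"])}

def writer_node(state: RAGState) -> dict:
    context = "\n\n".join(state["chunks"])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {state['question']}"
    return {"answer": llm_complete(prompt)}

builder = StateGraph(RAGState)
builder.add_node("expand", expand_node)
builder.add_node("search", search_node)
builder.add_node("rerank", rerank_node)
builder.add_node("write", writer_node)
builder.add_edge(START, "expand")
builder.add_edge("expand", "search")
builder.add_edge("search", "rerank")
builder.add_edge("rerank", "write")
builder.add_edge("write", END)
graph = builder.compile()

# result = graph.invoke({"question": "How do I fix the database connection error?"})
```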
Summary and Mental Model
Think of Query Expansion like asking three friends for a recommendation: you get different perspectives.
Think of Self-Reranking like reading the back covers of books before deciding which ones to check out: you don't read every word; you just determine whether each book is "in the right ballpark."
Precision in retrieval is the foundation of factual agency.
Exercise: Reranking Design
- The Scoring Logic: You have a search snippet about "Apple (Company)" and the user asked about "Apple (Fruit)".
  - How would a "Semantic Reranker" know the difference?
  - Draft a 1-sentence prompt for the Reranker to handle this specific ambiguity.
- Efficiency: Why is it cheaper to use a Small Model (like Llama 3 8B) for Reranking rather than GPT-4o?
- The Threshold: If your Reranker node deletes ALL the results (Score < 7 for everyone), what should the graph do next?
  - A) Give up.
  - B) Go back to the user and ask for clarification.
  - C) Try a completely different Search Tool (like Wikipedia).

Ready to explore different search types? Next lesson: Vector vs Graph vs Hybrid Search.