
The Retrieval Loop: Query Analysis to Injection
Master the final stage of RAG. Learn how to optimize user queries, retrieve the best context, and inject it into the prompt without causing 'hallucination drift' or token overflow.
RAG is a three-stage dance: understand what the user is really asking, find the facts, and then present those facts to the LLM in a way that doesn't confuse it.
In this lesson, we cover the advanced techniques for Query Understanding and Context Injection.
1. Query Understanding: Fixing the User's Question
Users are often brief or vague. If a user asks "How much?" after a long conversation about a travel plan, a vector search for "How much?" will return garbage results.
The Solution: Query Rewriting
Before searching the database, we use a cheap LLM call to rewrite the user's question into a standalone question based on the chat history, as sketched in code below. (A related technique, query decomposition, splits one complex question into several smaller search queries.)
- User: "How much?"
- Chat History: Discussing a 7-day trip to Hawaii in July.
- Rewritten Search Query: "What is the total cost of a 7-day trip to Hawaii in July including airfare and hotel?"
Pipeline: User Query → LLM: Rewrite Query → Vector Search (standalone query) → Retrieved Chunks
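Here is a minimal sketch of the rewrite step, assuming the OpenAI Python SDK (v1.x); the model name is just one inexpensive option, and the rewrite instructions are illustrative:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REWRITE_INSTRUCTIONS = (
    "Rewrite the user's latest message as a single standalone question "
    "that can be understood without the chat history. "
    "Return only the rewritten question."
)

def rewrite_query(chat_history: list[dict], user_message: str) -> str:
    # A small, cheap model is plenty for this step
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any inexpensive model works
        temperature=0,
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTIONS},
            *chat_history,
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content.strip()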
2. Advanced Retrieval: Beyond Top-K
Sometimes the top-1 result is not enough. You might need to retrieve 10-20 candidates and then use a specialized model to re-rank them.
Re-Ranking
Vector search engines are fast but "loose": they optimize for recall over precision. A re-ranker (like Cohere Rerank, or a local cross-encoder) is slower but far more accurate, because it scores each query-document pair jointly. The pattern: fetch 20 documents with a fast vector search, then use the re-ranker to pick the 3 that are actually relevant.
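A minimal sketch of fetch-then-rerank, here using a local cross-encoder from the sentence-transformers library rather than the Cohere API (the model name is one common public checkpoint, and vector_search is a placeholder for your own retrieval call):

from sentence_transformers import CrossEncoder

# Cross-encoders score each (query, document) pair jointly:
# slower than vector search, but much more precise
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list[str], top_n: int = 3) -> list[str]:
    # Score every candidate against the query
    scores = reranker.predict([(query, doc) for doc in documents])
    # Keep the top_n highest-scoring documents
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]

# Usage: fetch wide, then narrow
# candidates = vector_search(query, k=20)   # fast but loose
# top_chunks = rerank(query, candidates)    # slow but accurate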
3. Context Injection: Designing the Prompt
Now that you have your 3 text chunks, how do you give them to the LLM?
The "System Context" Pattern
You should never just paste the retrieved text into the user's chat bubble. Instead, inject it into the System Prompt as a "Knowledge Base," as wired up in the code after the template below.
Professional Template:
Role: You are a helpful assistant for ABC Corp.
Rules for Grounding:
1. Use ONLY the 'PROVIDED CONTEXT' below to answer the user's question.
2. If the answer is not in the context, say "I don't have that information."
3. Cite the source using [Source ID].
### PROVIDED CONTEXT ###
[ID 1]: {chunk_1_text}
[ID 2]: {chunk_2_text}
### END CONTEXT ###
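One way to wire this template into a chat-style API (the message schema follows the common OpenAI-style role/content format; build_messages is our own helper name):

SYSTEM_TEMPLATE = """Role: You are a helpful assistant for ABC Corp.
Rules for Grounding:
1. Use ONLY the 'PROVIDED CONTEXT' below to answer the user's question.
2. If the answer is not in the context, say "I don't have that information."
3. Cite the source using [Source ID].

### PROVIDED CONTEXT ###
{context}
### END CONTEXT ###"""

def build_messages(query: str, chunks: list[str]) -> list[dict]:
    # Label each chunk with an ID the model can cite
    context = "\n".join(f"[ID {i+1}]: {text}" for i, text in enumerate(chunks))
    return [
        {"role": "system", "content": SYSTEM_TEMPLATE.format(context=context)},
        {"role": "user", "content": query},  # the chat bubble stays clean
    ]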
4. Avoiding "Lost in the Middle"
Researchers have found (see "Lost in the Middle," Liu et al., 2023) that LLMs reliably use instructions and facts at the very beginning and very end of a prompt, but often under-use information buried in the middle.
LLM Engineer Strategy: Place your most important instructions (like "Don't hallucinate") at the very end of the prompt, right before the model starts generating. The same logic applies to retrieved chunks: keep the strongest hits at the edges of the context, as sketched below.
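One common mitigation, sketched here, is to reorder the retrieved chunks so the best-ranked ones sit at the edges of the context block and the weakest land in the middle (this assumes the chunks arrive sorted best-first, as most retrievers return them):

def reorder_for_edges(chunks: list[str]) -> list[str]:
    # Alternate chunks between the front and the back of the list,
    # so the best-ranked ones end up at the start and the end
    front, back = [], []
    for i, chunk in enumerate(chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# [c1, c2, c3, c4, c5] -> [c1, c3, c5, c4, c2]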
5. Handling Token Limits Gracefully
What if you retrieve 5 chunks of 1,000 tokens each, but your model's context window can't fit them all alongside the system prompt and question?
The "Map-Reduce" Strategy:
- Map: Ask the LLM to summarize each chunk individually.
- Reduce: Use those summaries as the final context for the user's answer (see the sketch below).
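A compact sketch of map-reduce context compression, written against a generic llm(prompt) -> str callable so it isn't tied to any particular SDK:

def map_reduce_context(llm, query: str, chunks: list[str]) -> str:
    # Map: compress each oversized chunk down to what matters for this query
    summaries = [
        llm(
            f"Summarize the following text in 2-3 sentences, keeping only "
            f"details relevant to the question '{query}':\n\n{chunk}"
        )
        for chunk in chunks
    ]
    # Reduce: the short summaries become the final, window-friendly context
    return "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(summaries))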
Code Concept: Context Assembly with Python
def assemble_prompt(query, chunks):
    # Format chunks into a numbered list the model can cite;
    # each chunk is assumed to expose a .text attribute
    context_block = "\n".join(f"[{i + 1}] {c.text}" for i, c in enumerate(chunks))
    # Construct the final prompt
    prompt = f"""Answer the question using the context below.
If unsure, say you don't know.

### CONTEXT ###
{context_block}

### QUESTION ###
{query}

Answer:"""
    return prompt
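A quick usage check (SimpleNamespace stands in for whatever chunk objects your retriever actually returns):

from types import SimpleNamespace

# Toy chunks for illustration only
chunks = [
    SimpleNamespace(text="Round-trip airfare to Hawaii in July averages $800."),
    SimpleNamespace(text="Mid-range Honolulu hotels run about $250 per night."),
]
print(assemble_prompt("How much does a trip to Hawaii in July cost?", chunks))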
Summary of Module 5
- Concept: RAG is an open-book exam for AI (5.1).
- Ingestion: Clean, parse, and chunk your data with metadata (5.2).
- Storage: Use Vector DBs for semantic similarity (5.3).
- Retrieval: Rewrite queries and inject context into structured system prompts (5.4).
You have now mastered the most powerful way to ground AI in reality. In the next module, we will explore Fine-Tuning, the alternative (and complementary) strategy for specializing a model's behavior.
Exercise: The Hallucination Hunt
You build a RAG system and a user asks: "What is Bob's phone number?" The system returns a chunk about Bob's address and his email, but NOT his phone number. The LLM then replies: "Bob's phone number is 555-0199."
Identify the error:
- Did retrieval fail?
- Did the prompt fail?
- How would you adjust the prompt to stop this hallucination?
Hint: Look at Rule #2 in the context injection pattern above. The model ignored the instruction to stay grounded in the provided context!