The RAG Pipeline: Chunking, Embedding, and Prompting

Master the step-by-step workflow of RAG. Learn how to optimize document chunking, choose embedding models, and craft prompts that prevent LLM hallucinations.

The RAG Pipeline: From Raw Data to Accurate Answers

Building a RAG (Retrieval-Augmented Generation) system is easy. Building a good RAG system is a multi-step engineering challenge. Every link in the chain—how you split the text, which model you use, and how you format the final instructions—affects the quality of the answer.

In this lesson, we break down the three core phases of the RAG pipeline and learn the "Production Levers" you can pull to increase accuracy and reduce costs.


1. Phase 1: Chunking (The Foundation)

LLMs have a Context Window (the amount of text they can process at once). You cannot send a 1,000-page book as context. You must break the book into Chunks.

The Three Chunking Strategies:

  1. Fixed-size: Split every 500 characters. Fast, but it can cut sentences (and even words) in half.
  2. Recursive / Semantic: Split by paragraphs first, then by sentences. Preserves the meaning of each chunk.
  3. Overlapping: Each chunk repeats roughly 10% of the previous chunk. Ensures context isn't lost at chunk boundaries.

Rule of Thumb: For most business docs, use 500-1000 tokens with a 10% overlap.
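
To make the overlap concrete, here is a minimal sketch of a character-based splitter with overlap (the function name and the 1000/100 values are illustrative, not tuned):

def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into fixed-size chunks, where each chunk repeats
    the last `overlap` characters of the previous chunk."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# 1000-character chunks with a 100-character (10%) overlap
chunks = chunk_text("Your long HR handbook text...", chunk_size=1000, overlap=100)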


2. Phase 2: Embedding (The Search)

Once you have your chunks, you convert them to vectors. As we have seen throughout this course, the choice of Embedding Model defines your search quality.

  • Ingestion: Embed every chunk and store the vectors in a vector database such as Pinecone or Chroma.
  • Retrieval: When the user asks a question, embed the question with the same model and look up the nearest chunk vectors.

The flow, as a Mermaid diagram:

graph TD
    DOC[Document] --> CH[Chunker]
    CH --> |Text| EMB[Embedding Model]
    EMB --> |Vector| VDB[(Vector DB)]
    U[Query] --> EMB
    VDB -.-> |Similarity| SN[Snippets]
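
As a minimal sketch of the "same model on both sides" rule, here is what embedding chunks and a query with one model looks like using the sentence-transformers library (the model name and example texts are just illustrations):

from sentence_transformers import SentenceTransformer, util

# The SAME model must embed both the stored chunks and the incoming query
model = SentenceTransformer("all-MiniLM-L6-v2")

chunk_vectors = model.encode([
    "Employees receive 25 vacation days per year.",
    "Expense reports are due by the 5th of each month.",
])
query_vector = model.encode("How many vacation days do I get?")

# Cosine similarity between the query and every stored chunk
scores = util.cos_sim(query_vector, chunk_vectors)
print(scores)  # the higher score points to the vacation-policy chunk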

3. Phase 3: Prompting (The Generation)

This is where the retrieved "Snippets" are turned into a conversation. A production RAG prompt has a specific structure:

  1. The Role: "You are a specialized support assistant."
  2. The Constraints: "Use ONLY the context provided. If the answer is not there, say you don't know."
  3. The Data: "Here are the facts: {retrieved_chunks}"
  4. The Task: "Answer this question: {user_query}"
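
Put together as a template, the four parts might look like this minimal sketch (the wording and the example values are illustrative):

# The four parts: role, constraints, data, task
RAG_PROMPT = """You are a specialized support assistant.
Use ONLY the context provided. If the answer is not in the context, say you don't know.

CONTEXT:
{retrieved_chunks}

QUESTION: {user_query}

ANSWER:"""

prompt = RAG_PROMPT.format(
    retrieved_chunks="Employees receive 25 vacation days per year.",
    user_query="How many vacation days do I get?",
)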

4. The "Sliding Window" Problem

If you retrieve the top 5 chunks, but the answer depends on information from both Chunk 1 and Chunk 10, one of those pieces may never reach the LLM, and it misses the connection.

Production Tip: "Parent-Document Retrieval"

  1. You store small chunks (for fast, accurate search) in the vector database.
  2. The metadata contains a reference to the "Parent Doc ID."
  3. When you find a small chunk, you retrieve the larger surrounding paragraph to give the LLM better context.
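
A minimal sketch of that pattern using Chroma metadata (the collection name, IDs, and texts are made up for illustration):

import chromadb

client = chromadb.Client()
collection = client.create_collection("contract_small_chunks")

# Full parent paragraphs live in an ordinary doc store (a dict here)
parent_docs = {"doc_42": "The full surrounding paragraph of the contract..."}

# Small chunks go into the vector DB, each tagged with its parent's ID
collection.add(
    documents=["The fee is $500.", "...except in cases where Section 4 applies."],
    metadatas=[{"parent_id": "doc_42"}, {"parent_id": "doc_42"}],
    ids=["chunk_1", "chunk_2"],
)

# Search over the small chunks, but hand the LLM the larger parent text
results = collection.query(query_texts=["What is the fee?"], n_results=1)
parent_id = results["metadatas"][0][0]["parent_id"]
context_for_llm = parent_docs[parent_id]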

5. Python Example: A Professional RAG Implementation

Let's walk through a conceptual RAG flow that covers chunking, ingestion, retrieval, and prompting.

import chromadb

# 1. SETUP
client = chromadb.Client()
# Chroma embeds both documents and queries with its built-in default embedding model
collection = client.create_collection("hr_docs")

# 2. CHUNKING & INGESTION
raw_text = "Your 5000 word document..."
chunks = [raw_text[i:i+1000] for i in range(0, len(raw_text), 900)]  # 1000-char chunks, 100-char (10%) overlap

collection.add(
    documents=chunks,
    ids=[f"chunk_{i}" for i in range(len(chunks))]
)

# 3. THE RAG FUNCTION
def ask_ai(question):
    # Retrieve
    results = collection.query(query_texts=[question], n_results=3)
    context = "\n---\n".join(results['documents'][0])
    
    # Prompt (Augment)
    prompt = f"""
    Answer the user question based on the context snippets.
    
    CONTEXT:
    {context}
    
    QUESTION: {question}
    
    ANSWER:"""
    
    # Call your LLM here
    # response = llm_client.complete(prompt)
    return prompt # Returning prompt for demonstration

print(ask_ai("How many vacation days do I get?"))
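
To turn the returned prompt into an actual answer, pass it to whatever LLM client you use. As one possibility, with the OpenAI Python SDK the call might look roughly like this (the model name and client setup are assumptions, not part of the pipeline above):

from openai import OpenAI

llm_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def complete(prompt):
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model will do
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # keep the answer grounded in the provided context
    )
    return response.choices[0].message.content

print(complete(ask_ai("How many vacation days do I get?")))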

6. Evaluating Quality (The RAG Triad)

How do you know if your pipeline is working? Look at Three Metrics:

  1. Faithfulness: Is the answer derived only from the context?
  2. Answer Relevance: Does the answer actually address the user's question?
  3. Context Precision: Were the retrieved chunks actually relevant to the question?
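
A lightweight way to spot-check Faithfulness is to let a second LLM act as a judge. Here is a minimal sketch of such a judge prompt (the wording and the 1-5 scale are illustrative, not a standard); open-source evaluators such as Ragas package the full triad if you want ready-made metrics.

FAITHFULNESS_JUDGE = """You are grading a RAG answer.

CONTEXT:
{context}

ANSWER:
{answer}

Is every claim in the ANSWER supported by the CONTEXT?
Reply with a score from 1 (unsupported) to 5 (fully supported) and one sentence of justification."""

judge_prompt = FAITHFULNESS_JUDGE.format(
    context="Employees receive 25 vacation days per year.",
    answer="You get 25 vacation days per year.",
)
# Send judge_prompt to an LLM and track the scores across a test set of questions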

Summary and Key Takeaways

The RAG pipeline is a chain where every link matters.

  1. Chunking must balance length and meaning (use overlap!).
  2. Embedding must be consistent across ingestion and search.
  3. Prompting must strictly constrain the LLM to the provided facts.
  4. Context Window management is the key to preventing "Prompt Stuffing" errors.

In the next lesson, we will look at Advanced RAG, exploring Re-ranking and Context Pruning to make your answers even more precise and cost-effective.


Exercise: Pipeline Design

You are building a RAG bot for long legal contracts. One paragraph might say "The fee is $500" and another might say "...except in cases where Section 4 applies."
  • Would you use a small chunk size (100 words) or a large one (1000 words)?
  • Why is Overlap critical here?
  • If the RAG bot answers "$500" but misses the Section 4 exception, is that a Retrieval failure or a Generation failure?

Congratulations on completing Module 10 Lesson 2! You're building real AI systems.
