
Latency Bottlenecks in RAG
Identify and eliminate the slow points in your multimodal RAG pipeline to ensure a snappy user experience.
A slow RAG system feels like a broken RAG system. If it takes 20 seconds to answer a question, users will stop using it. Performance optimization is the art of shaving milliseconds off every step.
Where is the Time Spent?
1. The Preprocessing Tax (1s - 5s+)
- OCR: Typically the slowest step in a multimodal pipeline.
- Video/Audio Slicing: Spawning FFmpeg processes adds per-file overhead.
- Solution: Perform these steps asynchronously during ingestion and cache the results (see the sketch below).
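A minimal sketch of that pattern in Python, assuming a hypothetical run_ocr function and a simple content-hash cache on disk; substitute your own OCR engine and cache store:

```python
import asyncio
import hashlib
from pathlib import Path

CACHE_DIR = Path("ocr_cache")  # hypothetical on-disk cache location
CACHE_DIR.mkdir(exist_ok=True)

def run_ocr(path: Path) -> str:
    # Placeholder: swap in Tesseract, PaddleOCR, or a vision-model call.
    raise NotImplementedError

async def ocr_with_cache(path: Path) -> str:
    # Key the cache on file contents so re-ingesting unchanged files is free.
    key = hashlib.sha256(path.read_bytes()).hexdigest()
    entry = CACHE_DIR / f"{key}.txt"
    if entry.exists():
        return entry.read_text()
    # Run the slow OCR call in a worker thread so it doesn't block the loop.
    text = await asyncio.to_thread(run_ocr, path)
    entry.write_text(text)
    return text

async def ingest(paths: list[Path]) -> list[str]:
    # Pay the preprocessing tax once, at ingestion time, not per query.
    return await asyncio.gather(*(ocr_with_cache(p) for p in paths))
```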
2. Retrieval & Embedding (100ms - 500ms)
- Encoding the query adds a fixed cost to every single request.
- Solution: Use smaller, faster embedding models (Module 11.6); see the timing sketch below.
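For example, a quick latency comparison between a larger and a smaller model, assuming the sentence-transformers library; the model names are common examples, not a recommendation:

```python
import time
from sentence_transformers import SentenceTransformer

# Compare query-embedding latency of a larger vs. a smaller model.
for name in ["all-mpnet-base-v2", "all-MiniLM-L6-v2"]:
    model = SentenceTransformer(name)
    model.encode("warm-up query")  # exclude one-time model-load cost
    start = time.perf_counter()
    model.encode("What is our refund policy for video courses?")
    print(f"{name}: {(time.perf_counter() - start) * 1000:.1f} ms")
```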
3. Generation (2s - 15s)
- This is usually the largest bottleneck at query time.
- Solution: Use streaming responses so the user sees the first tokens immediately (see the sketch below).
Measuring TTFT (Time to First Token)
In RAG, TTFT is the metric that defines perceived responsiveness: the delay between the user submitting a query and the first token appearing on screen.
TTFT = Preprocessing + Embedding + Retrieval + Initial Generation Overhead
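A simple way to instrument this, assuming hypothetical embed_query, vector_search, and stream_answer functions standing in for your pipeline's real stages:

```python
import time

def measure_ttft(query: str) -> dict[str, float]:
    # embed_query, vector_search, and stream_answer are placeholders for
    # your pipeline's own functions; timings are wall-clock milliseconds.
    timings = {}

    t0 = time.perf_counter()
    vector = embed_query(query)
    timings["embedding_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    chunks = vector_search(vector, top_k=5)
    timings["retrieval_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    stream = stream_answer(query, chunks)
    next(stream)  # block until the first token arrives
    timings["generation_overhead_ms"] = (time.perf_counter() - t0) * 1000

    timings["ttft_total_ms"] = sum(timings.values())
    return timings
```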
Optimization Checklist
- Parallelize Retrieval: Search multiple collections at the same time (first sketch after this list).
- Vector Index Optimization: Tune HNSW parameters (like ef_search) to trade accuracy for speed (second sketch after this list).
- CDN / Edge Computing: Place your vector database close to your users.
- Prompt Caching: Reduces the "Context Processing" time for the LLM by reusing the already-processed static prompt prefix.
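First, a sketch of parallelized retrieval, assuming an async vector-store client whose search coroutine resembles the Qdrant async API; adapt the call to your own store:

```python
import asyncio

async def search_all(client, query_vector, collections, top_k=5):
    # Query every collection concurrently; total latency is roughly the
    # slowest single search rather than the sum of all of them.
    batches = await asyncio.gather(
        *(client.search(collection_name=c, query_vector=query_vector, limit=top_k)
          for c in collections)
    )
    hits = [hit for batch in batches for hit in batch]
    # Keep the globally best hits across collections.
    return sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]
```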
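Second, a small ef_search sweep with hnswlib to see the speed/accuracy trade-off directly; the random vectors here are purely for timing illustration:

```python
import time
import numpy as np
import hnswlib

dim, n = 384, 100_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data)

query = np.random.rand(1, dim).astype(np.float32)
for ef in (16, 64, 256):
    index.set_ef(ef)  # higher ef_search = better recall, higher latency
    t0 = time.perf_counter()
    labels, distances = index.knn_query(query, k=10)
    print(f"ef_search={ef}: {(time.perf_counter() - t0) * 1000:.2f} ms")
```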
Visualizing the Trace
```mermaid
gantt
    title RAG Latency Breakdown
    section Retrieval
    Query Embedding  :a1, 2026-01-01, 200ms
    Vector Search    :a2, after a1, 50ms
    section Generation
    Context Assembly :b1, after a2, 100ms
    LLM TTFT         :b2, after b1, 800ms
    Full Streaming   :b3, after b2, 4s
```
Exercises
- Measure the TTFT of your current RAG pipeline. Which step is the slowest?
- If you remove the "Re-ranking" step, how much faster is the query?
- Why does "Model Quantization" improve latency?