
Latency Bottlenecks in RAG
Identify and eliminate the slow points in your multimodal RAG pipeline to ensure a snappy user experience.
A slow RAG system feels like a broken RAG system. If it takes 20 seconds to answer a question, users will stop using it. Performance optimization is the art of shaving milliseconds off every step.
Where is the Time Spent?
1. The Preprocessing Tax (1s - 5s+)
- OCR: Typically the slowest step in a multimodal pipeline.
- Video/Audio Slicing: Spawning FFmpeg processes adds per-file overhead.
- Solution: Perform these steps asynchronously during ingestion and cache the results (see the sketch below).
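A minimal sketch of that pattern in Python, assuming a hypothetical run_ocr function and a simple content-hash cache on disk; substitute your own OCR engine and cache store:

```python
import asyncio
import hashlib
from pathlib import Path

CACHE_DIR = Path("ocr_cache")  # hypothetical on-disk cache location
CACHE_DIR.mkdir(exist_ok=True)

def run_ocr(path: Path) -> str:
    # Placeholder: swap in Tesseract, PaddleOCR, or a vision-model call.
    raise NotImplementedError

async def ocr_with_cache(path: Path) -> str:
    # Key the cache on file contents so re-ingesting unchanged files is free.
    key = hashlib.sha256(path.read_bytes()).hexdigest()
    entry = CACHE_DIR / f"{key}.txt"
    if entry.exists():
        return entry.read_text()
    # Run the slow OCR call in a worker thread so it doesn't block the loop.
    text = await asyncio.to_thread(run_ocr, path)
    entry.write_text(text)
    return text

async def ingest(paths: list[Path]) -> list[str]:
    # Pay the preprocessing tax once, at ingestion time, not per query.
    return await asyncio.gather(*(ocr_with_cache(p) for p in paths))
```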
2. Retrieval & Embedding (100ms - 500ms)
- Encoding the query adds a fixed cost to every single request.
- Solution: Use smaller, faster embedding models (Module 11.6); see the timing sketch below.
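For example, a quick latency comparison between a larger and a smaller model, assuming the sentence-transformers library; the model names are common examples, not a recommendation:

```python
import time
from sentence_transformers import SentenceTransformer

# Compare query-embedding latency of a larger vs. a smaller model.
for name in ["all-mpnet-base-v2", "all-MiniLM-L6-v2"]:
    model = SentenceTransformer(name)
    model.encode("warm-up query")  # exclude one-time model-load cost
    start = time.perf_counter()
    model.encode("What is our refund policy for video courses?")
    print(f"{name}: {(time.perf_counter() - start) * 1000:.1f} ms")
```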
3. Generation (2s - 15s)
- This is usually the largest bottleneck at query time.
- Solution: Use streaming responses so the user sees the first tokens immediately (see the sketch below).
Measuring TTFT (Time to First Token)
In RAG, TTFT is the metric that defines perceived responsiveness: the delay between the user submitting a query and the first token appearing on screen.
TTFT = Preprocessing + Embedding + Retrieval + Initial Generation Overhead
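A simple way to instrument this, assuming hypothetical embed_query, vector_search, and stream_answer functions standing in for your pipeline's real stages:

```python
import time

def measure_ttft(query: str) -> dict[str, float]:
    # embed_query, vector_search, and stream_answer are placeholders for
    # your pipeline's own functions; timings are wall-clock milliseconds.
    timings = {}

    t0 = time.perf_counter()
    vector = embed_query(query)
    timings["embedding_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    chunks = vector_search(vector, top_k=5)
    timings["retrieval_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    stream = stream_answer(query, chunks)
    next(stream)  # block until the first token arrives
    timings["generation_overhead_ms"] = (time.perf_counter() - t0) * 1000

    timings["ttft_total_ms"] = sum(timings.values())
    return timings
```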
Optimization Checklist
- Parallelize Retrieval: Search multiple collections at the same time (first sketch after this list).
- Vector Index Optimization: Tune HNSW parameters (like ef_search) to trade accuracy for speed (second sketch after this list).
- CDN / Edge Computing: Place your vector database close to your users.
- Prompt Caching: Reduces the "Context Processing" time for the LLM by reusing the already-processed static prompt prefix.
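First, a sketch of parallelized retrieval, assuming an async vector-store client whose search coroutine resembles the Qdrant async API; adapt the call to your own store:

```python
import asyncio

async def search_all(client, query_vector, collections, top_k=5):
    # Query every collection concurrently; total latency is roughly the
    # slowest single search rather than the sum of all of them.
    batches = await asyncio.gather(
        *(client.search(collection_name=c, query_vector=query_vector, limit=top_k)
          for c in collections)
    )
    hits = [hit for batch in batches for hit in batch]
    # Keep the globally best hits across collections.
    return sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]
```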
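Second, a small ef_search sweep with hnswlib to see the speed/accuracy trade-off directly; the random vectors here are purely for timing illustration:

```python
import time
import numpy as np
import hnswlib

dim, n = 384, 100_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data)

query = np.random.rand(1, dim).astype(np.float32)
for ef in (16, 64, 256):
    index.set_ef(ef)  # higher ef_search = better recall, higher latency
    t0 = time.perf_counter()
    labels, distances = index.knn_query(query, k=10)
    print(f"ef_search={ef}: {(time.perf_counter() - t0) * 1000:.2f} ms")
```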
Visualizing the Trace
```mermaid
gantt
    title RAG Latency Breakdown
    section Retrieval
    Query Embedding  :a1, 2026-01-01, 200ms
    Vector Search    :a2, after a1, 50ms
    section Generation
    Context Assembly :b1, after a2, 100ms
    LLM TTFT         :b2, after b1, 800ms
    Full Streaming   :b3, after b2, 4s
```
Exercises
- Measure the TTFT of your current RAG pipeline. Which step is the slowest?
- If you remove the "Re-ranking" step, how much faster is the query?
- Why does "Model Quantization" improve latency?