
Low-Latency Scaling: Speed as a Feature
Master the techniques of high-speed AI responses. Learn about KV Caching, Speculative Decoding, and load balancing across multi-GPU clusters to reduce Time-To-First-Token (TTFT).
In the world of LLMs, Latency is the number one enemy of user experience. If a user has to wait 3 seconds before the first word appears on the screen, they will perceive the app as "slow" and "buggy."
As an LLM Engineer, your goal is to minimize Time-to-First-Token (TTFT). In this lesson, we will cover the advanced architectural tricks used to scale AI without sacrificing speed.
1. The Two Metrics of Speed
- TTFT (Time-to-First-Token): How long until the model starts "talking." This depends mostly on the prompt length and how fast the server can process (prefill) that prompt.
- TPS (Tokens Per Second): How fast the model keeps talking once it has started. This depends on the model's size and the GPU's memory bandwidth. A quick way to measure both is sketched below.
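Here is a minimal, hedged sketch of how you might measure both numbers yourself, assuming an OpenAI-compatible streaming endpoint (for example a local vLLM server); the base URL and model name are placeholders:

```python
import time
from openai import OpenAI  # assumes an OpenAI-compatible endpoint (e.g. a local vLLM server)

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder endpoint

start = time.perf_counter()
first_token_at = None
n_tokens = 0

# Stream the response so we can observe exactly when the first chunk arrives.
stream = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if not delta:
        continue
    if first_token_at is None:
        first_token_at = time.perf_counter()  # first visible token -> TTFT
    n_tokens += 1  # rough count: one streamed chunk is approximately one token

end = time.perf_counter()
ttft = first_token_at - start
tps = n_tokens / (end - first_token_at) if n_tokens else 0.0
print(f"TTFT: {ttft:.2f}s, ~{tps:.1f} tokens/sec after the first token")
```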
2. KV Caching: Don't Recompute the Past
Every time an LLM picks the next token, it has to attend to all previous tokens. With a 1,000-token context, the model would recompute the Key and Value projections for all 1,000 tokens every single time it adds a new one.
KV Caching (Key-Value Caching) stores those intermediate results (the Key and Value tensors) in GPU memory.
- Without the cache: Compute tokens 1-1000 $\rightarrow$ get token 1001. Compute tokens 1-1001 $\rightarrow$ get token 1002.
- With KV Cache: Load the cached results for tokens 1-1000 from memory $\rightarrow$ get token 1001. Append token 1001 to the cache $\rightarrow$ get token 1002.
Engineer's Note: KV Caching is the reason why LLMs can handle long conversations, but it is also why they consume so much VRAM.
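To make this concrete, here is a minimal sketch using Hugging Face transformers, assuming a small model like gpt2 purely to show the mechanics: the prompt is processed once ("prefill"), and every later step feeds only the newest token plus the cached Key/Value tensors.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small model just to illustrate the mechanics (assumption: gpt2 is enough for the demo).
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Low latency serving means"
input_ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: process the whole prompt once and keep the Key/Value tensors.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # Decode: each step feeds ONLY the newest token plus the cache,
    # instead of re-running attention math over the entire prefix.
    generated = [next_id]
    for _ in range(20):
        out = model(generated[-1], use_cache=True, past_key_values=past)
        past = out.past_key_values
        generated.append(out.logits[:, -1].argmax(dim=-1, keepdim=True))

print(tok.decode(torch.cat([input_ids] + generated, dim=-1)[0]))
```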
3. Speculative Decoding: The "Predictor" Hack
This is a brilliant technique for speeding up generation. You use a tiny, lightning-fast "Draft Model" to guess the next 5 tokens, then hand those 5 tokens to the "Big Model," which verifies all of them in a single forward pass.
- If the Big Model agrees with all 5 guesses, you just generated 5 tokens for roughly the cost of 1.
- If it disagrees at some position, you keep the tokens accepted so far and fall back to the Big Model's own token at the first mismatch.
```mermaid
graph LR
A[Small Model: Guess 5 tokens] --> B[Big Model: Verify 5 tokens]
B -- Valid --> C[Display all 5 instantly]
B -- Invalid --> D[Generate 1 slow token]
```
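Below is a toy sketch of the same loop with two Hugging Face models, assuming distilgpt2 as the draft and gpt2-large as the target purely for illustration. It uses simple greedy matching rather than the full rejection-sampling algorithm from the speculative decoding papers, but the draft-then-verify structure is the same.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: distilgpt2 (draft) and gpt2-large (target) share the gpt2 tokenizer.
tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()

@torch.no_grad()
def speculative_step(input_ids, k=5):
    # 1) Draft model cheaply guesses the next k tokens (greedy, for simplicity).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft(draft_ids).logits[:, -1]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
    guesses = draft_ids[:, input_ids.shape[1]:]

    # 2) Target model verifies all k guesses in ONE forward pass.
    logits = target(draft_ids).logits
    # The target's own prediction at each drafted position:
    verify = logits[:, input_ids.shape[1] - 1 : -1].argmax(-1)

    # 3) Keep the longest prefix where draft and target agree, then append
    #    the target's token at the first disagreement so we always advance.
    match = (guesses == verify)[0]
    n_ok = int(match.long().cumprod(0).sum())
    accepted = guesses[:, :n_ok]
    correction = verify[:, n_ok:n_ok + 1]
    return torch.cat([input_ids, accepted, correction], dim=-1)

ids = tok("The quickest way to cut latency is", return_tensors="pt").input_ids
for _ in range(4):
    ids = speculative_step(ids)
print(tok.decode(ids[0]))
```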
4. Multi-GPU Orchestration
When a model is too big for a single GPU (like Llama 3 70B), we must split it across multiple cards.
- Data Parallelism (DP): Each GPU holds a full copy of the model. (Used to serve many concurrent users with smaller models.)
- Tensor Parallelism (TP): Each GPU holds a slice of every layer. (Used to reduce latency for a single large request; see the sketch after this list.)
- Pipeline Parallelism (PP): GPU 1 holds layers 1-10, GPU 2 holds layers 11-20, and so on. (Used for massive models that don't fit any other way.)
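In practice you rarely wire up parallelism by hand; serving frameworks expose it as a configuration knob. Here is a minimal sketch with vLLM, assuming a single node with 4 GPUs and access to the Llama 3 70B weights (the model name and GPU count are illustrative):

```python
from vllm import LLM, SamplingParams

# Tensor parallelism: shard every layer across 4 GPUs on one node.
# (Assumes 4 visible GPUs and access to the gated Llama 3 70B weights.)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Summarize KV caching in two sentences."], params)
print(outputs[0].outputs[0].text)
```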
5. Load Balancing: The Global Router
In a global production app, you don't send all users to one server in Virginia. You use a Load Balancer to route them to the nearest available GPU cluster.
The "Smart Rotation" Strategy: Avoid sending a "Long Context" request (like a 50-page PDF summary) to a server that is currently handling 50 "Short Chat" requests. This prevents "Head-of-Line Blocking," where one slow user makes everyone else wait.
Summary
- TTFT is the most important metric for user perception.
- KV Caching is essential for multi-turn conversations but costs VRAM.
- Speculative Decoding uses small models to speed up large models.
- Parallelism (Tensor/Pipeline) is how we split massive models across multiple GPUs.
In the next lesson, we conclude Module 8 with Continuous Benchmarking, learning how to measure these speeds in real-time.
Exercise: Latency Math
A user sends a 2,000-token prompt.
- The model's "Prompt Processing" speed is 500 tokens/sec.
- The model's "Generation" speed is 20 tokens/sec.
Calculate:
- How many seconds will it take to get the First Token? (TTFT)
- If the answer is 100 words (approx. 150 tokens), how much longer will it take to finish the whole response?
Answer:
- TTFT: $2,000 / 500 = 4.0$ seconds.
- Generation Time: $150 / 20 = 7.5$ seconds. Total: $4.0 + 7.5 = 11.5$ seconds.
If you want to reduce the TTFT, you'd need to upgrade your "Prompt Processing" hardware or use Prompt Caching!
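If you want to sanity-check the arithmetic in code, it is just two divisions:

```python
prompt_tokens = 2_000
prefill_speed = 500     # tokens/sec ("Prompt Processing")
decode_speed = 20       # tokens/sec ("Generation")
answer_tokens = 150

ttft = prompt_tokens / prefill_speed             # 4.0 seconds
generation_time = answer_tokens / decode_speed   # 7.5 seconds
print(f"TTFT: {ttft:.1f}s, total: {ttft + generation_time:.1f}s")  # 11.5s total
```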