
Low-Latency Scaling: Speed as a Feature
Master the techniques of high-speed AI responses. Learn about KV Caching, Speculative Decoding, and load balancing across multi-GPU clusters to reduce Time-To-First-Token (TTFT).
In the world of LLMs, Latency is the number one enemy of user experience. If a user has to wait 3 seconds before the first word appears on the screen, they will perceive the app as "slow" and "buggy."
As an LLM Engineer, your goal is to minimize Time-to-First-Token (TTFT). In this lesson, we will cover the advanced architectural tricks used to scale AI without sacrificing speed.
1. The Two Metrics of Speed
- TTFT (Time-to-First-Token): How long until the model starts "talking." This depends mostly on the prompt length and how fast the server can process (prefill) that prompt.
- TPS (Tokens Per Second): How fast the model keeps talking once it has started. This depends on the model's size and the GPU's memory bandwidth. A quick way to measure both is sketched below.
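Here is a minimal, hedged sketch of how you might measure both numbers yourself, assuming an OpenAI-compatible streaming endpoint (for example a local vLLM server); the base URL and model name are placeholders:

```python
import time
from openai import OpenAI  # assumes an OpenAI-compatible endpoint (e.g. a local vLLM server)

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder endpoint

start = time.perf_counter()
first_token_at = None
n_tokens = 0

# Stream the response so we can observe exactly when the first chunk arrives.
stream = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if not delta:
        continue
    if first_token_at is None:
        first_token_at = time.perf_counter()  # first visible token -> TTFT
    n_tokens += 1  # rough count: one streamed chunk is approximately one token

end = time.perf_counter()
ttft = first_token_at - start
tps = n_tokens / (end - first_token_at) if n_tokens else 0.0
print(f"TTFT: {ttft:.2f}s, ~{tps:.1f} tokens/sec after the first token")
```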
2. KV Caching: Don't Recompute the Past
Every time an LLM picks the next token, it has to attend to all previous tokens. With a 1,000-token context, the model would recompute the Key and Value projections for all 1,000 tokens every single time it adds a new one.
KV Caching (Key-Value Caching) stores those intermediate results (the Key and Value tensors) in GPU memory.
- Without the cache: Compute tokens 1-1000 $\rightarrow$ get token 1001. Compute tokens 1-1001 $\rightarrow$ get token 1002.
- With KV Cache: Load the cached results for tokens 1-1000 from memory $\rightarrow$ get token 1001. Append token 1001 to the cache $\rightarrow$ get token 1002.
Engineer's Note: KV Caching is the reason why LLMs can handle long conversations, but it is also why they consume so much VRAM.
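To make this concrete, here is a minimal sketch using Hugging Face transformers, assuming a small model like gpt2 purely to show the mechanics: the prompt is processed once ("prefill"), and every later step feeds only the newest token plus the cached Key/Value tensors.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small model just to illustrate the mechanics (assumption: gpt2 is enough for the demo).
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Low latency serving means"
input_ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: process the whole prompt once and keep the Key/Value tensors.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # Decode: each step feeds ONLY the newest token plus the cache,
    # instead of re-running attention math over the entire prefix.
    generated = [next_id]
    for _ in range(20):
        out = model(generated[-1], use_cache=True, past_key_values=past)
        past = out.past_key_values
        generated.append(out.logits[:, -1].argmax(dim=-1, keepdim=True))

print(tok.decode(torch.cat([input_ids] + generated, dim=-1)[0]))
```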
3. Speculative Decoding: The "Predictor" Hack
This is a brilliant technique for speeding up generation. You use a tiny, lightning-fast "Draft Model" to guess the next 5 tokens, then hand those 5 tokens to the "Big Model," which verifies all of them in a single forward pass.
- If the Big Model agrees with all 5 guesses, you just generated 5 tokens for roughly the cost of 1.
- If it disagrees at some position, you keep the tokens accepted so far and fall back to the Big Model's own token at the first mismatch.
```mermaid
graph LR
A[Small Model: Guess 5 tokens] --> B[Big Model: Verify 5 tokens]
B -- Valid --> C[Display all 5 instantly]
B -- Invalid --> D[Generate 1 slow token]
```
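Below is a toy sketch of the same loop with two Hugging Face models, assuming distilgpt2 as the draft and gpt2-large as the target purely for illustration. It uses simple greedy matching rather than the full rejection-sampling algorithm from the speculative decoding papers, but the draft-then-verify structure is the same.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: distilgpt2 (draft) and gpt2-large (target) share the gpt2 tokenizer.
tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()

@torch.no_grad()
def speculative_step(input_ids, k=5):
    # 1) Draft model cheaply guesses the next k tokens (greedy, for simplicity).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft(draft_ids).logits[:, -1]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
    guesses = draft_ids[:, input_ids.shape[1]:]

    # 2) Target model verifies all k guesses in ONE forward pass.
    logits = target(draft_ids).logits
    # The target's own prediction at each drafted position:
    verify = logits[:, input_ids.shape[1] - 1 : -1].argmax(-1)

    # 3) Keep the longest prefix where draft and target agree, then append
    #    the target's token at the first disagreement so we always advance.
    match = (guesses == verify)[0]
    n_ok = int(match.long().cumprod(0).sum())
    accepted = guesses[:, :n_ok]
    correction = verify[:, n_ok:n_ok + 1]
    return torch.cat([input_ids, accepted, correction], dim=-1)

ids = tok("The quickest way to cut latency is", return_tensors="pt").input_ids
for _ in range(4):
    ids = speculative_step(ids)
print(tok.decode(ids[0]))
```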
4. Multi-GPU Orchestration
When a model is too big for a single GPU (like Llama 3 70B), we must split it across multiple cards.
- Data Parallelism (DP): Each GPU holds a full copy of the model. (Used to serve many concurrent users with smaller models.)
- Tensor Parallelism (TP): Each GPU holds a slice of every layer. (Used to reduce latency for a single large request; see the sketch after this list.)
- Pipeline Parallelism (PP): GPU 1 holds layers 1-10, GPU 2 holds layers 11-20, and so on. (Used for massive models that don't fit any other way.)
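In practice you rarely wire up parallelism by hand; serving frameworks expose it as a configuration knob. Here is a minimal sketch with vLLM, assuming a single node with 4 GPUs and access to the Llama 3 70B weights (the model name and GPU count are illustrative):

```python
from vllm import LLM, SamplingParams

# Tensor parallelism: shard every layer across 4 GPUs on one node.
# (Assumes 4 visible GPUs and access to the gated Llama 3 70B weights.)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Summarize KV caching in two sentences."], params)
print(outputs[0].outputs[0].text)
```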
5. Load Balancing: The Global Router
In a global production app, you don't send all users to one server in Virginia. You use a Load Balancer to route them to the nearest available GPU cluster.
The "Smart Rotation" Strategy: Avoid sending a "Long Context" request (like a 50-page PDF summary) to a server that is currently handling 50 "Short Chat" requests. This prevents "Head-of-Line Blocking," where one slow user makes everyone else wait.
Summary
- TTFT is the most important metric for user perception.
- KV Caching is essential for multi-turn conversations but costs VRAM.
- Speculative Decoding uses small models to speed up large models.
- Parallelism (Tensor/Pipeline) is how we split massive models across multiple GPUs.
In the next lesson, we conclude Module 8 with Continuous Benchmarking, learning how to measure these speeds in real-time.
Exercise: Latency Math
A user sends a 2,000-token prompt.
- The model's "Prompt Processing" speed is 500 tokens/sec.
- The model's "Generation" speed is 20 tokens/sec.
Calculate:
- How many seconds will it take to get the First Token? (TTFT)
- If the answer is 100 words (approx. 150 tokens), how much longer will it take to finish the whole response?
Answer:
- TTFT: $2,000 / 500 = 4.0$ seconds.
- Generation Time: $150 / 20 = 7.5$ seconds. Total: $4.0 + 7.5 = 11.5$ seconds.
If you want to reduce the TTFT, you'd need to upgrade your "Prompt Processing" hardware or use Prompt Caching!
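If you want to sanity-check the arithmetic in code, it is just two divisions:

```python
prompt_tokens = 2_000
prefill_speed = 500     # tokens/sec ("Prompt Processing")
decode_speed = 20       # tokens/sec ("Generation")
answer_tokens = 150

ttft = prompt_tokens / prefill_speed             # 4.0 seconds
generation_time = answer_tokens / decode_speed   # 7.5 seconds
print(f"TTFT: {ttft:.1f}s, total: {ttft + generation_time:.1f}s")  # 11.5s total
```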