Latency and Throughput: The Speed of Tokens

Master the temporal dynamics of LLMs. Learn why token generation speed varies, how to measure Time to First Token (TTFT), and why high throughput often comes at the cost of high latency.

In the modern AI landscape, "Fast is the new Smart." A model that takes 60 seconds to answer a simple chat query is useless, no matter how clever the response. As an engineer, your job is to balance Cost (Module 1.4) with Speed.

To do this, you must understand the two metrics that define AI performance: Latency (how long one user waits) and Throughput (how many tokens the system can process per minute across all users).

In this lesson, we will explore the "Physics" of token generation, learn how to measure system bottlenecks, and build a high-performance streaming architecture using FastAPI and React.


1. Defining the Metrics

Latency: The Clock

Latency is the time it takes for a single request to complete. In LLMs, we measure three types of latency:

  1. Time to First Token (TTFT): How long until the user sees the first word? (Crucial for UX).
  2. Time Per Output Token (TPOT): How fast does the model "type"? TPOT is the average time between tokens; its inverse is the familiar tokens-per-second figure.
  3. Total Latency: The end-to-end time from clicking "Send" to the final response.
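All three metrics can be captured for any streaming source with a small timing helper. A sketch, where the fake_model generator simulating per-token delays is purely illustrative:

```python
import time

def measure_stream(token_iter):
    """Measure TTFT, TPOT, and total latency for any token iterator."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # Time to First Token
        count += 1
    total = time.perf_counter() - start
    # TPOT: average seconds per token *after* the first one arrived
    tpot = (total - ttft) / max(count - 1, 1) if ttft is not None else None
    return {"ttft": ttft, "tpot": tpot, "total": total, "tokens": count}

def fake_model(n=5, delay=0.01):
    """Stand-in for a real model stream: yields n tokens with a fixed delay."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

stats = measure_stream(fake_model())
```

The same wrapper works unchanged around a real streaming API response, since it only needs an iterable of tokens.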

Throughput: The Pipe

Throughput is the total volume of tokens processed by the system across all users.

  • Measured in Tokens Per Minute (TPM) or Requests Per Minute (RPM).
  • High throughput is necessary for "Viral" apps with thousands of simultaneous users.

graph LR
    U[User] -->|Prompt| M[Model]
    M -->|Wait...| T1["First Token (TTFT)"]
    T1 -->|Streaming...| T2[Full Response]

    style T1 fill:#f9f,stroke:#333,stroke-width:4px
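Throughput budgets lend themselves to quick capacity arithmetic: the TPM limit divided by the average tokens consumed per request gives the number of requests the system can serve per minute. A sketch, where the 1,000,000 TPM limit and the per-request token counts are illustrative:

```python
def requests_per_minute_budget(tpm_limit, avg_input_tokens, avg_output_tokens):
    # Each request consumes input + output tokens from the per-minute budget
    tokens_per_request = avg_input_tokens + avg_output_tokens
    return tpm_limit // tokens_per_request

# Hypothetical: 1M TPM, 1,000-token prompts, 500-token completions
capacity = requests_per_minute_budget(1_000_000, 1_000, 500)
```

Running the numbers this way before launch tells you whether your provider tier can actually absorb your expected traffic.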

2. Why Token Generation is Slow

Generating a token is a sequential mathematical operation.

  • Input (Prompt): The model reads your whole prompt in one parallel pass (the "prefill" phase). This is fast (e.g., ~200ms for 1,000 tokens).
  • Output (Generation): The model must run its entire multi-billion-parameter neural network again for every single token it produces (the "decode" phase).

The Bottleneck: Memory Bandwidth. The GPU must fetch billions of numbers from its memory to calculate just one word. This is why "Output Tokens" are the primary cause of latency in your applications.
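A back-of-envelope calculation shows why: if generating each token requires streaming every weight from GPU memory once, then memory bandwidth divided by the model's size in bytes gives a rough ceiling on decode speed. The figures below (70B parameters, 16-bit weights, ~2,000 GB/s bandwidth) are illustrative, and the model ignores KV-cache traffic and batching:

```python
def decode_speed_ceiling(params_billions, bytes_per_param, bandwidth_gb_per_s):
    """Rough upper bound on tokens/sec for single-stream decoding:
    each token requires reading all weights from GPU memory once."""
    weights_gb = params_billions * bytes_per_param  # GB moved per token
    return bandwidth_gb_per_s / weights_gb

# 70B params at 16-bit (2 bytes/param) on a ~2,000 GB/s GPU
ceiling = decode_speed_ceiling(70, 2, 2000)  # roughly 14 tokens/sec
```

Note how quantizing to 4-bit (0.5 bytes per parameter) quadruples this ceiling without changing the hardware, which is the intuition behind Section 6.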


3. Streaming: Solving the "TTFT" Problem

If a model takes 5 seconds to generate a response, and you wait for the whole block to be ready before showing it to the user, your app feels "Broken."

Streaming allows you to show tokens as they are generated. The user starts reading at 200ms, even if the total 5-second processing time hasn't finished.

Python/FastAPI Implementation: Streaming Tokens

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import boto3
import json

app = FastAPI()
bedrock = boto3.client(service_name='bedrock-runtime')

async def generate_token_stream(prompt):
    # Claude v2's completion API requires the prompt to start with "\n\nHuman:"
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 1000,
        "temperature": 0.5,
    })
    
    # invoke_model_with_response_stream delivers chunks as they are generated
    response = bedrock.invoke_model_with_response_stream(
        modelId="anthropic.claude-v2",
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    
    stream = response.get('body')
    if stream:
        # Note: boto3 is synchronous; iterating here blocks the event loop.
        # In production, run this in a thread pool or use an async client.
        for event in stream:
            chunk = event.get('chunk')
            if chunk:
                # Each chunk payload is a JSON object with a 'completion' field
                decoded = json.loads(chunk.get('bytes').decode())
                yield decoded.get('completion', '')

@app.get("/stream")
async def stream_ai():
    return StreamingResponse(
        generate_token_stream("Write a 500-word essay on AI."), 
        media_type="text/event-stream"
    )

4. Frontend UX: Handling the Stream in React

On the frontend, you must read the response body incrementally as a stream of bytes, rather than awaiting a single JSON payload.

import React, { useState } from 'react';

const StreamReader = () => {
  const [content, setContent] = useState("");

  const startStream = async () => {
    const response = await fetch('/stream');
    if (!response.body) return; // streaming unsupported or request failed
    const reader = response.body.getReader();
    const decoder = new TextDecoder();

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      
      const chunk = decoder.decode(value, { stream: true });
      setContent((prev) => prev + chunk);
    }
  };

  return (
    <div className="p-6 bg-slate-900 min-h-screen text-slate-100">
      <button 
        onClick={startStream}
        className="px-4 py-2 bg-blue-600 hover:bg-blue-500 rounded transition"
      >
        Generate Essay
      </button>
      
      <div className="mt-6 p-4 border border-slate-700 bg-slate-800 rounded-lg whitespace-pre-wrap">
        {content || "Nothing generated yet..."}
        <span className="w-2 h-4 bg-cyan-400 inline-block animate-pulse ml-1" />
      </div>
    </div>
  );
};

export default StreamReader;

5. Token Density and Rate Limits (TPM vs RPM)

Every API provider (AWS, OpenAI) has Rate Limits.

  • RPM (Requests Per Minute): Limits how many times you can hit "Send."
  • TPM (Tokens Per Minute): Limits how "Heavy" those requests are.

The Risk: A single user dumping a 100,000-token PDF into your app might consume your entire company's TPM budget for the whole minute, causing "429 Too Many Requests" errors for everyone else.
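One defensive pattern is a per-tenant sliding-window token budget that rejects oversized usage before it ever reaches the provider. A minimal sketch, with hypothetical limits; the optional now parameter exists only to make the logic testable:

```python
import time
from collections import deque

class TokenBudget:
    """Sliding-window limiter: refuse requests that would push token
    usage over a per-minute cap (values here are hypothetical)."""
    def __init__(self, tpm_limit, window_s=60):
        self.tpm_limit = tpm_limit
        self.window_s = window_s
        self.events = deque()  # (timestamp, tokens) pairs inside the window

    def try_spend(self, tokens, now=None):
        now = time.monotonic() if now is None else now
        # Drop usage records that have aged out of the window
        while self.events and now - self.events[0][0] >= self.window_s:
            self.events.popleft()
        used = sum(t for _, t in self.events)
        if used + tokens > self.tpm_limit:
            return False  # caller should respond with HTTP 429
        self.events.append((now, tokens))
        return True

budget = TokenBudget(tpm_limit=10_000)
```

Checking the budget before calling the model means one heavy user gets throttled individually instead of exhausting the shared TPM pool.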

Architectural Solution: Multi-Model Load Balancing

If your primary model (e.g. Claude 3) is at its TPM limit, your backend should automatically "failover" to a secondary model (e.g. Llama 3) on a different provider.
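A minimal failover sketch, assuming a boto3 bedrock-runtime client and using the non-streaming invoke_model call for brevity; the model IDs in the priority list are placeholders you would replace with real ones:

```python
import json

def invoke_with_failover(bedrock, prompt, model_ids):
    """Try each model in priority order; fall through on throttling."""
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 500,
    })
    for model_id in model_ids:
        try:
            response = bedrock.invoke_model(modelId=model_id, body=body)
            return json.loads(response["body"].read())
        except bedrock.exceptions.ThrottlingException:
            continue  # TPM/RPM limit hit: fail over to the next model
    raise RuntimeError("All models are rate-limited")
```

In a real system you would also log which model served each request, since silent failover can mask quality differences between models.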


6. Throughput vs. Latency: The GPU Squeeze

A key lever in this trade-off is Quantization: storing model weights at lower numeric precision.

  • Small, quantized models (4-bit): deliver very high generation speed (often > 100 tokens/sec) but slightly lower reasoning accuracy.
  • Large, full-precision models (16-bit): generate more slowly (often < 15 tokens/sec) but produce superior, more "Intelligent" output.

Decision Factor: If you are building a real-time translator, raw speed is everything. If you are building a legal analyzer, higher latency is an acceptable price for better accuracy.


7. Summary and Key Takeaways

  1. TTFT is King: Streaming tokens is the single most important UX feature for LLM apps.
  2. Output tokens are the bottleneck: Every word generated requires a full pass through the neural network.
  3. TPM Management: High-volume inputs can block your entire system via rate limits.
  4. Balancing Scale: Use load balancers to distribute token loads across different models and regions (e.g. us-east-1 and us-west-2).

In the next module, Module 2: Where Token Waste Comes From, we will move from understanding metrics to identifying architectural rot. We’ll learn why recursive system prompts and redundant context are the "Silent Killers" of your AI budget.


Exercise: Benchmarking Speed

  1. Open your browser's "Network" tab.
  2. Visit a site like ChatGPT or a local Ollama instance.
  3. Observe when the first byte of text appears (TTFT) vs when the spinner stops (Total Latency).
  4. Calculate the average tokens per second: divide the word count by the time taken, then multiply by roughly 1.3 (the typical tokens-per-word ratio in English). For comparison, most humans read at roughly 5-8 tokens per second. Is your AI faster or slower than a human reader?

Congratulations on completing Module 1! You are now a master of token physics.
