Real-Time AI: Streaming Responses and SSE

Eliminate the 'Loading' spinner. Learn how to use StreamingResponse and Server-Sent Events (SSE) to stream AI results word-by-word to your users.

Nobody likes waiting 10 seconds for an AI to generate a long article. Users prefer to see the text appear "word-by-word," just like in ChatGPT.

In this lesson, we learn how to use StreamingResponse and Generators to build real-time AI experiences.


1. What is a Streaming Response?

Instead of sending one massive JSON object at the end, the server sends a "Stream" of small chunks. The browser can start displaying those chunks while the server is still generating the rest.
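The difference can be sketched with plain Python generators (a toy illustration, not FastAPI-specific): a buffered response returns everything at once, while a streamed response hands out chunks as soon as they exist.

```python
def buffered_response() -> str:
    # The client sees nothing until the whole body is ready.
    parts = ["chunk-1 ", "chunk-2 ", "chunk-3"]
    return "".join(parts)

def streamed_response():
    # Each chunk is handed to the client as soon as it exists.
    for part in ["chunk-1 ", "chunk-2 ", "chunk-3"]:
        yield part  # the caller can render this immediately

full = buffered_response()
chunks = list(streamed_response())
print(full)             # one big string, delivered at the end
print("".join(chunks))  # same content, but available piece by piece
```

The total work is identical; the streamed version just lets the consumer start rendering after the first chunk.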


2. Using Python Generators

To stream data, we use an Async Generator: a function that yields values one at a time over its lifetime instead of returning once.

import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def ai_streamer(prompt: str):
    # Simulate an AI generating words one by one
    words = ["FastAPI", "is", "the", "future", "of", "AI", "development."]
    for word in words:
        yield f"data: {word}\n\n"  # SSE frame: "data: ..." plus a blank line
        await asyncio.sleep(0.5)

@app.get("/stream-ai")
async def stream_ai(prompt: str):
    return StreamingResponse(ai_streamer(prompt), media_type="text/event-stream")

3. Server-Sent Events (SSE)

SSE is a standard way to send a one-way stream from the server to the client. Unlike WebSockets (which are two-way and complex), SSE is lightweight and works over standard HTTP.

  • Format: Each message consists of one or more lines prefixed with data: and is terminated by a blank line (i.e. it ends with two newlines, \n\n).

4. Streaming from an LLM Client

Both OpenAI and Gemini clients support async streaming out of the box.

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def get_openai_stream(prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # THE KEY PARAMETER
    )
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
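To serve tokens like these over SSE, the raw text chunks still need the data:/blank-line framing before being handed to StreamingResponse. One way to keep the two concerns separate is a small adapter that works with any async token generator (sse_wrap is an assumed helper name, and the [DONE] sentinel is a common convention, not a requirement):

```python
import asyncio
from typing import AsyncIterator

async def sse_wrap(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    # Frame each raw token as one SSE event.
    async for token in tokens:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"  # signal end of stream to the client

# In an endpoint you would return:
#   StreamingResponse(sse_wrap(get_openai_stream(prompt)),
#                     media_type="text/event-stream")

# Demo with a fake token stream instead of a live LLM call:
async def fake_tokens():
    for token in ["Hello", " world"]:
        yield token

async def main():
    return [frame async for frame in sse_wrap(fake_tokens())]

frames = asyncio.run(main())
print(frames)  # ['data: Hello\n\n', 'data:  world\n\n', 'data: [DONE]\n\n']
```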

Visualizing the Stream

sequenceDiagram
    participant C as Client (Browser)
    participant S as Server (FastAPI)
    
    C->>S: GET /stream-ai
    S-->>C: 200 OK (Keep connection open)
    Note over S: Generating word 1...
    S->>C: "FastAPI"
    Note over S: Generating word 2...
    S->>C: "is"
    Note over S: Generating word 3...
    S->>C: "the"
    Note over C,S: Connection Closed
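On the client side, a browser would typically use the EventSource API, but the wire format is simple enough to decode by hand. A minimal sketch of parsing a received SSE stream back into messages (parse_sse is a hypothetical helper for illustration):

```python
def parse_sse(raw: str) -> list[str]:
    # Split the stream on blank lines (event boundaries) and
    # strip the "data: " prefix from each event's payload lines.
    messages = []
    for block in raw.split("\n\n"):
        data_lines = [line[len("data: "):]
                      for line in block.splitlines()
                      if line.startswith("data: ")]
        if data_lines:
            messages.append("\n".join(data_lines))
    return messages

stream = "data: Hello\n\ndata: world\n\n"
print(parse_sse(stream))  # ['Hello', 'world']
```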

Summary

  • StreamingResponse: The FastAPI class for sending data chunks.
  • yield: The keyword that makes streaming possible.
  • Perceived Performance: Streaming makes your app feel fast, even if the total generation time is the same.
  • SSE: The lightweight alternative to WebSockets for AI text streams.

In the next lesson, we wrap up Module 19 with Exercises on AI API engineering.


Exercise: The Stream Architect

You are building an AI Coding Assistant.

  1. Why is streaming particularly important for code generation compared to standard chat?
  2. If the user closes the browser tab while the AI is still streaming, does the FastAPI server keep generating text? (Hint: Research how to check for request.is_disconnected()).
