
Real-Time AI: Streaming Responses and SSE
Eliminate the 'Loading' spinner. Learn how to use StreamingResponse and Server-Sent Events (SSE) to stream AI results word-by-word to your users.
Nobody likes waiting 10 seconds for an AI to generate a long article. Users prefer to see the text appear "word-by-word," just like in ChatGPT.
In this lesson, we learn how to use StreamingResponse and Generators to build real-time AI experiences.
1. What is a Streaming Response?
Instead of sending one massive JSON object at the end, the server sends a "Stream" of small chunks. The browser can start displaying those chunks while the server is still generating the rest.
2. Using Python Generators
To stream data, we use an async generator: a function that yields values one at a time over its lifetime instead of returning a single result.
```python
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def ai_streamer(prompt: str):
    # Simulate an AI generating words one by one
    words = ["FastAPI", "is", "the", "future", "of", "AI", "development."]
    for word in words:
        yield f"data: {word}\n\n"
        await asyncio.sleep(0.5)

@app.get("/stream-ai")
async def stream_ai(prompt: str):
    return StreamingResponse(ai_streamer(prompt), media_type="text/event-stream")
```
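You can also drive an async generator directly, without a running server, to see the chunks arrive one at a time. A minimal standalone sketch (`demo_streamer` and `collect` are illustrative names, not FastAPI APIs; the delay is shortened so it runs quickly):

```python
import asyncio

async def demo_streamer(prompt: str):
    # Same shape as ai_streamer above, with a shorter delay for the demo
    for word in ["FastAPI", "is", "fast"]:
        yield f"data: {word}\n\n"
        await asyncio.sleep(0.01)

async def collect(gen):
    # Drain an async generator into a list, like a browser reading the stream
    return [chunk async for chunk in gen]

chunks = asyncio.run(collect(demo_streamer("demo")))
print(len(chunks))  # 3
```

Each element of `chunks` is one SSE frame, already terminated by the required blank line.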
3. Server-Sent Events (SSE)
SSE is a standard way to send a one-way stream from the server to the client. Unlike WebSockets (which are two-way and complex), SSE is lightweight and works over standard HTTP.
- Format: Every chunk must start with `data: ` and end with two newlines (`\n\n`).
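The framing rules above can be captured in a small helper. This is a sketch, not part of FastAPI; `sse_event` is a hypothetical name, and the optional `event` field is part of the SSE spec for named event types:

```python
def sse_event(data: str, event: str = "") -> str:
    # Build one SSE frame: an optional "event:" line, one "data:" line
    # per payload line, terminated by a blank line (the required \n\n)
    lines = [f"event: {event}"] if event else []
    for part in data.splitlines() or [""]:
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"

print(repr(sse_event("FastAPI")))       # 'data: FastAPI\n\n'
print(repr(sse_event("line1\nline2")))  # 'data: line1\ndata: line2\n\n'
```

Note that multi-line payloads need one `data:` prefix per line; the client reassembles them into a single message.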
4. Streaming from an LLM Client
Both OpenAI and Gemini clients support async streaming out of the box.
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def get_openai_stream(prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # THE KEY PARAMETER
    )
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```
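To tie sections 3 and 4 together: the LLM yields raw text deltas, which must be wrapped in SSE framing before they are sent. A sketch with a fake delta stream standing in for the real client (`fake_llm_stream` and `sse_wrap` are hypothetical helpers, so it runs without an API key):

```python
import asyncio

async def fake_llm_stream(prompt: str):
    # Stand-in for get_openai_stream: yields raw text deltas
    for delta in ["Fast", "API ", "rocks"]:
        yield delta
        await asyncio.sleep(0)

async def sse_wrap(deltas):
    # Wrap each raw delta in the SSE "data: ...\n\n" framing
    async for delta in deltas:
        yield f"data: {delta}\n\n"

async def main():
    return [frame async for frame in sse_wrap(fake_llm_stream("hi"))]

frames = asyncio.run(main())
print(len(frames))  # 3
```

In the endpoint you would then return `StreamingResponse(sse_wrap(get_openai_stream(prompt)), media_type="text/event-stream")`.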
Visualizing the Stream
```mermaid
sequenceDiagram
    participant C as Client (Browser)
    participant S as Server (FastAPI)
    C->>S: GET /stream-ai
    S-->>C: 200 OK (Keep connection open)
    Note over S: Generating word 1...
    S->>C: "FastAPI"
    Note over S: Generating word 2...
    S->>C: "is"
    Note over S: Generating word 3...
    S->>C: "awesome"
    Note over C,S: Connection Closed
```
Summary
- StreamingResponse: The FastAPI class for sending data in chunks.
- yield: The keyword that makes streaming possible.
- Perceived Performance: Streaming makes your app feel fast, even if the total generation time is the same.
- SSE: The lightweight alternative to WebSockets for AI text streams.
In the next lesson, we wrap up Module 19 with Exercises on AI API engineering.
Exercise: The Stream Architect
You are building an AI Coding Assistant.
- Why is streaming particularly important for code generation compared to standard chat?
- If the user closes the browser tab while the AI is still streaming, does the FastAPI server keep generating text? (Hint: research `request.is_disconnected()`.)