Module 2 Lesson 3: Streaming Responses
Perceived zero-latency UX: how to use LangChain's .stream() method to display text as it is being generated.
Streaming: Eliminating the Wait
In Module 1, we used .invoke(). This method waits until the model has finished its entire answer before returning it. For long answers, the user might wait 10-20 seconds in silence. Streaming fixes this by delivering the answer "token by token."
1. The UX Advantage
Without streaming, the user sees a spinner. With streaming, the user sees text appearing immediately, making the application feel responsive and alive.
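To make "appearing immediately" concrete, here is a minimal sketch that measures the time to the first token versus the time to the full answer. It assumes model is a chat model you have already configured (for example in Module 1); exact timings will vary.

# A minimal sketch for measuring perceived latency. Assumes "model" is an
# already-configured LangChain chat model.
import time

start = time.perf_counter()
first_token_at = None

for chunk in model.stream("Explain streaming in one paragraph."):
    if first_token_at is None and chunk.content:
        # The moment the first visible text arrives: the "Time to First Token"
        first_token_at = time.perf_counter() - start
    print(chunk.content, end="", flush=True)

total = time.perf_counter() - start
print(f"\nFirst token after {first_token_at:.2f}s, full answer after {total:.2f}s")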
2. Using the .stream() Generator
Instead of one big object, .stream() returns a Python "Generator" that yields "Chunks" of the message.
# Instead of response = model.invoke(...)
chunks = []
for chunk in model.stream("Write a long poem about the ocean."):
    # Print each piece immediately without a newline
    print(chunk.content, end="", flush=True)
    chunks.append(chunk)
3. Chunks vs. Messages
- A chunk (AIMessageChunk) is just a fragment of a message.
- Chunks can be added together to create a final AIMessage: final_message = chunks[0] + chunks[1] + ... (see the sketch below).
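If you still need the complete message once the stream has finished, the chunks support "+" and can simply be folded together. A minimal sketch, assuming model is the same chat model used above:

from functools import reduce
from operator import add

chunks = list(model.stream("Write a haiku about the ocean."))

# Message chunks support "+", so folding them together rebuilds the full message.
final_message = reduce(add, chunks)

print(final_message.content)  # the complete generated text in one string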
4. Visualizing the Token Stream
sequenceDiagram
participant U as User (UI)
participant L as LangChain
participant A as OpenAI API
U->>L: model.stream(prompt)
L->>A: Start generation
A-->>L: Token: 'The'
L-->>U: Show 'The'
A-->>L: Token: ' ocean'
L-->>U: Show ' ocean'
A-->>L: [Stream Finished]
5. Why Not Always Stream?
Streaming is great for chat, but it can be a poor fit for:
- Backend logs: you don't want 500 log lines for a single sentence.
- Structured data (JSON): you can't parse half a JSON object, so you usually wait for the full response before converting it to a Python dictionary (see the sketch below).
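For structured output the usual pattern is therefore to wait for the whole response (or simply use .invoke()) and only then parse it. A minimal sketch, assuming model is a configured chat model and that it answers this prompt with bare JSON:

import json

# Wait for the complete response; parsing a half-streamed JSON string would
# raise json.JSONDecodeError.
response = model.invoke(
    "Return only a JSON object with the keys 'city' and 'country' for Paris."
)

data = json.loads(response.content)
print(data["city"], data["country"])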
Key Takeaways
- .stream() reduces the "Time to First Token" (TTFT).
- It uses Python generators for efficient memory handling.
- Streaming is primarily a UX/Frontend improvement.
- Chunks must be aggregated if you need the full message after the stream ends.