Module 3 Lesson 6: Streaming Responses
Words as they happen. Why streaming is the secret to a fast-feeling AI application.
Streaming Responses: The Secret to Instant AI
Have you ever used an AI app where you wait 10 seconds and then the entire paragraph suddenly appears all at once? That is a non-streaming request, and it feels slow and frustrating.
Streaming is the technique used to send words to the screen as they are generated. It is the default behavior of Ollama, and it is the key to a professional user experience.
Time to First Token (TTFT)
The most important metric in AI performance isn't how long the whole answer takes; it's how long the first word takes. Compare these two measurements of the same response:
- Total Time: 10 seconds.
- TTFT: 0.2 seconds.
If a user sees the first word start to appear within 200ms, they perceive the app as "instant," even if it takes another 9.8 seconds to finish the sentence.
How Streaming Works in the CLI
When you run ollama run llama3, you are seeing streaming in action. Notice how the letters dance across the screen? Ollama is receiving a stream of data from the local server and rendering it bit by bit.
How Streaming Works in Code (API)
When you make a request to the Ollama API, the response is sent as a series of JSON objects, one for each "token" (word or partial word).
{"model":"llama3", "created_at":"...", "response":"The", "done":false}
{"model":"llama3", "created_at":"...", "response":" sky", "done":false}
{"model":"llama3", "created_at":"...", "response":" is", "done":false}
{"model":"llama3", "created_at":"...", "response":" blue.", "done":false}
{"model":"llama3", "created_at":"...", "response":"", "done":true, "total_duration":..., "eval_count":...}
By reading this stream as it arrives, your application can update the UI immediately. The final object, the one with "done": true, carries an empty response plus summary statistics such as the total generation time.
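In Python, reading that stream as it arrives might look like the following minimal sketch. It assumes the requests library is installed, a local Ollama server is running on its default port (11434), and the llama3 model has been pulled; the prompt is just an illustration.
import json
import requests
# Assumes a local Ollama server and a pulled llama3 model (hypothetical prompt).
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?"},
    stream=True,
)
for line in response.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    # Print each fragment immediately instead of waiting for the full answer.
    print(chunk.get("response", ""), end="", flush=True)
    if chunk.get("done"):
        break
Each printed fragment reaches the user the moment the model produces it, which is exactly the effect you see in the CLI.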
When to DISABLE Streaming
There are rare cases where you might want to wait for the whole answer:
- JSON Processing: If you asked the model for JSON and want to turn the response into a table or a chart, you need the complete string before you can parse it.
- Server-to-Server: If your Python script is calling Ollama to save an answer to a database, you don't need to see it "stream" into the database.
In these cases, you add "stream": false to your API request.
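As a sketch of that case (same assumptions as above: a local server and the requests library, with an illustrative prompt), setting "stream": false makes Ollama return one complete JSON object whose "response" field holds the full text:
import requests
# "stream": false tells Ollama to send the whole answer as a single JSON object.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Return three primary colors as a JSON array.",
        "stream": False,
    },
)
data = response.json()
print(data["response"])  # The complete string, ready to parse or store.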
Why Local Streaming is Faster
In the cloud (OpenAI, for example), your words have to travel thousands of miles through fiber-optic cables before they reach your screen. With Ollama, they only have to travel from your CPU/GPU to your screen, with no network round trip at all. On hardware that runs the model comfortably, local streaming can therefore have a lower TTFT than even the most powerful cloud models.
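If you want to check TTFT on your own machine, a minimal sketch (same assumptions as the earlier examples) times the gap before the first streamed token arrives:
import json
import time
import requests
start = time.time()
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?"},
    stream=True,
)
for line in response.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    if chunk.get("response"):
        # The first non-empty fragment marks the Time to First Token.
        print(f"TTFT: {time.time() - start:.2f} seconds")
        break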
Key Takeaways
- Streaming allows users to read as the model thinks.
- Time to First Token (TTFT) is the critical metric for AI UX.
- By default, Ollama streams every response.
- Only disable streaming when you need to parse the final result (like JSON) before showing it to the user.