Module 8 Lesson 2: Streaming API Responses
Words as they happen. How to handle NDJSON streams in your application for a professional AI feel.
Streaming the API: The Pro User Experience
If you set "stream": true (the default) in your API call, Ollama won't send you one big JSON object. Instead, it will send a continuous stream of tiny JSON objects.
This technically uses a format called NDJSON (Newline Delimited JSON).
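To make this concrete, here is a minimal request sketch in JavaScript (assuming a local Ollama server on the default port, 11434):

```javascript
// Minimal sketch: request a streamed completion from a local Ollama server.
// Assumes Ollama is running on the default port 11434.
const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3",
    prompt: "Why is the sky blue?",
    stream: true, // the default; shown here for clarity
  }),
});
// response.body is now a readable stream of NDJSON lines (see below).
```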
1. Why Stream?
- User Delight: Watching the response appear word by word makes the app feel faster, even though the total generation time is unchanged.
- Lower Memory Usage: Your app doesn't have to hold a 5,000-word response in memory before showing it.
2. What the Stream Looks Like
Each line of the response is its own valid JSON object:
{"model":"llama3","created_at":"...","response":"The","done":false}
{"model":"llama3","created_at":"...","response":" sky","done":false}
{"model":"llama3","created_at":"...","response":" is","done":false}
{"model":"llama3","created_at":"...","response":" blue.","done":true}
3. How to Process the Stream (Logic)
In your code, you cannot simply call JSON.parse() on the whole response body, because the body never arrives as a single document. Instead, you must:
- Open a "Readable Stream" over the response body.
- Listen for chunks of data as they arrive.
- Buffer the data and split it on the newline character (\n).
- Parse each individual line as JSON.
- Append the .response text to your UI.
A sketch of this logic in JavaScript follows below.
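As a minimal sketch, assuming the fetch call shown earlier and a hypothetical appendToUI() function that renders text to the screen:

```javascript
// Minimal sketch: read an NDJSON stream line by line.
// appendToUI() is a hypothetical placeholder for your rendering code.
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { value, done } = await reader.read();
  if (done) break;

  // Decode the raw bytes and accumulate them, since a network chunk
  // may end mid-line.
  buffer += decoder.decode(value, { stream: true });

  // Every complete line in the buffer is one JSON object.
  const lines = buffer.split("\n");
  buffer = lines.pop(); // keep any trailing partial line for the next chunk

  for (const line of lines) {
    if (!line.trim()) continue;
    const chunk = JSON.parse(line);
    appendToUI(chunk.response); // stream each text fragment to the UI
    if (chunk.done) {
      console.log("Final stats:", chunk); // see section 4 below
    }
  }
}
```

The buffering step is the part beginners miss: network chunks do not respect line boundaries, so the last (possibly incomplete) line must be carried over to the next iteration.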
4. The Final Object
The very last object in the stream ("done": true) is special. It doesn't just contain the last word; it also carries the statistics for the run:
- total_duration: Total time taken for the entire request.
- load_duration: How long it took to load the model from disk.
- eval_count: How many tokens were generated.
- eval_duration: How long generation took. Dividing eval_count by eval_duration gives tokens per second (the speed metric).
All durations are reported in nanoseconds.
This allows your app to show a little footer saying: "Generated 500 tokens in 10 seconds (50 t/s)."
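Because the durations arrive in nanoseconds, the math is a one-liner. A minimal sketch, assuming chunk is the final object from the reading loop above:

```javascript
// Minimal sketch: build the stats footer from the final stream object.
// Assumes `chunk` is the object that arrived with "done": true.
const seconds = chunk.eval_duration / 1e9;          // nanoseconds -> seconds
const tokensPerSecond = chunk.eval_count / seconds; // the speed metric
const footer = `Generated ${chunk.eval_count} tokens in ` +
               `${seconds.toFixed(1)}s (${tokensPerSecond.toFixed(0)} t/s)`;
```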
Key Takeaways
- Streaming is effectively mandatory for a responsive user-facing chat app.
- The data format is NDJSON (one JSON object per line).
- Your code must use an asynchronous reader to process chunks as they arrive.
- The final object contains critical performance metrics for your app logs.