Module 8 Lesson 2: Streaming API Responses
Words as they happen. How to handle NDJSON streams in your application for a professional AI feel.
Streaming the API: The Pro User Experience
If you set "stream": true (the default) in your API call, Ollama won't send you one big JSON object. Instead, it will send a continuous stream of tiny JSON objects.
This technically uses a format called NDJSON (Newline Delimited JSON).
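To make this concrete, here is a minimal request sketch in JavaScript (assuming a local Ollama server on the default port, 11434):

```javascript
// Minimal sketch: request a streamed completion from a local Ollama server.
// Assumes Ollama is running on the default port 11434.
const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3",
    prompt: "Why is the sky blue?",
    stream: true, // the default; shown here for clarity
  }),
});
// response.body is now a readable stream of NDJSON lines (see below).
```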
1. Why Stream?
- User Delight: Watching the response appear word by word makes the app feel faster, even though the total generation time is unchanged.
- Lower Memory Usage: Your app doesn't have to hold a 5,000-word response in memory before showing it.
2. What the Stream Looks Like
Each line of the response is its own valid JSON object:
{"model":"llama3","created_at":"...","response":"The","done":false}
{"model":"llama3","created_at":"...","response":" sky","done":false}
{"model":"llama3","created_at":"...","response":" is","done":false}
{"model":"llama3","created_at":"...","response":" blue.","done":true}
3. How to Process the Stream (Logic)
In your code, you cannot simply call JSON.parse() on the whole response body, because the body never arrives as a single document. Instead, you must:
- Open a "Readable Stream" over the response body.
- Listen for chunks of data as they arrive.
- Buffer the data and split it on the newline character (\n).
- Parse each individual line as JSON.
- Append the .response text to your UI.
A sketch of this logic in JavaScript follows below.
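As a minimal sketch, assuming the fetch call shown earlier and a hypothetical appendToUI() function that renders text to the screen:

```javascript
// Minimal sketch: read an NDJSON stream line by line.
// appendToUI() is a hypothetical placeholder for your rendering code.
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { value, done } = await reader.read();
  if (done) break;

  // Decode the raw bytes and accumulate them, since a network chunk
  // may end mid-line.
  buffer += decoder.decode(value, { stream: true });

  // Every complete line in the buffer is one JSON object.
  const lines = buffer.split("\n");
  buffer = lines.pop(); // keep any trailing partial line for the next chunk

  for (const line of lines) {
    if (!line.trim()) continue;
    const chunk = JSON.parse(line);
    appendToUI(chunk.response); // stream each text fragment to the UI
    if (chunk.done) {
      console.log("Final stats:", chunk); // see section 4 below
    }
  }
}
```

The buffering step is the part beginners miss: network chunks do not respect line boundaries, so the last (possibly incomplete) line must be carried over to the next iteration.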
4. The Final Object
The very last object in the stream ("done": true) is special. It doesn't just contain the last word; it also carries the statistics for the run:
- total_duration: Total time taken for the entire request.
- load_duration: How long it took to load the model from disk.
- eval_count: How many tokens were generated.
- eval_duration: How long generation took. Dividing eval_count by eval_duration gives tokens per second (the speed metric).
All durations are reported in nanoseconds.
This allows your app to show a little footer saying: "Generated 500 tokens in 10 seconds (50 t/s)."
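Because the durations arrive in nanoseconds, the math is a one-liner. A minimal sketch, assuming chunk is the final object from the reading loop above:

```javascript
// Minimal sketch: build the stats footer from the final stream object.
// Assumes `chunk` is the object that arrived with "done": true.
const seconds = chunk.eval_duration / 1e9;          // nanoseconds -> seconds
const tokensPerSecond = chunk.eval_count / seconds; // the speed metric
const footer = `Generated ${chunk.eval_count} tokens in ` +
               `${seconds.toFixed(1)}s (${tokensPerSecond.toFixed(0)} t/s)`;
```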
Key Takeaways
- Streaming is effectively mandatory for a responsive user-facing chat app.
- The data format is NDJSON (one JSON object per line).
- Your code must use an asynchronous reader to process chunks as they arrive.
- The final object contains critical performance metrics for your app logs.