
Streaming vs. Batching: Delivery Economics
Master the economics of API delivery: learn how the Batch API can cut your token bill by 50%, and when to prioritize streaming for user experience.
How you receive tokens is as important as how many tokens you receive.
Most developers use streaming (Server-Sent Events) to give the user a real-time feel. But is streaming always the most efficient way to interact with an LLM? What if the task doesn't require a human to be watching (e.g., an overnight data extraction job)?
In this lesson, we master inference delivery modes. We'll explore the Batch API (the 50% discount), Streaming (the latency king), and Request/Response (the standard).
1. The Batch API (The Financial Superpower)
If a task can wait up to 24 hours (and often it only waits an hour), you should use the Batch API.
- The Deal: You send one file containing, say, 1,000 queries. The provider (OpenAI, Anthropic) processes them whenever it has spare GPU capacity.
- The Discount: 50% off input and output token prices.
Token ROI: Batching is equivalent to a 2x cost-efficiency gain on your prompts, with almost no engineering effort.
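To see that ROI concretely, here is a minimal cost sketch in Python. The per-million-token prices below are placeholder assumptions (check your provider's current price sheet); the only point is the 50% batch multiplier.

```python
# Placeholder prices -- substitute your provider's current rates.
INPUT_PRICE_PER_M = 5.00    # assumed standard input price, $ per 1M tokens
OUTPUT_PRICE_PER_M = 15.00  # assumed standard output price, $ per 1M tokens
BATCH_DISCOUNT = 0.5        # Batch API: 50% off both input and output

def job_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Estimate the dollar cost of a job in standard vs. batch delivery."""
    cost = (
        (input_tokens / 1e6) * INPUT_PRICE_PER_M
        + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M
    )
    return cost * (BATCH_DISCOUNT if batch else 1.0)

# Example: 1,000 requests averaging 2,000 input and 500 output tokens each
print(f"Standard: ${job_cost(2_000_000, 500_000):.2f}")
print(f"Batch:    ${job_cost(2_000_000, 500_000, batch=True):.2f}")
```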
2. Streaming: The Perceived Efficiency
Streaming doesn't save tokens by itself, but it saves user patience and, through early cancellation, real money.
- Token Efficiency Link: If a user spots a mistake in the first 5 tokens of a 1,000-token response, they can cancel the stream immediately.
- Savings: The remaining 995 output tokens are never generated, so you never pay for them. In a non-streaming app, you would have paid for the full 1,000 tokens before the user could even see the result.
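A minimal sketch of this early-exit pattern using the OpenAI Python SDK's streaming interface; the looks_wrong() check is a hypothetical, app-specific placeholder, not part of any API.

```python
from openai import OpenAI

client = OpenAI()

def looks_wrong(text: str) -> bool:
    # Hypothetical app-specific check: replace with your own validation logic.
    return "as an ai language model" in text.lower()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a 1,000-word summary of our Q3 sales data."}],
    stream=True,
)

received = ""
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    received += delta
    print(delta, end="", flush=True)
    if looks_wrong(received):
        # Closing the stream aborts the connection, so you only pay for
        # the tokens generated up to this point.
        stream.close()
        break
```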
3. Implementation: Batch Request Pattern (Python)
Python Code: Preparing a Batch Job
```python
import json

# items_to_extract is assumed to be an iterable of objects with .id and .text fields.
# 1. Create a JSONL file (one line per request) -- the 50% discount applies here!
with open("batch_tasks.jsonl", "w") as f:
    for item in items_to_extract:
        f.write(json.dumps({
            "custom_id": f"task-{item.id}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o",
                "messages": [{"role": "user", "content": item.text}],
            },
        }) + "\n")

# 2. Upload and start the batch job (see below).
# Result: success in < 24 hours at 50% of the standard price.
```
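Step 2 above is only a comment; for completeness, here is a sketch of the upload-and-submit step using the OpenAI Python SDK's Files and Batches endpoints (verify parameter names against the current SDK documentation).

```python
from openai import OpenAI

client = OpenAI()

# Upload the JSONL file created above.
batch_file = client.files.create(
    file=open("batch_tasks.jsonl", "rb"),
    purpose="batch",
)

# Start the batch job; the 24-hour completion window is what buys the 50% price.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print(batch.id, batch.status)
# Poll client.batches.retrieve(batch.id) until status == "completed",
# then download results via client.files.content(batch.output_file_id).
```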
4. Comparing the Modes
| Mode | Token Cost | User Latency | Use Case |
|---|---|---|---|
| Streaming | 100% | Ultra-low | Chatbots, co-pilots |
| Request/Response | 100% | High (waits for the full response) | Small data extraction |
| Batch API | 50% | Very high (up to 24 h) | Massive DB processing, ETL |
5. Decision Logic: The "Offline" Policy
Your application should have a Routing Layer (Module 14.3) for delivery modes.
- If the request is from a Web UI -> Stream.
- If the request is from a webhook or cron job -> Batch.
Senior Strategy: 80% of "agentic" backend work (background research, email drafting, log analysis) can be moved to the Batch API, effectively cutting those operating costs in half. A minimal routing sketch follows below.
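The sketch below illustrates such a routing layer. The source and deadline_hours fields are assumed properties of your own request object; adapt the thresholds to your product's latency budget.

```python
from enum import Enum

class DeliveryMode(Enum):
    STREAM = "stream"
    REQUEST_RESPONSE = "request_response"
    BATCH = "batch"

def choose_delivery_mode(source: str, deadline_hours: float) -> DeliveryMode:
    """Route a request to a delivery mode based on who is waiting and for how long."""
    if source == "web_ui":
        return DeliveryMode.STREAM           # a human is watching: optimize perceived latency
    if deadline_hours >= 24:
        return DeliveryMode.BATCH            # offline work: take the 50% discount
    return DeliveryMode.REQUEST_RESPONSE     # machine caller that still needs a prompt answer

# Example routing decisions
print(choose_delivery_mode("web_ui", 0))     # DeliveryMode.STREAM
print(choose_delivery_mode("cron", 48))      # DeliveryMode.BATCH
print(choose_delivery_mode("webhook", 0.5))  # DeliveryMode.REQUEST_RESPONSE
```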
6. Summary and Key Takeaways
- Batching = 50% Discount: Always use Batch APIs for tasks that aren't time-sensitive.
- Streaming for Early Exit: Use streams to allow users to cancel expensive generations early.
- Queue Architecture: Build a queue system that accumulates smaller tasks into a single large Batch file once a day.
- Latency vs. Liquidity: Choose the mode that balances user experience with financial sustainability.
In the next lesson, Using Speculative Decoding for Speed, we look at how to get the intelligence of a large model at the speed of a small one.
Exercise: The Batch Budgeter
- You have 1,000,000 rows of data to analyze.
- Calculate the cost using standard GPT-4o pricing ($15/M tokens).
- Calculate the cost using the Batch API.
- Determine the 'Time Value' of waiting (a worked sketch follows the exercise):
  - Is it worth $7.50 to wait 24 hours?
  - Is it worth $7,500.00 to wait 24 hours?
- Conclusion: At scale, Batching is not an option; it is a fiduciary requirement.
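If you want to check your numbers, here is a minimal worked sketch. It assumes the two dollar figures above correspond to rows of roughly 1 token and roughly 1,000 tokens each; adjust tokens_per_row for your actual data.

```python
ROWS = 1_000_000
PRICE_PER_M = 15.00   # standard price from the exercise, $ per 1M tokens
BATCH_DISCOUNT = 0.5  # Batch API: 50% off

for tokens_per_row in (1, 1_000):  # assumed per-row sizes
    total_tokens = ROWS * tokens_per_row
    standard = total_tokens / 1e6 * PRICE_PER_M
    batch = standard * BATCH_DISCOUNT
    print(f"{tokens_per_row:>5} tokens/row: standard ${standard:,.2f}, "
          f"batch ${batch:,.2f}, saving ${standard - batch:,.2f}")
```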