
Streaming vs. Batching: Delivery Economics
Master the economics of API delivery: learn how the Batch API can cut your token bill by 50%, and when to prioritize streaming for user experience.
How you receive tokens is as important as how many tokens you receive.
Most developers use streaming (Server-Sent Events) to give the user a real-time feel. But is streaming always the most efficient way to interact with an LLM? What if the task doesn't require a human to be watching (e.g., an overnight data extraction job)?
In this lesson, we master inference delivery modes. We'll explore the Batch API (the 50% discount), Streaming (the latency king), and Request/Response (the standard).
1. The Batch API (The Financial Superpower)
If a task can wait up to 24 hours (and often it only waits an hour), you should use the Batch API.
- The Deal: You send one file containing, say, 1,000 queries. The provider (OpenAI, Anthropic) processes them whenever it has spare GPU capacity.
- The Discount: 50% off input and output token prices.
Token ROI: Batching is equivalent to a 2x cost-efficiency gain on your prompts, with almost no engineering effort.
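To see that ROI concretely, here is a minimal cost sketch in Python. The per-million-token prices below are placeholder assumptions (check your provider's current price sheet); the only point is the 50% batch multiplier.

```python
# Placeholder prices -- substitute your provider's current rates.
INPUT_PRICE_PER_M = 5.00    # assumed standard input price, $ per 1M tokens
OUTPUT_PRICE_PER_M = 15.00  # assumed standard output price, $ per 1M tokens
BATCH_DISCOUNT = 0.5        # Batch API: 50% off both input and output

def job_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Estimate the dollar cost of a job in standard vs. batch delivery."""
    cost = (
        (input_tokens / 1e6) * INPUT_PRICE_PER_M
        + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M
    )
    return cost * (BATCH_DISCOUNT if batch else 1.0)

# Example: 1,000 requests averaging 2,000 input and 500 output tokens each
print(f"Standard: ${job_cost(2_000_000, 500_000):.2f}")
print(f"Batch:    ${job_cost(2_000_000, 500_000, batch=True):.2f}")
```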
2. Streaming: The Perceived Efficiency
Streaming doesn't save tokens by itself, but it saves user patience and, through early cancellation, real money.
- Token Efficiency Link: If a user spots a mistake in the first 5 tokens of a 1,000-token response, they can cancel the stream immediately.
- Savings: The remaining 995 output tokens are never generated, so you never pay for them. In a non-streaming app, you would have paid for the full 1,000 tokens before the user could even see the result.
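A minimal sketch of this early-exit pattern using the OpenAI Python SDK's streaming interface; the looks_wrong() check is a hypothetical, app-specific placeholder, not part of any API.

```python
from openai import OpenAI

client = OpenAI()

def looks_wrong(text: str) -> bool:
    # Hypothetical app-specific check: replace with your own validation logic.
    return "as an ai language model" in text.lower()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a 1,000-word summary of our Q3 sales data."}],
    stream=True,
)

received = ""
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    received += delta
    print(delta, end="", flush=True)
    if looks_wrong(received):
        # Closing the stream aborts the connection, so you only pay for
        # the tokens generated up to this point.
        stream.close()
        break
```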
3. Implementation: Batch Request Pattern (Python)
Python Code: Preparing a Batch Job
```python
import json

# items_to_extract is assumed to be an iterable of objects with .id and .text fields.
# 1. Create a JSONL file (one line per request) -- the 50% discount applies here!
with open("batch_tasks.jsonl", "w") as f:
    for item in items_to_extract:
        f.write(json.dumps({
            "custom_id": f"task-{item.id}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o",
                "messages": [{"role": "user", "content": item.text}],
            },
        }) + "\n")

# 2. Upload and start the batch job (see below).
# Result: success in < 24 hours at 50% of the standard price.
```
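Step 2 above is only a comment; for completeness, here is a sketch of the upload-and-submit step using the OpenAI Python SDK's Files and Batches endpoints (verify parameter names against the current SDK documentation).

```python
from openai import OpenAI

client = OpenAI()

# Upload the JSONL file created above.
batch_file = client.files.create(
    file=open("batch_tasks.jsonl", "rb"),
    purpose="batch",
)

# Start the batch job; the 24-hour completion window is what buys the 50% price.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print(batch.id, batch.status)
# Poll client.batches.retrieve(batch.id) until status == "completed",
# then download results via client.files.content(batch.output_file_id).
```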
4. Comparing the Modes
| Mode | Token Cost | User Latency | Use Case |
|---|---|---|---|
| Streaming | 100% | Ultra-low | Chatbots, co-pilots |
| Request/Response | 100% | High (waits for the full response) | Small data extraction |
| Batch API | 50% | Very high (up to 24 h) | Massive DB processing, ETL |
5. Decision Logic: The "Offline" Policy
Your application should have a Routing Layer (Module 14.3) for delivery modes.
- If the request is from a Web UI -> Stream.
- If the request is from a webhook or cron job -> Batch.
Senior Strategy: 80% of "agentic" backend work (background research, email drafting, log analysis) can be moved to the Batch API, effectively cutting those operating costs in half. A minimal routing sketch follows below.
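The sketch below illustrates such a routing layer. The source and deadline_hours fields are assumed properties of your own request object; adapt the thresholds to your product's latency budget.

```python
from enum import Enum

class DeliveryMode(Enum):
    STREAM = "stream"
    REQUEST_RESPONSE = "request_response"
    BATCH = "batch"

def choose_delivery_mode(source: str, deadline_hours: float) -> DeliveryMode:
    """Route a request to a delivery mode based on who is waiting and for how long."""
    if source == "web_ui":
        return DeliveryMode.STREAM           # a human is watching: optimize perceived latency
    if deadline_hours >= 24:
        return DeliveryMode.BATCH            # offline work: take the 50% discount
    return DeliveryMode.REQUEST_RESPONSE     # machine caller that still needs a prompt answer

# Example routing decisions
print(choose_delivery_mode("web_ui", 0))     # DeliveryMode.STREAM
print(choose_delivery_mode("cron", 48))      # DeliveryMode.BATCH
print(choose_delivery_mode("webhook", 0.5))  # DeliveryMode.REQUEST_RESPONSE
```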
6. Summary and Key Takeaways
- Batching = 50% Discount: Always use Batch APIs for tasks that aren't time-sensitive.
- Streaming for Early Exit: Use streams to allow users to cancel expensive generations early.
- Queue Architecture: Build a queue system that accumulates smaller tasks into a single large Batch file once a day.
- Latency vs. Liquidity: Choose the mode that balances user experience with financial sustainability.
In the next lesson, Using Speculative Decoding for Speed, we look at how to get the intelligence of a large model at the speed of a small one.
Exercise: The Batch Budgeter
- You have 1,000,000 rows of data to analyze.
- Calculate the cost using standard GPT-4o pricing ($15/M tokens).
- Calculate the cost using the Batch API.
- Determine the 'Time Value' of waiting (a worked sketch follows the exercise):
  - Is it worth $7.50 to wait 24 hours?
  - Is it worth $7,500.00 to wait 24 hours?
- Conclusion: At scale, Batching is not an option; it is a fiduciary requirement.
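If you want to check your numbers, here is a minimal worked sketch. It assumes the two dollar figures above correspond to rows of roughly 1 token and roughly 1,000 tokens each; adjust tokens_per_row for your actual data.

```python
ROWS = 1_000_000
PRICE_PER_M = 15.00   # standard price from the exercise, $ per 1M tokens
BATCH_DISCOUNT = 0.5  # Batch API: 50% off

for tokens_per_row in (1, 1_000):  # assumed per-row sizes
    total_tokens = ROWS * tokens_per_row
    standard = total_tokens / 1e6 * PRICE_PER_M
    batch = standard * BATCH_DISCOUNT
    print(f"{tokens_per_row:>5} tokens/row: standard ${standard:,.2f}, "
          f"batch ${batch:,.2f}, saving ${standard - batch:,.2f}")
```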