Module 7 Lesson 5: Batch Inference
Processing at scale. How to optimize Ollama for high-volume tasks like document digestion.
Batch Inference: High-Throughput Processing
Most of our use of Ollama has been "Interactive"—one person typing, one person reading. But what if you have 1,000 documents that need to be categorized? This requires Batch Inference.
1. What is "Batching"?
In AI, batching means grouping data together so the GPU can process it in one "swoop." The GPU is incredibly fast at doing math on big chunks of data but slow at moving small chunks back and forth.
Ollama handles this through the num_batch parameter.
2. Tuning num_batch
This parameter determines how many tokens are processed at once during the "Ingestion" phase (also called prefill), when the model is reading your long prompt.
- Default: 512
- High Performance: 2048 or 4096 (if you have a fast GPU with VRAM to spare).
If you are sending a massive 10,000-word prompt, a higher num_batch will significantly reduce the time you wait for the AI to start its answer.
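As a minimal sketch, here is one way to pass num_batch per request through the generate endpoint. It assumes a local Ollama server on the default port (localhost:11434) and an already-pulled llama3 model; the prompt and value are placeholders, so adjust them to your hardware.

import requests

# Ask the server to process the prompt in larger chunks during prefill.
# num_batch rides along in the per-request "options" field.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize the attached report in three bullet points...",
        "stream": False,
        "options": {
            "num_batch": 2048  # default is 512; raise only if VRAM allows
        },
    },
    timeout=600,
)

print(response.json()["response"])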
3. Parallelism vs. Batching
It's important to differentiate:
- Batching (Intra-request): Making one request faster by using more of the GPU for that one prompt.
- Parallelism (Inter-request): Running multiple requests from different users at the exact same time.
Starting with Ollama 0.1.33, you can serve multiple requests in parallel by setting the OLLAMA_NUM_PARALLEL environment variable before the server starts (e.g., OLLAMA_NUM_PARALLEL=4). This allocates 4 request slots for a loaded model (each slot gets its own share of the context memory), so 4 requests can be answered at the same time instead of queueing behind each other. A separate setting, OLLAMA_MAX_LOADED_MODELS, controls how many different models can sit in VRAM at once.
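A quick sketch of what that looks like from the client side, assuming the server was launched manually with the variable set (e.g., OLLAMA_NUM_PARALLEL=4 ollama serve) and a llama3 model is pulled; the prompts are placeholders.

from concurrent.futures import ThreadPoolExecutor
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def ask(prompt):
    # Each call occupies one of the server's parallel slots.
    r = requests.post(
        OLLAMA_URL,
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=600,
    )
    return r.json()["response"]

prompts = [
    "Summarize the plot of Hamlet in one sentence.",
    "List three uses for a paperclip.",
    "Translate 'good morning' into French.",
    "What is 17 * 23?",
]

# With OLLAMA_NUM_PARALLEL=4, these four requests are served concurrently
# instead of waiting in a single queue.
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)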
4. Practical Automation Tips
When processing a batch of files:
- Disable Streaming: Set "stream": false in your API call. It's more efficient for bulk data, and you get the whole answer back as a single JSON response (see the sketch after this list).
- Use Embeddings: For categorization, sometimes you don't need a full LLM; you just need to calculate the "Fingerprint" of the text (Module 10).
- Low Temperature: For bulk data extraction, set temperature to 0.0 or 0.1 so the model stays deterministic instead of getting "creative" with your data.
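Putting those tips together, here is a minimal batch-categorization sketch. The folder name, category labels, and model are hypothetical; it assumes a local Ollama server and a pulled llama3 model.

from pathlib import Path
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
CATEGORIES = "invoice, contract, resume, other"  # hypothetical label set

def categorize(text):
    # Non-streaming request with near-zero temperature for consistent labels.
    r = requests.post(
        OLLAMA_URL,
        json={
            "model": "llama3",
            "prompt": f"Classify this document as one of: {CATEGORIES}.\n"
                      f"Reply with the category only.\n\n{text[:4000]}",
            "stream": False,
            "options": {"temperature": 0.0},
        },
        timeout=600,
    )
    return r.json()["response"].strip()

# Hypothetical input folder of plain-text documents.
for doc in Path("documents").glob("*.txt"):
    label = categorize(doc.read_text(encoding="utf-8"))
    print(f"{doc.name}: {label}")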
Key Takeaways
- num_batch controls how much of your prompt is read at once.
- Increase num_batch for long documents and fast GPUs.
- Use OLLAMA_NUM_PARALLEL to host a multi-user AI server locally.
- Non-streaming is preferred for automated background tasks.