Module 7 Lesson 5: Batch Inference
Processing at scale. How to optimize Ollama for high-volume tasks like document digestion.
Batch Inference: High-Throughput Processing
Most of our use of Ollama has been "Interactive"—one person typing, one person reading. But what if you have 1,000 documents that need to be categorized? This requires Batch Inference.
1. What is "Batching"?
In AI, batching means grouping data together so the GPU can process it in one "swoop." The GPU is incredibly fast at doing math on big chunks of data but slow at moving small chunks back and forth.
Ollama handles this through the num_batch parameter.
2. Tuning num_batch
This parameter determines how many tokens are processed at once during the "Ingestion" phase (also called prefill), when the model is reading your long prompt.
- Default: 512
- High Performance: 2048 or 4096 (if you have a fast GPU with VRAM to spare).
If you are sending a massive 10,000-word prompt, a higher num_batch will significantly reduce the time you wait for the AI to start its answer.
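As a minimal sketch, here is one way to pass num_batch per request through the generate endpoint. It assumes a local Ollama server on the default port (localhost:11434) and an already-pulled llama3 model; the prompt and value are placeholders, so adjust them to your hardware.

import requests

# Ask the server to process the prompt in larger chunks during prefill.
# num_batch rides along in the per-request "options" field.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize the attached report in three bullet points...",
        "stream": False,
        "options": {
            "num_batch": 2048  # default is 512; raise only if VRAM allows
        },
    },
    timeout=600,
)

print(response.json()["response"])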
3. Parallelism vs. Batching
It's important to differentiate:
- Batching (Intra-request): Making one request faster by using more of the GPU for that one prompt.
- Parallelism (Inter-request): Running multiple requests from different users at the exact same time.
Starting with Ollama 0.1.33, you can serve multiple requests in parallel by setting the OLLAMA_NUM_PARALLEL environment variable before the server starts (e.g., OLLAMA_NUM_PARALLEL=4). This allocates 4 request slots for a loaded model (each slot gets its own share of the context memory), so 4 requests can be answered at the same time instead of queueing behind each other. A separate setting, OLLAMA_MAX_LOADED_MODELS, controls how many different models can sit in VRAM at once.
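A quick sketch of what that looks like from the client side, assuming the server was launched manually with the variable set (e.g., OLLAMA_NUM_PARALLEL=4 ollama serve) and a llama3 model is pulled; the prompts are placeholders.

from concurrent.futures import ThreadPoolExecutor
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def ask(prompt):
    # Each call occupies one of the server's parallel slots.
    r = requests.post(
        OLLAMA_URL,
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=600,
    )
    return r.json()["response"]

prompts = [
    "Summarize the plot of Hamlet in one sentence.",
    "List three uses for a paperclip.",
    "Translate 'good morning' into French.",
    "What is 17 * 23?",
]

# With OLLAMA_NUM_PARALLEL=4, these four requests are served concurrently
# instead of waiting in a single queue.
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)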
4. Practical Automation Tips
When processing a batch of files:
- Disable Streaming: Set "stream": false in your API call. It's more efficient for bulk data, and you get the whole answer back as a single JSON response (see the sketch after this list).
- Use Embeddings: For categorization, sometimes you don't need a full LLM; you just need to calculate the "Fingerprint" of the text (Module 10).
- Low Temperature: For bulk data extraction, set temperature to 0.0 or 0.1 so the model stays deterministic instead of getting "creative" with your data.
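Putting those tips together, here is a minimal batch-categorization sketch. The folder name, category labels, and model are hypothetical; it assumes a local Ollama server and a pulled llama3 model.

from pathlib import Path
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
CATEGORIES = "invoice, contract, resume, other"  # hypothetical label set

def categorize(text):
    # Non-streaming request with near-zero temperature for consistent labels.
    r = requests.post(
        OLLAMA_URL,
        json={
            "model": "llama3",
            "prompt": f"Classify this document as one of: {CATEGORIES}.\n"
                      f"Reply with the category only.\n\n{text[:4000]}",
            "stream": False,
            "options": {"temperature": 0.0},
        },
        timeout=600,
    )
    return r.json()["response"].strip()

# Hypothetical input folder of plain-text documents.
for doc in Path("documents").glob("*.txt"):
    label = categorize(doc.read_text(encoding="utf-8"))
    print(f"{doc.name}: {label}")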
Key Takeaways
- num_batch controls how much of your prompt is read at once.
- Increase num_batch for long documents and fast GPUs.
- Use OLLAMA_NUM_PARALLEL to host a multi-user AI server locally.
- Non-streaming is preferred for automated background tasks.