Module 13 Lesson 3: Concurrency and Parallelism
Serving the crowd. How to configure Ollama to handle multiple concurrent user requests.
Concurrency: Multi-User Local AI
By default, Ollama processes requests serially, one at a time. If User A is asking a long question, User B has to wait in a queue until User A finishes. That is unacceptable if you are building an AI tool for a team.
Starting with Ollama 0.1.33, you can unlock parallelism with two environment variables.
1. OLLAMA_NUM_PARALLEL
This environment variable tells Ollama how many request "workers" to keep active, i.e. how many prompts it will process at the same time (see the sketch after the list below).
If you set OLLAMA_NUM_PARALLEL=4:
- Ollama reserves context (KV cache) space in VRAM for 4 separate request slots.
- 4 separate prompts can be processed at the same time.
- The cost: you need roughly 4x the VRAM for the KV cache (the short-term memory that holds each conversation's context).
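As a quick check, here is a minimal Python sketch that fires four prompts at the local server at once and times each one. It assumes the default endpoint on port 11434, a model named llama3 that has already been pulled, and it uses the /api/generate endpoint with streaming disabled; adjust the names to your setup.

```python
# Fire several prompts at a local Ollama server at once and time each request.
# Assumes the default endpoint http://localhost:11434 and a pulled "llama3" model.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"

def ask(prompt: str) -> float:
    """Send one non-streaming generate request and return its wall-clock time."""
    payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        json.load(resp)  # wait for the full response body
    return time.perf_counter() - start

prompts = [f"Explain topic {i} in one paragraph." for i in range(4)]

# With OLLAMA_NUM_PARALLEL=4 the requests should overlap; with serial handling
# they queue up and the later ones take noticeably longer.
with ThreadPoolExecutor(max_workers=4) as pool:
    for i, secs in enumerate(pool.map(ask, prompts)):
        print(f"request {i}: {secs:.1f}s")
```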
2. OLLAMA_MAX_LOADED_MODELS
What if User A wants Llama 3 but User B wants Mistral? By default, Ollama may unload one model to load the other, which is slow.
Set OLLAMA_MAX_LOADED_MODELS=2 to keep both models "hot" in memory at the same time.
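Here is a minimal sketch of the two-model case, assuming both llama3 and mistral have already been pulled and the server is running on the default port:

```python
# With OLLAMA_MAX_LOADED_MODELS=2, concurrent requests for two different models
# should not force Ollama to evict one model to serve the other.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:11434/api/generate"

def ask(model: str, prompt: str) -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"][:60]  # first 60 chars of the answer

with ThreadPoolExecutor(max_workers=2) as pool:
    a = pool.submit(ask, "llama3", "Summarise TCP in one sentence.")
    b = pool.submit(ask, "mistral", "Summarise UDP in one sentence.")
    print("llama3 :", a.result())
    print("mistral:", b.result())
```

In recent Ollama versions, running ollama ps afterwards lists which models are currently loaded, so you can confirm that both stayed resident instead of evicting each other.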
3. The Performance Trade-off
Parallelism isn't free.
- Speed: if 4 people are chatting at once, each person receives tokens at roughly 1/4 of the single-user speed.
- Memory: each extra slot needs its own KV cache, so you can run out of VRAM faster (see the sizing sketch below).
Recommendation: only use parallelism if you have plenty of VRAM (24 GB or more) or are running very small models (1B or 3B).
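To make the memory cost concrete, here is a rough back-of-the-envelope estimate. It assumes Llama-3-8B-like dimensions (32 layers, 8 KV heads, head size 128), an 8,192-token context per slot, and an fp16 cache; the real figure depends on the model, the context length, and any KV-cache quantization.

```python
# Rough KV-cache sizing:
# bytes = 2 (K and V) * layers * context * kv_heads * head_dim * bytes_per_value
# Assumed Llama-3-8B-like dimensions; adjust for your model and cache precision.
layers, kv_heads, head_dim = 32, 8, 128
context_per_slot = 8192          # tokens each parallel slot can hold
bytes_per_value = 2              # fp16
num_parallel = 4                 # OLLAMA_NUM_PARALLEL

per_slot = 2 * layers * context_per_slot * kv_heads * head_dim * bytes_per_value
total = per_slot * num_parallel
print(f"KV cache per slot: {per_slot / 2**30:.1f} GiB")                 # ~1.0 GiB
print(f"KV cache for {num_parallel} slots: {total / 2**30:.1f} GiB")    # ~4.0 GiB
```

Under these assumptions, four parallel slots cost about 4 GiB of VRAM for the KV cache alone, on top of the model weights.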
4. How to set it (Linux/Windows)
- Linux (systemd): run systemctl edit ollama.service, add Environment="OLLAMA_NUM_PARALLEL=4" under the [Service] section, then restart the service with systemctl restart ollama.
- Windows: quit Ollama, set a system environment variable (OLLAMA_NUM_PARALLEL) through the Control Panel, then start Ollama again.
5. Identifying Bottlenecks
If your num_parallel is high but users still report slowness, check your CPU and RAM. Handling multiple requests at once multiplies the prompt-processing ("pre-fill") work, which is heavy on compute and memory bandwidth, especially when part of the model is offloaded to the CPU.
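One way to see where the time goes: the non-streaming /api/generate response includes timing fields (prompt_eval_duration for pre-fill and eval_duration for generation, both in nanoseconds), so you can split a request's latency into the two phases. A minimal sketch, assuming the same local server and llama3 model as above:

```python
# Split one request's time into prompt processing ("pre-fill") and token
# generation, using the timing fields Ollama returns for non-streaming calls.
# Field names follow the Ollama REST API docs; durations are in nanoseconds.
import json
import urllib.request

URL = "http://localhost:11434/api/generate"
payload = json.dumps({
    "model": "llama3",
    "prompt": "Summarise the plot of Hamlet in three sentences.",
    "stream": False,
}).encode()
req = urllib.request.Request(URL, data=payload,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

prefill_s = stats.get("prompt_eval_duration", 0) / 1e9
decode_s = stats.get("eval_duration", 0) / 1e9
print(f"pre-fill: {prefill_s:.2f}s for {stats.get('prompt_eval_count', 0)} prompt tokens")
print(f"decode  : {decode_s:.2f}s for {stats.get('eval_count', 0)} generated tokens")
```

If the pre-fill portion dominates and grows as you add concurrent users, the bottleneck is prompt processing (CPU, memory bandwidth, or an overloaded GPU) rather than token generation.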
Key Takeaways
- Ollama can handle parallel requests with the correct configuration.
- OLLAMA_NUM_PARALLEL determines how many users can chat simultaneously.
- OLLAMA_MAX_LOADED_MODELS allows multiple different models to stay loaded in memory.
- Each parallel slot requires additional VRAM for the context window.