Module 13 Lesson 3: Concurrency and Parallelism
Serving the crowd. How to configure Ollama to handle multiple concurrent user requests.
Concurrency: Multi-User Local AI
By default, Ollama processes requests serially, one at a time. If User A is asking a long question, User B has to wait in a queue until User A finishes. That is unacceptable if you are building an AI tool for a team.
Starting with Ollama 0.1.33, you can unlock parallelism with two environment variables.
1. OLLAMA_NUM_PARALLEL
This environment variable tells Ollama how many request "workers" to keep active, i.e. how many prompts it will process at the same time (see the sketch after the list below).
If you set OLLAMA_NUM_PARALLEL=4:
- Ollama reserves context (KV cache) space in VRAM for 4 separate request slots.
- 4 separate prompts can be processed at the same time.
- The cost: you need roughly 4x the VRAM for the KV cache (the short-term memory that holds each conversation's context).
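As a quick check, here is a minimal Python sketch that fires four prompts at the local server at once and times each one. It assumes the default endpoint on port 11434, a model named llama3 that has already been pulled, and it uses the /api/generate endpoint with streaming disabled; adjust the names to your setup.

```python
# Fire several prompts at a local Ollama server at once and time each request.
# Assumes the default endpoint http://localhost:11434 and a pulled "llama3" model.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"

def ask(prompt: str) -> float:
    """Send one non-streaming generate request and return its wall-clock time."""
    payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        json.load(resp)  # wait for the full response body
    return time.perf_counter() - start

prompts = [f"Explain topic {i} in one paragraph." for i in range(4)]

# With OLLAMA_NUM_PARALLEL=4 the requests should overlap; with serial handling
# they queue up and the later ones take noticeably longer.
with ThreadPoolExecutor(max_workers=4) as pool:
    for i, secs in enumerate(pool.map(ask, prompts)):
        print(f"request {i}: {secs:.1f}s")
```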
2. OLLAMA_MAX_LOADED_MODELS
What if User A wants Llama 3 but User B wants Mistral? By default, Ollama may unload one model to load the other, which is slow.
Set OLLAMA_MAX_LOADED_MODELS=2 to keep both models "hot" in memory at the same time.
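Here is a minimal sketch of the two-model case, assuming both llama3 and mistral have already been pulled and the server is running on the default port:

```python
# With OLLAMA_MAX_LOADED_MODELS=2, concurrent requests for two different models
# should not force Ollama to evict one model to serve the other.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:11434/api/generate"

def ask(model: str, prompt: str) -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"][:60]  # first 60 chars of the answer

with ThreadPoolExecutor(max_workers=2) as pool:
    a = pool.submit(ask, "llama3", "Summarise TCP in one sentence.")
    b = pool.submit(ask, "mistral", "Summarise UDP in one sentence.")
    print("llama3 :", a.result())
    print("mistral:", b.result())
```

In recent Ollama versions, running ollama ps afterwards lists which models are currently loaded, so you can confirm that both stayed resident instead of evicting each other.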
3. The Performance Trade-off
Parallelism isn't free.
- Speed: if 4 people are chatting at once, each person receives tokens at roughly 1/4 of the single-user speed.
- Memory: each extra slot needs its own KV cache, so you can run out of VRAM faster (see the sizing sketch below).
Recommendation: only use parallelism if you have plenty of VRAM (24 GB or more) or are running very small models (1B or 3B).
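To make the memory cost concrete, here is a rough back-of-the-envelope estimate. It assumes Llama-3-8B-like dimensions (32 layers, 8 KV heads, head size 128), an 8,192-token context per slot, and an fp16 cache; the real figure depends on the model, the context length, and any KV-cache quantization.

```python
# Rough KV-cache sizing:
# bytes = 2 (K and V) * layers * context * kv_heads * head_dim * bytes_per_value
# Assumed Llama-3-8B-like dimensions; adjust for your model and cache precision.
layers, kv_heads, head_dim = 32, 8, 128
context_per_slot = 8192          # tokens each parallel slot can hold
bytes_per_value = 2              # fp16
num_parallel = 4                 # OLLAMA_NUM_PARALLEL

per_slot = 2 * layers * context_per_slot * kv_heads * head_dim * bytes_per_value
total = per_slot * num_parallel
print(f"KV cache per slot: {per_slot / 2**30:.1f} GiB")                 # ~1.0 GiB
print(f"KV cache for {num_parallel} slots: {total / 2**30:.1f} GiB")    # ~4.0 GiB
```

Under these assumptions, four parallel slots cost about 4 GiB of VRAM for the KV cache alone, on top of the model weights.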
4. How to set it (Linux/Windows)
- Linux (systemd): run systemctl edit ollama.service, add Environment="OLLAMA_NUM_PARALLEL=4" under the [Service] section, then restart the service with systemctl restart ollama.
- Windows: quit Ollama, set a system environment variable (OLLAMA_NUM_PARALLEL) through the Control Panel, then start Ollama again.
5. Identifying Bottlenecks
If your num_parallel is high but users still report slowness, check your CPU and RAM. Handling multiple requests at once multiplies the prompt-processing ("pre-fill") work, which is heavy on compute and memory bandwidth, especially when part of the model is offloaded to the CPU.
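One way to see where the time goes: the non-streaming /api/generate response includes timing fields (prompt_eval_duration for pre-fill and eval_duration for generation, both in nanoseconds), so you can split a request's latency into the two phases. A minimal sketch, assuming the same local server and llama3 model as above:

```python
# Split one request's time into prompt processing ("pre-fill") and token
# generation, using the timing fields Ollama returns for non-streaming calls.
# Field names follow the Ollama REST API docs; durations are in nanoseconds.
import json
import urllib.request

URL = "http://localhost:11434/api/generate"
payload = json.dumps({
    "model": "llama3",
    "prompt": "Summarise the plot of Hamlet in three sentences.",
    "stream": False,
}).encode()
req = urllib.request.Request(URL, data=payload,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

prefill_s = stats.get("prompt_eval_duration", 0) / 1e9
decode_s = stats.get("eval_duration", 0) / 1e9
print(f"pre-fill: {prefill_s:.2f}s for {stats.get('prompt_eval_count', 0)} prompt tokens")
print(f"decode  : {decode_s:.2f}s for {stats.get('eval_count', 0)} generated tokens")
```

If the pre-fill portion dominates and grows as you add concurrent users, the bottleneck is prompt processing (CPU, memory bandwidth, or an overloaded GPU) rather than token generation.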
Key Takeaways
- Ollama can handle parallel requests with the correct configuration.
- OLLAMA_NUM_PARALLEL determines how many users can chat simultaneously.
- OLLAMA_MAX_LOADED_MODELS allows multiple different models to stay loaded in memory.
- Each parallel slot requires additional VRAM for the context window.