Module 4 Lesson 6: Performance Trade-offs
Optimization 101. Balancing speed vs quality vs memory in your local AI setup.
Performance Trade-offs: The Engineer's Choice
As we wrap up our look at model internals, we need to talk about the "Three-Way Tug of War" that every local AI engineer manages: speed, quality, and memory. When you pull one of these levers in Ollama, the other two move.
1. The Quality vs. Size Trade-off
The Lever: Quantization (Q4 vs Q8)
- Increase Quality: Move to Q8 or FP16.
- The Cost: Roughly 2x (Q8) to 4x (FP16) more VRAM than Q4 (see the sizing sketch after this list). If you run out of VRAM, Ollama spills layers to the CPU, and generation speed can drop by an order of magnitude.
- The Recommendation: Stick to Q4_K_M. The quality loss is barely perceptible for most tasks, but the speed and VRAM savings are massive.
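To make the VRAM math concrete, here is a back-of-envelope sketch. The bits-per-weight figures are rough approximations for llama.cpp-style quants (actual GGUF files vary slightly), and weights are only part of the footprint, since the KV cache and activations come on top.

```python
# Back-of-envelope VRAM estimate for the model weights alone.
# Bits-per-weight values are approximations; real files vary.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,   # ~4.5-5 effective bits per weight
    "Q8_0": 8.5,     # 8-bit weights plus per-block scales
    "FP16": 16.0,
}

def weight_vram_gib(params_billion: float, quant: str) -> float:
    """Approximate GiB needed just to hold the weights."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1024**3

for quant in BITS_PER_WEIGHT:
    print(f"8B model @ {quant:7}: ~{weight_vram_gib(8, quant):4.1f} GiB")
# 8B model @ Q4_K_M : ~ 4.5 GiB
# 8B model @ Q8_0   : ~ 7.9 GiB
# 8B model @ FP16   : ~14.9 GiB
```

An 8 GB GPU comfortably holds the Q4 weights with room for context, while FP16 won't even load. That gap is the whole trade-off.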
2. The Speed vs. Context Trade-off
The Lever: num_ctx (Context Window)
- Increase Context: Allow the model to "remember" a 10,000-line codebase.
- The Cost: Every token of context reserves "KV cache" space in VRAM. Large context windows slow down Time to First Token, and over-allocating can crash the runtime with an out-of-memory error (a rough sizing sketch follows this list).
- The Recommendation: Use 8,192 for general chat. Only increase to 32k+ when performing specific document analysis.
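How much VRAM does context actually reserve? A minimal sketch, assuming a Llama-3-8B-class model with grouped-query attention (32 layers, 8 KV heads, head dimension 128, FP16 cache). Check your own model's config; these dimensions are assumptions, and Ollama may quantize the cache.

```python
# Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x ctx x bytes.
def kv_cache_gib(n_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1024**3

for ctx in (2048, 8192, 32768):
    print(f"num_ctx={ctx:>6}: ~{kv_cache_gib(ctx):.2f} GiB of KV cache")
# num_ctx=  2048: ~0.25 GiB
# num_ctx=  8192: ~1.00 GiB
# num_ctx= 32768: ~4.00 GiB
```

Going from 8k to 32k context quadruples the cache on top of the weights; that is the "higher crash risk" in the summary matrix below.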
3. The Batch Size Trade-off
The Lever: num_batch
- What it is: How many prompt tokens the model processes in parallel during prompt evaluation (prefill).
- Increase Batch: Speeds up ingesting a large document into the prompt.
- The Cost: Higher peak VRAM usage.
- The Recommendation: Let Ollama handle this automatically. It is highly optimized out of the box, though you can override it per request, as shown below.
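If you do want to experiment, here is a hedged sketch of overriding num_batch through the REST API's options map. The model name is a placeholder for whatever you have pulled, and the value shown is illustrative, not a recommendation.

```python
import requests

# Sketch: override num_batch for a single /api/generate call.
# A smaller batch lowers peak VRAM during prefill at the cost of
# slower prompt ingestion; the default is usually the right choice.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",          # placeholder: any pulled model
        "prompt": "Summarize this document: ...",
        "stream": False,
        "options": {"num_batch": 256},   # illustrative value only
    },
)
print(resp.json()["response"])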
How to Benchmark Your Setup
If you want to know how your machine is performing, look at the timing metadata Ollama attaches to every response. The quickest way to see it is to run ollama run <model> --verbose, which prints the stats after each answer.

Run a prompt and look for these two metrics:
- Prompt eval rate: How fast the model reads your input, in tokens/sec (prompt_eval_count divided by prompt_eval_duration).
- Eval rate: How fast the model generates the answer, in tokens/sec (eval_count divided by eval_duration).

Goal: You want your eval rate above 10 tokens/sec for the output to feel comfortable to read. If it's below 5, it's time to use a smaller model or a more aggressive quantization.
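Here is a minimal scripted version of the same benchmark against the local REST API. With "stream": False, the final JSON includes timing fields measured in nanoseconds; the model name is a placeholder for one you have pulled.

```python
import requests

# Benchmark sketch: compute tokens/sec from Ollama's timing fields.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",   # placeholder: any pulled model
        "prompt": "Explain the KV cache in one paragraph.",
        "stream": False,
    },
).json()

# Note: if the prompt was served from cache, the prompt_eval_* fields
# can be missing or tiny; guard against that in a real script.
prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
gen_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)

print(f"prompt eval rate: {prompt_tps:.1f} tokens/s")  # reading speed
print(f"eval rate:        {gen_tps:.1f} tokens/s")     # generation speed
```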
Summary Matrix
| To Get More... | Do This: | But Watch Out For: |
|---|---|---|
| Speed | Use a smaller model (3B) | Lower reasoning ability |
| Accuracy | Use a higher-precision quant (Q8) | Slower loading, more VRAM |
| Context (recall) | Increase num_ctx | Slower responses, higher OOM/crash risk |
| Stability | Use num_gpu 0 (CPU only) | Extreme slowness |
Key Takeaways
- Engineering is the art of balancing quality against hardware limits.
- Q4_K_M is the "Sweet Spot" for 90% of users.
- Always monitor your tokens per second (eval rate) to ensure a good user experience.
- Don't over-allocate context unless the task actually needs it.