Module 4 Lesson 6: Performance Trade-offs
Optimization 101. Balancing speed vs quality vs memory in your local AI setup.
Performance Trade-offs: The Engineer's Choice
As we wrap up our look at model internals, we need to talk about the "Three-Way Tug of War" that every local AI engineer manages: speed, quality, and memory. When you pull one of these levers in Ollama, the other two move.
1. The Quality vs. Size Trade-off
The Lever: Quantization (Q4 vs Q8)
- Increase Quality: Move to Q8 or FP16.
- The Cost: Roughly 2x (Q8) to 4x (FP16) more VRAM than Q4 (see the sizing sketch after this list). If you run out of VRAM, Ollama spills layers to the CPU, and generation speed can drop by an order of magnitude.
- The Recommendation: Stick to Q4_K_M. The quality loss is barely perceptible for most tasks, but the speed and VRAM savings are massive.
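To make the VRAM math concrete, here is a back-of-envelope sketch. The bits-per-weight figures are rough approximations for llama.cpp-style quants (actual GGUF files vary slightly), and weights are only part of the footprint, since the KV cache and activations come on top.

```python
# Back-of-envelope VRAM estimate for the model weights alone.
# Bits-per-weight values are approximations; real files vary.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,   # ~4.5-5 effective bits per weight
    "Q8_0": 8.5,     # 8-bit weights plus per-block scales
    "FP16": 16.0,
}

def weight_vram_gib(params_billion: float, quant: str) -> float:
    """Approximate GiB needed just to hold the weights."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1024**3

for quant in BITS_PER_WEIGHT:
    print(f"8B model @ {quant:7}: ~{weight_vram_gib(8, quant):4.1f} GiB")
# 8B model @ Q4_K_M : ~ 4.5 GiB
# 8B model @ Q8_0   : ~ 7.9 GiB
# 8B model @ FP16   : ~14.9 GiB
```

An 8 GB GPU comfortably holds the Q4 weights with room for context, while FP16 won't even load. That gap is the whole trade-off.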
2. The Speed vs. Context Trade-off
The Lever: num_ctx (Context Window)
- Increase Context: Allow the model to "remember" a 10,000-line codebase.
- The Cost: Every token of context reserves "KV cache" space in VRAM. Large context windows slow down Time to First Token, and over-allocating can crash the runtime with an out-of-memory error (a rough sizing sketch follows this list).
- The Recommendation: Use 8,192 for general chat. Only increase to 32k+ when performing specific document analysis.
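How much VRAM does context actually reserve? A minimal sketch, assuming a Llama-3-8B-class model with grouped-query attention (32 layers, 8 KV heads, head dimension 128, FP16 cache). Check your own model's config; these dimensions are assumptions, and Ollama may quantize the cache.

```python
# Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x ctx x bytes.
def kv_cache_gib(n_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1024**3

for ctx in (2048, 8192, 32768):
    print(f"num_ctx={ctx:>6}: ~{kv_cache_gib(ctx):.2f} GiB of KV cache")
# num_ctx=  2048: ~0.25 GiB
# num_ctx=  8192: ~1.00 GiB
# num_ctx= 32768: ~4.00 GiB
```

Going from 8k to 32k context quadruples the cache on top of the weights; that is the "higher crash risk" in the summary matrix below.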
3. The Batch Size Trade-off
The Lever: num_batch
- What it is: How many prompt tokens the model processes in parallel during prompt evaluation (prefill).
- Increase Batch: Speeds up ingesting a large document into the prompt.
- The Cost: Higher peak VRAM usage.
- The Recommendation: Let Ollama handle this automatically. It is highly optimized out of the box, though you can override it per request, as shown below.
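If you do want to experiment, here is a hedged sketch of overriding num_batch through the REST API's options map. The model name is a placeholder for whatever you have pulled, and the value shown is illustrative, not a recommendation.

```python
import requests

# Sketch: override num_batch for a single /api/generate call.
# A smaller batch lowers peak VRAM during prefill at the cost of
# slower prompt ingestion; the default is usually the right choice.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",          # placeholder: any pulled model
        "prompt": "Summarize this document: ...",
        "stream": False,
        "options": {"num_batch": 256},   # illustrative value only
    },
)
print(resp.json()["response"])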
How to Benchmark Your Setup
If you want to know how your machine is performing, look at the timing metadata Ollama attaches to every response. The quickest way to see it is to run ollama run <model> --verbose, which prints the stats after each answer.

Run a prompt and look for these two metrics:
- Prompt eval rate: How fast the model reads your input, in tokens/sec (prompt_eval_count divided by prompt_eval_duration).
- Eval rate: How fast the model generates the answer, in tokens/sec (eval_count divided by eval_duration).

Goal: You want your eval rate above 10 tokens/sec for the output to feel comfortable to read. If it's below 5, it's time to use a smaller model or a more aggressive quantization.
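Here is a minimal scripted version of the same benchmark against the local REST API. With "stream": False, the final JSON includes timing fields measured in nanoseconds; the model name is a placeholder for one you have pulled.

```python
import requests

# Benchmark sketch: compute tokens/sec from Ollama's timing fields.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",   # placeholder: any pulled model
        "prompt": "Explain the KV cache in one paragraph.",
        "stream": False,
    },
).json()

# Note: if the prompt was served from cache, the prompt_eval_* fields
# can be missing or tiny; guard against that in a real script.
prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
gen_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)

print(f"prompt eval rate: {prompt_tps:.1f} tokens/s")  # reading speed
print(f"eval rate:        {gen_tps:.1f} tokens/s")     # generation speed
```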
Summary Matrix
| To Get More... | Do This: | But Watch Out For: |
|---|---|---|
| Speed | Use a smaller model (3B) | Lower reasoning ability |
| Accuracy | Use a higher-precision quant (Q8) | Slower loading, more VRAM |
| Context (recall) | Increase num_ctx | Slower responses, higher OOM/crash risk |
| Stability | Use num_gpu 0 (CPU only) | Extreme slowness |
Key Takeaways
- Engineering is the art of balancing quality against hardware limits.
- Q4_K_M is the "Sweet Spot" for 90% of users.
- Always monitor your tokens per second (eval rate) to ensure a good user experience.
- Don't over-allocate context unless the task actually needs it.