Module 6 Lesson 5: Quantization Options
Going deep on compression. Exploring the technical differences between legacy quants like Q4_0 and modern K-quants like Q4_K_M and Q5_K_M.
Quantization Options: Picking Your Depth
When you download a GGUF from Hugging Face, you'll see dozens of files with names like Q4_0, Q4_K_S, and Q5_K_M. Pick the wrong one and the model might be too slow or too stupid for the job.
Let's break down the naming convention so you can choose with confidence.
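Once you know which quant you want, you can grab just that one file instead of cloning the whole repo. A minimal sketch using huggingface-cli; the repo ID and filename below are illustrative, so substitute the actual model you're after:

```bash
# Download only the Q4_K_M file from a GGUF repo.
# Repo ID and filename are placeholders; check the model's "Files" tab for the real names.
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir ./models
```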
1. The Legacy Formats (Avoid these)
- Q4_0 / Q4_1: These are the original quantization methods. They are fast but lose more intelligence than the newer versions. Only use these if you have a very old CPU.
2. The "K-Quants" (Use these!)
These are the modern standard. They use a mixed-precision technique: the most sensitive weights (such as the attention and output tensors) are kept at higher precision, while less important ones are squeezed hard.
- Q4_K_M (Medium): The Best Choice. It has almost no perceptible loss in quality but fits an 8B model into ~5GB (see the size sketch after this list).
- Q4_K_S (Small): Slightly smaller than "M," but you might start to notice the model missing nuanced instructions.
- Q5_K_M: If you have the RAM, this is the "Sweet Spot" for maximum intelligence. It feels almost identical to the uncompressed model.
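To get a feel for how these choices translate into disk and RAM, you can ballpark file sizes from bits-per-weight. A quick sketch; the bpw figures here are rough approximations for current llama.cpp K-quants and drift between releases:

```bash
# Rough GGUF size: parameters (billions) x bits-per-weight / 8 = GB.
# bpw values are approximate, not exact download sizes.
awk -v params=8 'BEGIN {
  n = split("Q2_K:2.6 Q3_K_M:3.9 Q4_K_M:4.85 Q5_K_M:5.7 Q8_0:8.5", q, " ")
  for (i = 1; i <= n; i++) {
    split(q[i], p, ":")
    printf "%-7s at %gB params: ~%.1f GB\n", p[1], params, params * p[2] / 8
  }
}'
```

Change params=8 to 70 and you'll see why the extreme quants in the next section exist at all.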
3. The "Extreme" Quants
- Q2_K / Q3_K: These make the model tiny. A 70B model (usually 140GB at F16) can fit into roughly 20-25GB.
- Trade-off: The model may "stutter" (repeat itself or lose coherence) and tends to give short, bland answers; see the perplexity sketch below for a way to measure this.
- Best for: small chatbots that only need to handle very basic tasks on low-end hardware.
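If you want numbers instead of vibes on that quality cliff, llama.cpp's perplexity tool works (lower perplexity is better). A sketch, assuming you've built llama.cpp and have some plain-text file to evaluate against:

```bash
# Compare perplexity between two quants of the same model (lower is better).
# eval.txt is any plain-text corpus you choose; heavy quants show a clear jump.
./llama-perplexity -m my-model-q4_k_m.gguf -f eval.txt
./llama-perplexity -m my-model-q2_k.gguf -f eval.txt
```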
4. Summary Selection Guide
| If you have... | Choose this quant | Reason |
|---|---|---|
| Budget RAM (8GB) | Q4_K_M | Best balance of everything. |
| High-end GPU (24GB+) | Q6_K or Q8_0 | Absolute maximum accuracy. |
| Old Laptop (4GB) | Q2_K or Q3_K_L | Only way it will run. |
| Headless Server | Q5_K_M | High accuracy for production tasks. |
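You can act on this table directly in Ollama: most registry models publish a tag per quant. A hedged example; exact tag names vary per model, so check the Tags tab on the model's library page:

```bash
# Pull a specific quant instead of the default (usually Q4_K_M) tag.
# This tag is illustrative; your model's tag list may differ.
ollama pull llama3:8b-instruct-q5_K_M
```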
5. How to Quantize Your Own
If you converted a model to F16 in the last lesson, you can use the llama-quantize tool (from llama.cpp) to compress it:
```bash
./llama-quantize my-model-f16.gguf my-model-q4_k_m.gguf Q4_K_M
```
This will take a few minutes and create a new, smaller file that is ready for Ollama.
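From there, a minimal Modelfile pointing at the new file gets it into Ollama. The filenames match the example command above; the local model name is up to you:

```bash
# Register the quantized GGUF with Ollama under a local model name.
cat > Modelfile <<'EOF'
FROM ./my-model-q4_k_m.gguf
EOF

ollama create my-model -f Modelfile
ollama run my-model
```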
Key Takeaways
- K-Quants (_K_S / _K_M) are superior to legacy formats (_0 / _1).
- Q4_K_M is the default for almost all Ollama registry models.
- Q5_K_M provides a noticeable "smartness" boost if you have the extra 1-2GB of RAM.
- Avoid Q2 unless you have no other choice; the "Intelligence Cliff" is real.