Module 3 Lesson 4: Model Sizes and Variants
Understanding the trade-offs of scale. Why a 70B model is smarter than an 8B model, and why you might not want to use it.
Model Sizes and Variants: Bigger Isn't Always Better
In the world of local LLMs, you are constantly trading off three things: Intelligence, Speed, and VRAM. Understanding "Scaling Laws" will help you decide which model is "good enough" for your specific use case.
The Intelligence Spectrum
The "Tiny" Tier (0.5B - 3B Parameters)
- Examples: qwen:0.5b, phi3:mini, tinyllama.
- Intelligence: High-school level. Good for simple summarization and basic formatting.
- Speed: Insanely fast (100+ tokens/sec).
- VRAM: 1-2 GB.
The "Standard" Tier (7B - 14B Parameters)
- Examples: llama3:8b, mistral:7b, gemma2:9b.
- Intelligence: University level. Excellent general reasoning, creative writing, and basic math.
- Speed: Fast (40-60 tokens/sec on good hardware).
- VRAM: 5GB - 10GB.
The "Expert" Tier (30B - 70B Parameters)
- Examples: llama3:70b, command-r.
- Intelligence: PhD level. Complex logical reasoning, nuanced understanding of sarcasm, and large-scale architectural planning.
- Speed: Slow (2-10 tokens/sec).
- VRAM: 24GB - 48GB.
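The speed figures above depend heavily on your hardware, so don't take them on faith. Here is a minimal sketch for measuring throughput yourself via Ollama's REST API (it assumes an Ollama server on the default localhost:11434 and that the models in the loop are already pulled):

```python
# Minimal throughput check via Ollama's REST API (assumes a local server
# on the default port, with the models below already pulled).
import requests

def tokens_per_second(model: str, prompt: str = "Explain photosynthesis briefly.") -> float:
    """Generate once, then compute throughput from Ollama's eval stats."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = tokens generated; eval_duration is reported in nanoseconds.
    return data["eval_count"] / (data["eval_duration"] / 1e9)

for model in ["qwen:0.5b", "llama3:8b", "llama3:70b"]:
    print(f"{model}: {tokens_per_second(model):.1f} tokens/sec")
```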
Why Bigger Models Are "Smarter"
A Large Language Model is essentially a complex map of human knowledge.
- An 8B model is like a pocket encyclopedia. It has the facts, but it might miss the subtle connections between them.
- A 70B model is like a library with 1,000 librarians. It understands context, follows complex instructions better, and is much less likely to "hallucinate" (make things up).
The Speed-to-Brain Tradeoff
The most common mistake is thinking you need the 70B model.
If you are building an AI that categorizes incoming customer support emails as "Billing," "Tech Support," or "Shipping," an 8B model will do it perfectly and finish the task in 200 milliseconds. Running the 70B model for that same task would take 5 seconds and consume 10x the electricity, but the answer would be identical.
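Here is a minimal sketch of that email-routing task (it assumes a local Ollama server on the default port with llama3:8b pulled; swap in a 3B model and compare, the labels will almost certainly match):

```python
# Route support emails into fixed categories with a small local model.
# Assumes an Ollama server on localhost:11434 and llama3:8b already pulled.
import requests

CATEGORIES = ("Billing", "Tech Support", "Shipping")

def categorize(email_body: str, model: str = "llama3:8b") -> str:
    prompt = (
        "Classify this customer support email as exactly one of: "
        f"{', '.join(CATEGORIES)}. Reply with the category name only.\n\n"
        f"Email:\n{email_body}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    answer = resp.json()["response"].strip()
    # Guard against chatty output: fall back to a human-review queue
    # if the model didn't answer with one of the allowed labels.
    return answer if answer in CATEGORIES else "Tech Support"

print(categorize("I was charged twice for my subscription last month."))  # Billing
```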
Rule of Thumb:
Use the smallest model that can reliably perform the task.
What are "Variants"?
Sometimes you'll see llama3:8b-instruct-fp16 vs llama3:8b-instruct-q4.
- fp16: Full 16-bit precision. No compression; 100% of the model's brain.
- q4: Weights compressed to 4 bits, roughly 75% smaller. Saves ~75% of the RAM, but loses only ~1-2% of its intelligence.
For almost everyone, q4_K_M (the standard Ollama download) is the correct variant.
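The arithmetic behind those percentages is simple: an fp16 weight takes 2 bytes, while a 4-bit weight takes about half a byte, plus a little overhead for scaling metadata. A back-of-envelope sketch:

```python
# Back-of-envelope model size at different precisions. Real quantized
# files (e.g. q4_K_M) carry extra scaling metadata, so actual downloads
# run slightly larger than these raw numbers.
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("8B", 8.0), ("70B", 70.0)]:
    fp16 = model_size_gb(params, 16)
    q4 = model_size_gb(params, 4.5)  # q4_K_M averages roughly 4.5-5 bits/weight
    print(f"{name}: fp16 = {fp16:.0f} GB, q4 = {q4:.1f} GB")

# 8B:  fp16 = 16 GB,  q4 = 4.5 GB  -> matches the 5-10 GB "Standard" tier
# 70B: fp16 = 140 GB, q4 = 39.4 GB -> why the 70B tier needs 24-48 GB of VRAM
```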
Summary Table: Which Size for You?
| Task | Minimum Recommended Size | Why? |
|---|---|---|
| Summarizing a short text | 3B | Information is present, just needs reformatting. |
| Friendly Chatbot | 8B | Needs enough "brain" to sound human and not repeat itself. |
| Writing Complex Code | 8B+ (Specialized) | Coding requires strict logic and syntax rules. |
| Legal/Medical Analysis | 70B | Zero tolerance for error; needs top-end reasoning. |
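If you like seeing the rule of thumb in code, here is a hypothetical helper that encodes the table above (the task names and thresholds are just the rows restated, not any official API):

```python
# Hypothetical lookup encoding the table above: pick the smallest
# available model that meets each task's minimum size (in B of params).
MIN_SIZE_B = {
    "summarize_short_text": 3,
    "friendly_chatbot": 8,
    "complex_code": 8,          # plus: prefer a code-specialized model
    "legal_medical_analysis": 70,
}

def recommend(task: str, available_sizes=(0.5, 3, 8, 70)) -> float:
    minimum = MIN_SIZE_B[task]
    return min(size for size in available_sizes if size >= minimum)

print(recommend("friendly_chatbot"))  # -> 8
```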
Key Takeaways
- Intelligence scales with parameter count (the "B" in model names means billions of parameters).
- Speed drops and memory usage rises as model size increases.
- Quantization (variants) allows larger models to fit on smaller hardware.
- Select the smallest model that meets your accuracy requirement to maximize performance.