Module 3 Lesson 4: Model Sizes and Variants
Understanding the trade-offs of scale. Why a 70B model is smarter than an 8B model, and why you might not want to use it.
Model Sizes and Variants: Bigger Isn't Always Better
In the world of local LLMs, you are constantly trading off three things: Intelligence, Speed, and VRAM. Understanding "Scaling Laws" will help you decide which model is "good enough" for your specific use case.
The Intelligence Spectrum
The "Tiny" Tier (0.5B - 3B Parameters)
- Examples: qwen:0.5b, phi3:mini, tinyllama.
- Intelligence: High-school level. Good for simple summarization and basic formatting.
- Speed: Insanely fast (100+ tokens/sec).
- VRAM: 1-2 GB.
The "Standard" Tier (7B - 14B Parameters)
- Examples: llama3:8b, mistral:7b, gemma2:9b.
- Intelligence: University level. Excellent general reasoning, creative writing, and basic math.
- Speed: Fast (40-60 tokens/sec on good hardware).
- VRAM: 5GB - 10GB.
The "Expert" Tier (30B - 70B Parameters)
- Examples: llama3:70b, command-r.
- Intelligence: PhD level. Complex logical reasoning, nuanced understanding of sarcasm, and large-scale architectural planning.
- Speed: Slow (2-10 tokens/sec).
- VRAM: 24GB - 48GB.
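The speed figures above depend heavily on your hardware, so don't take them on faith. Here is a minimal sketch for measuring throughput yourself via Ollama's REST API (it assumes an Ollama server on the default localhost:11434 and that the models in the loop are already pulled):

```python
# Minimal throughput check via Ollama's REST API (assumes a local server
# on the default port, with the models below already pulled).
import requests

def tokens_per_second(model: str, prompt: str = "Explain photosynthesis briefly.") -> float:
    """Generate once, then compute throughput from Ollama's eval stats."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = tokens generated; eval_duration is reported in nanoseconds.
    return data["eval_count"] / (data["eval_duration"] / 1e9)

for model in ["qwen:0.5b", "llama3:8b", "llama3:70b"]:
    print(f"{model}: {tokens_per_second(model):.1f} tokens/sec")
```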
Why Bigger Models Are "Smarter"
A Large Language Model is essentially a complex map of human knowledge.
- An 8B model is like a pocket encyclopedia. It has the facts, but it might miss the subtle connections between them.
- A 70B model is like a library with 1,000 librarians. It understands context, follows complex instructions better, and is much less likely to "hallucinate" (make things up).
The Speed-to-Brain Tradeoff
The most common mistake is thinking you need the 70B model.
If you are building an AI that categorizes incoming customer support emails as "Billing," "Tech Support," or "Shipping," an 8B model will do it perfectly and finish the task in 200 milliseconds. Running the 70B model for that same task would take 5 seconds and consume 10x the electricity, but the answer would be identical.
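Here is a minimal sketch of that email-routing task (it assumes a local Ollama server on the default port with llama3:8b pulled; swap in a 3B model and compare, the labels will almost certainly match):

```python
# Route support emails into fixed categories with a small local model.
# Assumes an Ollama server on localhost:11434 and llama3:8b already pulled.
import requests

CATEGORIES = ("Billing", "Tech Support", "Shipping")

def categorize(email_body: str, model: str = "llama3:8b") -> str:
    prompt = (
        "Classify this customer support email as exactly one of: "
        f"{', '.join(CATEGORIES)}. Reply with the category name only.\n\n"
        f"Email:\n{email_body}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    answer = resp.json()["response"].strip()
    # Guard against chatty output: fall back to a human-review queue
    # if the model didn't answer with one of the allowed labels.
    return answer if answer in CATEGORIES else "Tech Support"

print(categorize("I was charged twice for my subscription last month."))  # Billing
```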
Rule of Thumb:
Use the smallest model that can reliably perform the task.
What are "Variants"?
Sometimes you'll see llama3:8b-instruct-fp16 vs llama3:8b-instruct-q4.
- fp16: Full 16-bit precision. No compression; 100% of the model's brain.
- q4: Weights compressed to 4 bits, roughly 75% smaller. Saves ~75% of the RAM, but loses only ~1-2% of its intelligence.
For almost everyone, q4_K_M (the standard Ollama download) is the correct variant.
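The arithmetic behind those percentages is simple: an fp16 weight takes 2 bytes, while a 4-bit weight takes about half a byte, plus a little overhead for scaling metadata. A back-of-envelope sketch:

```python
# Back-of-envelope model size at different precisions. Real quantized
# files (e.g. q4_K_M) carry extra scaling metadata, so actual downloads
# run slightly larger than these raw numbers.
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("8B", 8.0), ("70B", 70.0)]:
    fp16 = model_size_gb(params, 16)
    q4 = model_size_gb(params, 4.5)  # q4_K_M averages roughly 4.5-5 bits/weight
    print(f"{name}: fp16 = {fp16:.0f} GB, q4 = {q4:.1f} GB")

# 8B:  fp16 = 16 GB,  q4 = 4.5 GB  -> matches the 5-10 GB "Standard" tier
# 70B: fp16 = 140 GB, q4 = 39.4 GB -> why the 70B tier needs 24-48 GB of VRAM
```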
Summary Table: Which Size for You?
| Task | Minimum Recommended Size | Why? |
|---|---|---|
| Summarizing a short text | 3B | Information is present, just needs reformatting. |
| Friendly Chatbot | 8B | Needs enough "brain" to sound human and not repeat itself. |
| Writing Complex Code | 8B+ (Specialized) | Coding requires strict logic and syntax rules. |
| Legal/Medical Analysis | 70B | Zero tolerance for error; needs top-end reasoning. |
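If you like seeing the rule of thumb in code, here is a hypothetical helper that encodes the table above (the task names and thresholds are just the rows restated, not any official API):

```python
# Hypothetical lookup encoding the table above: pick the smallest
# available model that meets each task's minimum size (in B of params).
MIN_SIZE_B = {
    "summarize_short_text": 3,
    "friendly_chatbot": 8,
    "complex_code": 8,          # plus: prefer a code-specialized model
    "legal_medical_analysis": 70,
}

def recommend(task: str, available_sizes=(0.5, 3, 8, 70)) -> float:
    minimum = MIN_SIZE_B[task]
    return min(size for size in available_sizes if size >= minimum)

print(recommend("friendly_chatbot"))  # -> 8
```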
Key Takeaways
- Intelligence scales with parameter count (the "B" in model names means billions of parameters).
- Speed drops and memory usage rises as model size increases.
- Quantization (variants) allows larger models to fit on smaller hardware.
- Select the smallest model that meets your accuracy requirement to maximize performance.