Module 4 Wrap-up: The Efficiency Lab
Hands-on: Benchmarking your machine. Compare quantization levels and measure memory usage in real time.
You've learned the theory: Transformers, Quantization, GGUF, and Context. Now, let’s see these concepts in action on your own hardware. We are going to perform a "Stress Test" to find your machine's breaking point.
Hands-on Exercise: The Quantization Test
We are going to compare two versions of the same model: one "Squeezed" (Q4) and one "Heavy" (Q8).
- Download the two versions:

```
ollama pull llama3:8b-instruct-q4_K_M
ollama pull llama3:8b-instruct-q8_0
```

- Measure Disk Space: Run `ollama list` and note the size difference. The Q8 file should be noticeably larger (close to double the Q4 file), because Q8_0 stores about 8 bits per weight while Q4_K_M averages closer to 4-5 bits.
- Measure RAM/VRAM:
  - Open your Task Manager (Windows) or Activity Monitor (Mac).
  - Run `ollama run llama3:8b-instruct-q4_K_M` and watch the memory spike.
  - Close it with `/bye`.
  - Run `ollama run llama3:8b-instruct-q8_0` and watch the memory spike again. You can also read the usage straight from the terminal; see the `ollama ps` sketch after this list.
- Observe: Did the Q8 model cause your computer to lag?
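If you prefer hard numbers to watching a graph, Ollama can report its own memory usage. A minimal sketch, assuming a reasonably recent Ollama CLI (the exact columns may vary by version):

```
# While a model is still loaded, ask Ollama what is running and how big it is:
ollama ps
# Typical columns: NAME, ID, SIZE, PROCESSOR, UNTIL.
# SIZE is the loaded memory footprint; PROCESSOR shows whether the model sits
# fully on the GPU ("100% GPU") or is split between CPU and GPU.
```

Run it once with the Q4 model loaded and once with the Q8 model; the gap in SIZE should roughly mirror the file-size gap you saw with `ollama list`.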
The "Context Window" Test
Now, let's see how memory grows as you talk.
- Start any model.
- Type `/show info` and look at the current context limit.
- Ask the model to: "Write a 2,000-word essay on the history of Rome."
- As it types, watch your Memory (VRAM) in your system monitor. Notice how it slowly creeps up as the "Context" fills up.
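Eyeballing the system monitor works, but Ollama can also print timing statistics for each reply. A minimal sketch, assuming a recent Ollama CLI (the exact wording of the stats block may differ between versions):

```
# Start a chat with verbose stats; a timing summary prints after every reply.
ollama run llama3:8b-instruct-q4_K_M --verbose

# Inside the chat you can raise the context window and repeat the essay test:
#   /set parameter num_ctx 8192
# In the stats, "prompt eval rate" is how fast your prompt/context was read in,
# while "eval rate" is how fast new tokens were generated.
```

This is the same "Prompt Eval" versus "Eval" split mentioned in the checklist below: the first measures how quickly the model ingests context, the second how quickly it writes.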
Module 4 Summary
- Transformers use "Attention" to decide which parts of the input matter when predicting the next token.
- Quantization makes the models small enough for home computers.
- GGUF is the container that holds the model weights and all their metadata (see the `ollama show` sketch after this list).
- Tokens are the units of currency in the AI world.
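If you want to see that metadata without opening the GGUF file yourself, Ollama can print what it knows about a local model. A quick sketch (field names and exact output vary by Ollama version):

```
# Print the details Ollama read from the model's metadata:
ollama show llama3:8b-instruct-q4_K_M
# Look for fields such as parameters, context length, and quantization:
# the same ideas this module has been circling.
```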
Coming Up Next...
In Module 5, we move from "Using" to "Creating." We will learn how to write Modelfiles to build your own custom AI personas, bake in system prompts, and change how Ollama behaves forever.
Module 4 Checklist
- I can explain the difference between 4-bit and 8-bit quantization.
- I know why GPUs are faster than CPUs for Transformers.
- I have measured the memory usage of at least one model.
- I know how to check the context limit of an active model.
- I have seen the difference between "Prompt Eval" and "Eval" speeds.