Module 4 Wrap-up: The Efficiency Lab
Hands-on: Benchmarking your machine. Compare quantization levels and measure memory usage in real time.
You've learned the theory: Transformers, Quantization, GGUF, and Context. Now, let’s see these concepts in action on your own hardware. We are going to perform a "Stress Test" to find your machine's breaking point.
Hands-on Exercise: The Quantization Test
We are going to compare two versions of the same model: one "Squeezed" (Q4) and one "Heavy" (Q8).
- Download the two versions:

```
ollama pull llama3:8b-instruct-q4_K_M
ollama pull llama3:8b-instruct-q8_0
```

- Measure Disk Space: Run `ollama list` and note the size difference. The Q8 file should be noticeably larger (close to double the Q4 file), because Q8_0 stores about 8 bits per weight while Q4_K_M averages closer to 4-5 bits.
- Measure RAM/VRAM:
  - Open your Task Manager (Windows) or Activity Monitor (Mac).
  - Run `ollama run llama3:8b-instruct-q4_K_M` and watch the memory spike.
  - Close it with `/bye`.
  - Run `ollama run llama3:8b-instruct-q8_0` and watch the memory spike again. You can also read the usage straight from the terminal; see the `ollama ps` sketch after this list.
- Observe: Did the Q8 model cause your computer to lag?
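If you prefer hard numbers to watching a graph, Ollama can report its own memory usage. A minimal sketch, assuming a reasonably recent Ollama CLI (the exact columns may vary by version):

```
# While a model is still loaded, ask Ollama what is running and how big it is:
ollama ps
# Typical columns: NAME, ID, SIZE, PROCESSOR, UNTIL.
# SIZE is the loaded memory footprint; PROCESSOR shows whether the model sits
# fully on the GPU ("100% GPU") or is split between CPU and GPU.
```

Run it once with the Q4 model loaded and once with the Q8 model; the gap in SIZE should roughly mirror the file-size gap you saw with `ollama list`.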
The "Context Window" Test
Now, let's see how memory grows as you talk.
- Start any model.
- Type `/show info` and look at the current context limit.
- Ask the model to: "Write a 2,000-word essay on the history of Rome."
- As it types, watch your Memory (VRAM) in your system monitor. Notice how it slowly creeps up as the "Context" fills up.
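Eyeballing the system monitor works, but Ollama can also print timing statistics for each reply. A minimal sketch, assuming a recent Ollama CLI (the exact wording of the stats block may differ between versions):

```
# Start a chat with verbose stats; a timing summary prints after every reply.
ollama run llama3:8b-instruct-q4_K_M --verbose

# Inside the chat you can raise the context window and repeat the essay test:
#   /set parameter num_ctx 8192
# In the stats, "prompt eval rate" is how fast your prompt/context was read in,
# while "eval rate" is how fast new tokens were generated.
```

This is the same "Prompt Eval" versus "Eval" split mentioned in the checklist below: the first measures how quickly the model ingests context, the second how quickly it writes.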
Module 4 Summary
- Transformers use "Attention" to decide which parts of the input matter when predicting the next token.
- Quantization makes the models small enough for home computers.
- GGUF is the container that holds the model weights and all their metadata (see the `ollama show` sketch after this list).
- Tokens are the units of currency in the AI world.
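If you want to see that metadata without opening the GGUF file yourself, Ollama can print what it knows about a local model. A quick sketch (field names and exact output vary by Ollama version):

```
# Print the details Ollama read from the model's metadata:
ollama show llama3:8b-instruct-q4_K_M
# Look for fields such as parameters, context length, and quantization:
# the same ideas this module has been circling.
```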
Coming Up Next...
In Module 5, we move from "Using" to "Creating." We will learn how to write Modelfiles to build your own custom AI personas, bake in system prompts, and change how Ollama behaves forever.
Module 4 Checklist
- I can explain the difference between 4-bit and 8-bit quantization.
- I know why GPUs are faster than CPUs for Transformers.
- I have measured the memory usage of at least one model.
- I know how to check the context limit of an active model.
- I have seen the difference between "Prompt Eval" and "Eval" speeds.