Module 7 Wrap-up: The Optimization Challenge

Hands-on: Benchmarking your machine. Compare quantization levels and measure memory usage in real time.

Module 7 Wrap-up: The Performance Lab

You have toured the "Engine Room" of your local AI. You know how to manage disk space, optimize memory, tune the context window, and process data in bulk. Now, let’s see the real-world impact of these changes.


Hands-on Exercise: The "Context Stress Test"

We are going to find the exact point where your machine runs out of steam.

1. The Small Start

Create a Modelfile called LiteBot:

FROM llama3
PARAMETER num_ctx 2048

Run it and measure the time to first token: the pause before the first word of the reply appears.
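
If you want a repeatable way to do this, the commands below are one option. The model and file names follow this exercise, and the --verbose flag (available on recent Ollama builds) prints load and evaluation timings after each reply:

ollama create litebot -f ./LiteBot   # build the model from the Modelfile
ollama run litebot --verbose         # chat with timing statistics printed after each answer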

2. The Heavy Load

Create a Modelfile called HeavyBot:

FROM llama3
PARAMETER num_ctx 32768

Run it. Now, paste a very large document (e.g., a 10-page academic paper or code file) and ask for a summary.

  • Observation: Note the delay before the model starts responding, and watch your VRAM usage while it works (the monitoring commands below can help).
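
One way to watch this from a second terminal (the GPU line assumes NVIDIA hardware; other vendors have their own monitors):

ollama ps              # shows loaded models and how much of each sits on GPU vs. CPU
watch -n 1 nvidia-smi  # refreshes VRAM usage every second (NVIDIA only; Ctrl+C to exit)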

3. The Resolution

If HeavyBot crashed or ran roughly ten times slower than LiteBot, you have found your Hardware Context Ceiling. For your specific machine, you now know to stay below that context size for stable production work.
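
To turn "slower" into an actual number, you can approximate time to first token over Ollama's local REST API. A minimal sketch, assuming the server listens on its default port (11434) and the two models built in this exercise; curl's time_starttransfer records when the first streamed byte arrives:

for model in litebot heavybot; do
  curl -s -o /dev/null \
    -w "$model: first token after %{time_starttransfer}s\n" \
    -d "{\"model\": \"$model\", \"prompt\": \"Summarize local LLM optimization in one sentence.\", \"stream\": true}" \
    http://localhost:11434/api/generate
done

The first run of each model includes load time, so run the loop twice and compare the second pass if you want a warm-cache comparison.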


Module 7 Summary

  • Caching keeps models "hot" in RAM for instant reuse.
  • The OLLAMA_MODELS variable is your friend for moving model storage to an external drive.
  • VRAM optimization means closing other graphics-heavy apps.
  • Context window tuning is the most effective way to balance memory use against utility.
  • Batch processing is boosted by the num_batch and OLLAMA_NUM_PARALLEL settings (see the example after this list).
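
As a concrete illustration of the storage and parallelism settings above (the path and values are placeholders; the variables must be set in the environment of the Ollama server before it starts):

export OLLAMA_MODELS=/mnt/external/ollama-models   # keep model weights on a larger drive
export OLLAMA_NUM_PARALLEL=4                       # serve up to four requests concurrently
ollama serve                                       # start the server with these settings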

Coming Up Next...

In Module 8, we finally start building apps. We will connect Ollama to Python, JavaScript, and LangChain to build the foundations of a "Private Local AI Platform."


Module 7 Checklist

  • I have used ollama ps to see which models are in RAM.
  • I checked my OLLAMA_MODELS path and know where my space is going.
  • I tested the speed difference between small (2k) and large (32k) context windows.
  • I closed my GPU-heavy apps and saw an increase in tokens per second.
  • I can explain the keep_alive parameter to a teammate (the request below makes a handy demo).
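
If you want a concrete demo of keep_alive in front of you while you explain it, this request (against the default local endpoint) keeps llama3 resident for 30 minutes after it answers; a value of 0 would unload it immediately:

curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Hello", "keep_alive": "30m"}'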
