
The Agent Engine: Hardware and Optimization
Master the physical layer of AI. Learn about GPU selection, RAM requirements, and the technical magic of quantization to run massive models on consumer hardware.
Hardware Requirements and Optimization
Running local agents is a "Gears" problem. You need to match the weight of your model (The Gears) to the power of your hardware (The Engine). If you try to run a 70B model on a 4GB GPU, it won't just be slow; it will either fail to load or crawl at unusable speeds.
In this lesson, we will look at the specific hardware requirements for agentic systems and the mathematical trick called Quantization that allows us to run large models on small devices.
1. The "VRAM First" Rule
In AI, Video RAM (VRAM) is more important than CPU speed.
- When an LLM runs, it loads its entire "Mind" (the weights) into VRAM.
- If the weights are 10GB and your GPU has 8GB, Ollama will try to push the extra 2GB into your system RAM (DDR4/5), which is 10-100x slower. This is the primary reason for "0.5 tokens per second" performance. (A quick way to check your free VRAM is sketched below.)
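Before pulling a model, it helps to know exactly how much VRAM you have free. Here is a minimal sketch for NVIDIA cards, assuming the nvidia-ml-py package (imported as pynvml) is installed; on Apple Silicon the GPU shares unified system memory, so this check does not apply. The MODEL_SIZE_GB value is a hypothetical placeholder.

```python
# Minimal VRAM pre-flight check for NVIDIA GPUs.
# Assumes: pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)    # values in bytes

free_gb = mem.free / 1024**3
total_gb = mem.total / 1024**3
print(f"VRAM: {free_gb:.1f} GB free of {total_gb:.1f} GB")

MODEL_SIZE_GB = 4.8  # hypothetical: an 8B model at 4-bit quantization
if free_gb < MODEL_SIZE_GB:
    print("Warning: weights will spill into system RAM (10-100x slower).")

pynvml.nvmlShutdown()
```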
2. Quantization: The Great Compressor
Quantization is the process of reducing the precision of the model's weights.
- Original (FP16): Every weight is a 16-bit number. (High precision, large file).
- Quantized (Q4_K_M): Every weight is compressed to roughly 4 bits. (Memory usage drops by about 70%, while benchmark quality typically drops by only ~1-2%).
Key Insight: You should always aim for 4-bit or 5-bit quantization. It is the "Sweet Spot" for local agents.
- An 8B model (Full size: 16GB) becomes roughly 4.8GB at 4-bit quantization. It can now fit on almost any modern laptop (see the sketch below).
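To make the arithmetic concrete, here is a tiny sketch of the size calculation. The bits-per-weight values are approximate averages for GGUF quantization formats (the K-quants store per-block scales, so Q4_K_M averages closer to ~4.85 bits than exactly 4), which is why the 8B figure lands near 4.8GB rather than 4GB.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB."""
    # billions of weights * bits each / 8 bits-per-byte = gigabytes
    return params_billions * bits_per_weight / 8

for label, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    print(f"8B @ {label:7s}: {weight_size_gb(8, bits):.1f} GB")
# 8B @ FP16   : 16.0 GB
# 8B @ Q8_0   :  8.5 GB
# 8B @ Q4_K_M :  4.9 GB
```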
3. Recommended Hardware Configurations
| Level | Goal | Hardware |
|---|---|---|
| Entry | Simple 8B Agents | MacBook Air (M1/16GB) or PC with RTX 3060 (12GB VRAM). |
| Professional | High-Speed 70B Agents | MacBook Pro (M3 Max / 64GB+) or PC with 2x RTX 3090. |
| Enterprise | Multi-Agent Swarms | Dedicated Server with 8x NVIDIA A100 or H100. |
4. Optimization: Context Pruning (KV Cache)
Every token in the conversation history consumes VRAM.
- If you have an agent with a 32k context window, that cache alone can take up 4-8GB of VRAM in addition to the model weights.
- Optimization: Enable Flash Attention (in Ollama, set OLLAMA_FLASH_ATTENTION=1 in the environment) and aggressively prune your history (Module 3.3). The sketch below estimates how large the cache actually gets.
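Here is a back-of-the-envelope sketch of why the cache gets so big. The per-token cost is two tensors (K and V) per layer; the geometry below is Llama-3-8B's published configuration (32 layers, 8 grouped-query KV heads, head dimension 128), and real runtimes add some overhead on top.

```python
def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Estimate KV cache size: K and V tensors per layer, per token (FP16 = 2 bytes)."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_len * per_token_bytes / 1024**3

# Llama-3-8B geometry at a full 32k context window
print(f"{kv_cache_gb(32_768, n_layers=32, n_kv_heads=8, head_dim=128):.1f} GB")  # 4.0 GB
```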
5. System Optimization (OS Level)
If you are running agents on a Linux or Windows machine:
- Disable Video Output: If your GPU is also driving your 4K monitor, you are "wasting" 1-2GB of VRAM on the desktop compositor and display buffers. Drive the display from integrated graphics if your CPU has it.
- Increase Swap Space: This prevents the system from crashing if the agent goes slightly over the memory limit.
- Cooling: Local agents can run the GPU at 100% for minutes at a time. Ensure you have high-quality fans or water cooling to prevent Thermal Throttling. (A quick pre-flight check for swap and temperatures is sketched below.)
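Parts of this checklist can be automated. This is a rough sketch using the psutil package (temperature sensors are reported on Linux only); treat the 8GB swap threshold as an illustrative assumption, not an official limit.

```python
# Rough pre-flight check: swap headroom and thermal state.
# Assumes: pip install psutil
import psutil

swap = psutil.swap_memory()
print(f"Swap: {swap.total / 1024**3:.1f} GB total, {swap.percent:.0f}% used")
if swap.total < 8 * 1024**3:
    print("Consider increasing swap to survive small memory overruns.")

if hasattr(psutil, "sensors_temperatures"):  # Linux only
    for chip, entries in psutil.sensors_temperatures().items():
        for sensor in entries:
            print(f"{chip}/{sensor.label or 'sensor'}: {sensor.current}°C")
```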
6. Determining the "Perfect" Model Size
Use this formula: VRAM needed (GB) = (Parameter count in billions * Quantization bits / 8) + 1GB buffer.
Example: Llama-3 70B @ 4-bit.
- 70 * 4 / 8 = 35GB of weights.
- You need at least 36GB of VRAM to run this model. (An RTX 4090 with 24GB is not enough!)
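The same formula as a sanity-check helper, so you can test a candidate model before downloading it. Note that this is a floor: it deliberately ignores the KV cache from Section 4, which grows with your context window.

```python
def vram_needed_gb(params_billions: float, quant_bits: float,
                   buffer_gb: float = 1.0) -> float:
    """VRAM needed = (parameter count in billions * quantization bits / 8) + buffer."""
    return params_billions * quant_bits / 8 + buffer_gb

print(vram_needed_gb(70, 4))  # 36.0 -> a 24GB RTX 4090 is not enough
print(vram_needed_gb(8, 4))   # 5.0  -> fits comfortably on a 12GB card
```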
Summary and Mental Model
Think of the LLM as a Digital Waterbed.
- Quantization is like sucking the air out to make it smaller.
- The GPU is the frame. If the bed is too big for the frame, it spills over and makes a mess (System Latency).
Measure your frame before you buy your bed.
Exercise: Hardware Audit
- The Math: You have an NVIDIA RTX 4070 with 12GB of VRAM.
  - Can you run an 8B model? (How much VRAM is left for context?)
  - Can you run a 34B model at 4-bit? (Do the math!)
- Strategy: Why is a Mac Studio with 128GB of RAM often better for "Deep Reasoning" agents than a PC with a single $2,000 GPU?
- Troubleshooting: If your agent starts at 50 tokens per second but slows down to 2 tokens per second after 10 messages, what is happening? (Hint: Look up "Context Overfill" and "KV Cache".)

You've mastered the hardware. Now, let's look at the "Soul" of the agent: Long-Term Memory.