Inference Optimization: Quantization and Pruning

Make your AI models fast and affordable. Master the techniques of Quantization (FP16 to INT4) and Pruning to shrink model size without sacrificing intelligence.

You've built the world's smartest AI agent. But if it takes 30 seconds to respond and costs $2.00 per query, nobody will use it. As an LLM Engineer, your final duty is Optimization. You must make your models as small, fast, and cheap as possible without breaking their "Brain."

In this lesson, we cover the two primary techniques for shrinking models: Quantization and Pruning.


1. Quantization: Reducing Precision

The weights in a model are floating-point numbers. In their raw form they are FP32 (32-bit) or, for most modern checkpoints, FP16/BF16 (16-bit), meaning each number is stored with high precision.

Quantization is the process of rounding these numbers to a lower precision, like INT8 (8-bit) or even INT4 (4-bit).

The Analogy:

Imagine you have a high-resolution photo.

  • FP32: A 50-megapixel RAW file. (Beautiful, but huge).
  • INT4: A compressed JPEG. (Looks almost the same to the human eye, but 100x smaller).
In diagram form (Mermaid):

graph LR
    A["FP32 Weight (32 bits)"] --> B[Quantization Process]
    B --> C["INT4 Weight (4 bits)"]
    C --> D[Result: 8x Memory Reduction]
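
To make the rounding concrete, here is a minimal sketch of symmetric INT8 quantization in plain NumPy (illustrative only; real libraries use per-channel scales, calibration data, and more careful rounding):

import numpy as np

# A toy FP32 weight matrix standing in for one layer of a model
weights_fp32 = np.random.randn(4, 4).astype(np.float32)

# Symmetric quantization: map the largest absolute weight to the INT8 maximum (127)
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)  # 1 byte per weight instead of 4

# Dequantize to see how much precision the rounding cost us
weights_restored = weights_int8.astype(np.float32) * scale
print("Max rounding error:", np.abs(weights_fp32 - weights_restored).max())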

Why it matters:

An 8-billion parameter model in 16-bit takes ~16GB of VRAM. In 4-bit (using a quantized format like GGUF or EXL2), it takes only ~5GB. This allows you to run "Smart" models on "Cheaper" hardware.
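
You can sanity-check those numbers with simple weight-only arithmetic (a rough sketch; real deployments also need VRAM for the KV cache and runtime overhead):

def weight_vram_gb(params_in_billions: float, bits_per_weight: float) -> float:
    """Memory needed just to hold the weights, in gigabytes."""
    return params_in_billions * 1e9 * (bits_per_weight / 8) / 1e9

print(weight_vram_gb(8, 16))  # 16.0 -> ~16GB for an 8B model in FP16
print(weight_vram_gb(8, 4))   # 4.0  -> ~4GB of weights; ~5GB in practice with overhead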


2. Pruning: Cutting Dead Weights

A Large Language Model is highly "Sparse": not every neuron in a 70B parameter model is useful for every task. Pruning is the process of identifying "redundant" or "near-zero" weights and removing them entirely from the network.

  • Weight Pruning: Removing individual weights.
  • Structured Pruning: Removing entire layers or rows of neurons.

Result: A pruned model is faster at "Inference" (the forward pass) because there are fewer math operations to perform. The speedup is most direct with Structured Pruning, since the GPU can skip whole rows or layers, whereas scattered zeroed weights only pay off with sparse-aware kernels. A sketch of both styles follows below.
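
Here is a minimal sketch of both styles using PyTorch's built-in torch.nn.utils.prune on a single Linear layer (a toy example, not a full model-pruning pipeline):

import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(1024, 1024)

# Weight pruning: zero out the 30% of individual weights with the smallest magnitude
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Structured pruning: zero out 25% of entire output rows (neurons), ranked by L2 norm
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)

# Bake the combined mask into the weight tensor permanently
prune.remove(layer, "weight")

print("Fraction of zeroed weights:", (layer.weight == 0).float().mean().item())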


3. The Performance Trade-off

Optimization is never free. There is always a balance between Efficiency and Intelligence, usually measured as Perplexity (where lower is better).

| Precision | Model Size | Speed | Accuracy Loss |
|-----------|------------|-------|---------------|
| FP16 | 100% | Baseline | 0% |
| INT8 | 50% | 2x Faster | <1% |
| INT4 | 25% | 4x Faster | 1-3% |
| INT2 | 12% | 8x Faster | High (Model becomes "Dumb") |

LLM Engineer Strategy: Most production systems use INT4 or INT8. It provides the best "Bang for your Buck" in terms of speed vs. intelligence.
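
If you want to measure that "Accuracy Loss" yourself, perplexity is the standard yardstick. A minimal sketch with transformers (run it once on the full-precision model and once on the quantized version, then compare; the sample text is arbitrary):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # swap in whichever model you are evaluating
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "Quantization trades a little precision for a lot of memory."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the average next-token cross-entropy loss
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print("Perplexity:", torch.exp(loss).item())  # lower is better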


4. Popular Formats You Must Know

When you browse Hugging Face, you will see these terms. Here is what they mean:

  • GGUF: Designed by the llama.cpp team. Amazing for running models on CPU/Apple Silicon (see the loading sketch after this list).
  • GPTQ: A classic for GPU-based quantization.
  • AWQ (Activation-aware Weight Quantization): A newer method that is often more accurate than GPTQ at 4-bit.
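
For example, running a GGUF file locally usually goes through llama.cpp or its Python bindings. A hedged sketch, assuming you have installed llama-cpp-python and downloaded a quantized .gguf file (the path below is hypothetical):

# pip install llama-cpp-python
from llama_cpp import Llama

# Hypothetical local path; point it at any quantized GGUF file you have downloaded
llm = Llama(model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf")

output = llm("Explain quantization in one sentence:", max_tokens=64)
print(output["choices"][0]["text"])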

Code Concept: Loading a Quantized Model in Python

Using the bitsandbytes library, you can quantize a model "on the fly" as you load it into memory.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 1. Define the 4-bit config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4"  # NF4 (NormalFloat4), the 4-bit data type introduced with QLoRA
)

# 2. Load the model (the weights arrive in 4-bit, using roughly 75% less VRAM than FP16)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=quant_config
)
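
Once loaded, the 4-bit model is used exactly like a full-precision transformers model. A short usage sketch continuing from the code above (the prompt is arbitrary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
inputs = tokenizer("Quantization lets you", return_tensors="pt").to(model.device)

# Generate a short continuation with the quantized model loaded above
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))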

Summary

  • Quantization is the "compression" of the math (FP16 → INT4).
  • Pruning is the "removal" of useless parts of the brain.
  • Goal: Run larger models on smaller GPUs to save costs and reduce latency.
  • Industry Standard: 4-bit quantization (AWQ/GGUF) is the "sweet spot" for production.

In the next lesson, we will look at Model Serving, learning how to wrap these optimized models in high-speed APIs like vLLM.


Exercise: The Budget Architect

You have a budget for a single 16GB VRAM GPU. You want to run a 30-billion parameter model.

  • 30B parameters in FP16 require 60GB of VRAM.
  • 30B parameters in INT4 require 15GB of VRAM.

Question:

  1. Can you run this model on your hardware?
  2. If so, which precision MUST you use?

Answer: Yes, you can run it, but you MUST use INT4. This illustrates why quantization is an "Enablement" technology—it makes the impossible possible.
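
A quick sketch to verify the arithmetic (weight memory only, ignoring overhead):

def fits_in_vram(params_in_billions: float, bits_per_weight: float, vram_gb: float = 16) -> bool:
    weight_gb = params_in_billions * 1e9 * (bits_per_weight / 8) / 1e9
    return weight_gb <= vram_gb

print(fits_in_vram(30, 16))  # False -> 60GB of weights will not fit
print(fits_in_vram(30, 4))   # True  -> 15GB of weights squeezes into the 16GB card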
