Module 9 Lesson 2: Full Fine-Tuning vs. PEFT


How do you customize a 70-billion-parameter model on a single GPU? In this lesson, we learn about LoRA and PEFT: the breakthroughs that democratized AI fine-tuning.


In the early days of LLMs, if you wanted to fine-tune a model, you had to update every single weight inside it. If the model had 70 billion parameters, you needed massive supercomputers to hold all that data in memory. This is called Full Fine-Tuning.

Today, almost no one does that. Instead, we use Parameter-Efficient Fine-Tuning (PEFT). In this lesson, we explore why this change happened and focus on the king of PEFT: LoRA.


1. The Memory Problem

Full Fine-Tuning requires holding the model's weights, a gradient for every single parameter, and the optimizer's running statistics (for Adam, two extra values per parameter) in memory at the same time. For a 70B model, this can add up to roughly 1.5 terabytes of VRAM.

  • Cost: $10,000+ for a single training run.
  • Storage: Every time you fine-tune, you have to save a brand-new copy of the model's weights, roughly 140 GB for a 70B model in 16-bit precision.
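
As a rough sanity check, here is a back-of-the-envelope estimate in Python. The byte counts below assume mixed-precision training with the Adam optimizer (16-bit weights and gradients, 32-bit master weights and optimizer moments) and ignore activations, so the real footprint is higher still.

params = 70e9          # 70 billion parameters

bytes_per_param = (
    2    # 16-bit weights
    + 2  # 16-bit gradients
    + 4  # 32-bit master copy of the weights
    + 4  # Adam first moment (momentum)
    + 4  # Adam second moment (variance)
)

total_tb = params * bytes_per_param / 1e12
print(f"~{total_tb:.2f} TB of VRAM before activations")   # ~1.12 TB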

2. The LoRA Solution (Low-Rank Adaptation)

LoRA is the clever trick that allows us to fine-tune while updating roughly 1% (or less) of the parameters, which slashes the extra memory needed for gradients and optimizer states.

The Post-it Note Analogy: Imagine you have a 1,000-page textbook (The Base Model). You want to update it for a specific law exam.

  • Full FT: You reprint all 1,000 pages with slight changes on every page.
  • LoRA: You leave the textbook alone. Instead, you stick small Post-it Notes on the most important pages.

When you read the book, you look at the master page and then peek at the Post-it Note to see the update. Because you only had to write on 10 Post-it Notes (instead of reprinting 1,000 pages), the process is incredibly fast.

graph LR
    Input["Input Data"] --> Base["Frozen Base Model (Matrix W)"]
    Input --> LoRA["Small Adapter (Matrices B × A)"]
    Base --> Output1["Base Result"]
    LoRA --> Output2["Delta (Adjustment)"]
    Output1 & Output2 --> Sum["Final Output"]
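
In code, the idea is just a small low-rank product added to the frozen weight's output. Here is a minimal NumPy sketch; the hidden size and rank below are arbitrary illustrative choices, not values from the lesson.

import numpy as np

d = 4096   # hidden size of one weight matrix in the base model (illustrative)
r = 8      # LoRA rank (illustrative)

W = np.random.randn(d, d)          # frozen base weight: d*d parameters, never updated
A = np.random.randn(r, d) * 0.01   # small trainable "down" matrix
B = np.zeros((d, r))               # small trainable "up" matrix (starts at zero)

x = np.random.randn(d)             # one input vector

# Forward pass: base result plus the low-rank adjustment (the "Post-it Note")
y = W @ x + B @ (A @ x)

full_params = d * d                # what full fine-tuning would train
lora_params = d * r + r * d        # what LoRA trains
print(f"{lora_params:,} trainable vs {full_params:,} "
      f"({100 * lora_params / full_params:.2f}% of the original)")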

3. QLoRA: Fine-Tuning on your Laptop

QLoRA takes this even further by "compressing" the frozen original model (Quantization) down to 4-bit precision while it sits in memory; only the small LoRA adapters are trained, and they stay in higher precision.

  • This allows you to take a massive model like Llama-3-70B and fine-tune it on a single professional GPU (around 48 GB of VRAM), while models in the 7B-13B range fit on a high-end consumer card like an RTX 3090/4090.
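
In practice this is usually done with the Hugging Face transformers, peft, and bitsandbytes libraries. The sketch below shows the general shape of a QLoRA setup; the model name, rank, and target modules are assumptions for illustration, and exact argument names can vary between library versions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the frozen base model to 4 bits as it is loaded
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",        # illustrative checkpoint name
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters on top of the 4-bit base
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which layers get adapters (assumption)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all parameters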

4. Key Benefits of Adapters (LoRA)

  1. Tiny Storage: Instead of saving a 140GB model, you only save a 100MB "Adapter" file.
  2. Swappability: You can keep one base model in memory and "swap" adapters instantly. You could have a "Lawyer Adapter," a "Creative Writing Adapter," and a "Coding Adapter" all using the same underlying hardware (see the sketch after this list).
  3. Stability: Because you aren't changing the base weights, the model is less likely to "Catastrophically Forget" its general knowledge.
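
Here is a hedged sketch of that swapping workflow with the peft library; the model name, adapter paths, and adapter names are hypothetical.

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("my-org/base-model")   # placeholder name

# Attach one adapter, then load the others onto the same frozen base
model = PeftModel.from_pretrained(base, "adapters/lawyer", adapter_name="lawyer")
model.load_adapter("adapters/creative-writing", adapter_name="creative")
model.load_adapter("adapters/coding", adapter_name="coding")

# Switch behavior instantly, without reloading 140 GB of weights
model.set_adapter("coding")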

Lesson Exercise

Goal: Compare the scale of the two methods.

  1. A model has 100,000,000 parameters.
  2. Full Fine-Tuning: You must update all 100 million.
  3. PEFT (LoRA): You only update 100,000 (0.1%).
  4. If it takes 1 second to update 1,000 parameters, how much time do you save with LoRA? (A worked answer follows below.)
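
Here is the arithmetic at the rate stated in step 4, worked out in Python:

full_params = 100_000_000
lora_params = 100_000
rate = 1_000                         # parameters updated per second

full_time = full_params / rate       # 100,000 s  (about 27.8 hours)
lora_time = lora_params / rate       # 100 s      (under 2 minutes)
saved = full_time - lora_time

print(f"Time saved: {saved:,.0f} seconds (~{saved / 3600:.1f} hours)")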

Observation: The sheer mathematical efficiency of LoRA is what allowed the "Open Source AI" revolution to happen in 2023 and 2024.


Summary

In this lesson, we established:

  • Full fine-tuning is expensive and memory-intensive.
  • LoRA (PEFT) allows customization by adding small "adapter" matrices instead of changing the whole model.
  • Adapters are small, swappable, and incredibly efficient to store and train.

Next Lesson: We look at the "Bread and Butter" of fine-tuning: Datasets. We'll learn how to format your data and why 1,000 great examples are better than 100,000 mediocre ones.
