
RLHF, DPO, and ORPO: Beyond Supervised Learning
Preference Optimization. Explore the techniques that allow models to learn from human choices (Better vs. Worse) rather than just imitating tokens.
Everything we have done so far in this course has been Supervised Fine-Tuning (SFT). In SFT, we tell the model: "If you see prompt X, say response Y."
But what if there are two correct answers, and one is just "Better" than the other? SFT cannot express the "Good vs. Better" distinction: every target it sees is treated as equally correct. For that, we need Preference Optimization.
Instead of showing the model one perfect answer, we show it a pair:
- Chosen ($y_w$): The better response.
- Rejected ($y_l$): The worse response.
The model learns to increase the probability of the chosen response and decrease the probability of the rejected one. In this lesson, we will explore the techniques that make modern models so "Human-like."
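To make "the probability of a response" concrete, here is a minimal sketch of how you could score a chosen and a rejected answer under the same model. This uses plain transformers; sequence_logprob is an illustrative helper (not a library function), and gpt2 is used only because it is small.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def sequence_logprob(prompt, response):
    # Sum of the log-probabilities the model assigns to the response tokens, given the prompt.
    # (Sketch only: assumes the prompt's tokenization is a prefix of the full sequence.)
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)   # predictions for tokens 1..T-1
    token_lp = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum().item()    # keep only the response tokens

chosen_score = sequence_logprob("Tell me a joke. ", "Why did the robot cross the road?")
rejected_score = sequence_logprob("Tell me a joke. ", "I don't know any jokes. Go away.")
# Preference optimization trains the model so chosen_score rises relative to rejected_score.
SFT would only ever push one of these scores up in isolation; preference optimization optimizes the gap between them.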
1. RLHF (The Legacy Giant)
Reinforcement Learning from Human Feedback (RLHF) was the technique that made ChatGPT possible.
- Stage 1 (SFT): Standard fine-tuning.
- Stage 2 (Reward Model): You train a separate model to "Score" responses based on human preferences.
- Stage 3 (PPO): You use Reinforcement Learning to update the student model so it maximizes the "Score" from the reward model.
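For reference, the two learned objectives in this pipeline are usually written as follows (standard notation: $r_\phi$ is the reward model, $\pi_\theta$ the policy being tuned, $\pi_{\text{ref}}$ the frozen SFT model, and $\sigma$ the logistic function):
$$\mathcal{L}_{\text{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$
$$\max_{\theta}\;\; \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\text{KL}}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\text{ref}}(\cdot \mid x)\big)$$
Stage 2 trains the reward model to prefer $y_w$ over $y_l$; Stage 3 chases that reward while the KL term keeps the policy close to the SFT model.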
Why we avoid it today: RLHF is notoriously unstable, hard to tune, and requires keeping several models in memory at once (the policy being trained, a frozen reference copy, and the reward model). It is realistic only if you have a massive compute budget.
2. DPO (Direct Preference Optimization)
In 2023, DPO changed the game. It showed that you can get results comparable to RLHF without a separate reward model and without reinforcement learning.
- The Magic: DPO turns the preference problem into a simple binary classification loss over each (chosen, rejected) pair.
- Efficiency: Training is nearly as fast and stable as SFT, while delivering the alignment quality associated with RLHF-tuned models. It is currently one of the most widely used techniques for production alignment.
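Concretely, DPO minimizes a single logistic loss over each preference pair, using the frozen SFT model $\pi_{\text{ref}}$ as an implicit reward model:
$$\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
The model is rewarded for raising the chosen response's likelihood relative to the reference model while lowering the rejected one's, with $\beta$ controlling how far it is allowed to drift from $\pi_{\text{ref}}$ — the same $\beta$ you will see in the code below.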
3. ORPO: The One-Step Evolution
One of the newest techniques is ORPO (Odds Ratio Preference Optimization, introduced in 2024).
- SFT + DPO in one: Usually, you have to do SFT first, then DPO.
- ORPO does both at the same time, and it needs no reference model. It adds a "Penalty" term to the standard SFT loss that discourages the model from ever outputting the "Rejected" style while it is learning the "Chosen" style.
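In the ORPO paper, that penalty is (roughly) an odds-ratio term added directly to the ordinary SFT loss on the chosen response:
$$\mathcal{L}_{\text{ORPO}} = \mathbb{E}_{(x,\,y_w,\,y_l)}\big[\mathcal{L}_{\text{SFT}}(y_w \mid x) + \lambda \cdot \mathcal{L}_{\text{OR}}\big], \qquad \mathcal{L}_{\text{OR}} = -\log \sigma\!\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right)$$
where $\text{odds}_\theta(y \mid x) = \dfrac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$ and $\lambda$ weights the penalty. Because the comparison uses the model's own odds rather than a ratio against a reference model, no frozen copy is needed.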
Visualizing the Preference Shift
graph TD
    A["User Prompt"] --> B["Base Model Projection"]
    subgraph "DPO Training"
        B -- "Chosen (Polite/Long)" --> C["Push Probability UP"]
        B -- "Rejected (Rude/Short)" --> D["Push Probability DOWN"]
    end
    C --> E["Aligned Model Behavior"]
    D --> E
Implementation: Setting up DPO with the TRL Library
Using the trl library from Hugging Face, we can set up a DPO trainer in minutes.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# 0. Load the SFT'd model and its tokenizer ("my-sft-model" is a placeholder path)
model = AutoModelForCausalLM.from_pretrained("my-sft-model")
tokenizer = AutoTokenizer.from_pretrained("my-sft-model")

# 1. Dataset must have 'prompt', 'chosen', and 'rejected' columns
dpo_dataset = Dataset.from_list([
    {
        "prompt": "Tell me a joke.",
        "chosen": "Why did the robot cross the road? To get to the updated firmware.",
        "rejected": "I don't know any jokes. Go away.",
    }
])

# 2. Define the DPO Trainer
# (Newer trl releases move `beta` into a DPOConfig and rename `tokenizer` to
#  `processing_class`; the layout below matches older releases.)
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # If None, trl uses a frozen copy of `model` as the reference
    args=TrainingArguments(
        output_dir="./dpo-output",
        learning_rate=5e-7,  # DPO uses VERY small learning rates
        per_device_train_batch_size=4,
        remove_unused_columns=False,  # keep the chosen/rejected columns for the DPO collator
    ),
    beta=0.1,  # The 'Alignment' strength
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)

dpo_trainer.train()
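If you would rather skip the separate SFT stage entirely, trl also ships an ORPOTrainer. The sketch below reuses the dataset and tokenizer from above and assumes a recent trl release where the hyperparameters live in an ORPOConfig; exact argument names (for example tokenizer vs. processing_class) vary between versions, and in practice you would start from the base checkpoint rather than an SFT'd one.
from trl import ORPOConfig, ORPOTrainer

# ORPO learns from the preference pairs directly: no prior SFT stage, no reference model.
orpo_trainer = ORPOTrainer(
    model=model,
    args=ORPOConfig(
        output_dir="./orpo-output",
        learning_rate=8e-6,             # illustrative value; larger than DPO's, since ORPO also does the SFT work
        per_device_train_batch_size=4,
        beta=0.1,                       # weight of the odds-ratio penalty (lambda in the paper)
    ),
    train_dataset=dpo_dataset,          # same prompt/chosen/rejected format
    tokenizer=tokenizer,
)
orpo_trainer.train()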
Summary and Key Takeaways
- Choice over Imitation: Preference optimization tells the model what not to do, whereas SFT only tells it what to do.
- DPO is the modern industry standard for aligning models without the complexity of RLHF.
- Beta ($\beta$): In DPO, the beta value controls how much you want the model to change from its base state. 0.1 is a common starting point.
- SFT First: Always perform SFT before DPO so the model already produces fluent, correctly formatted responses before you start tuning its preferences. (ORPO is the exception: it folds the SFT objective into the same training run.)
In the next lesson, we will go back to the data layer: Handling PII and Sensitive Data during Training.
Reflection Exercise
- Why is a learning rate of $5 \times 10^{-7}$ (microscopic) used for DPO compared to $2 \times 10^{-4}$ for SFT? (Hint: In DPO, we are fine-tuning a model that is already smart; are we trying to teach it brand new things or just 'nudge' its choices?)
- If you have a dataset where every "Chosen" answer is exactly 50 words and every "Rejected" answer is 5 words, what will the model learn? Is this a logic improvement or just a 'Verbosity' bias?
SEO Metadata & Keywords
Focus Keywords: what is dpo fine-tuning, RLHF vs DPO, ORPO preference optimization, TRL library tutorial, aligning language models.
Meta Description: Move beyond imitation. Learn how RLHF, DPO, and ORPO allow your models to learn from human choices, resulting in smarter, safer, and more helpful AI.