Hyperparameters: Learning Rate, Batch Size, and Epochs

Mastering the Knobs. Learn how to tune the three most critical parameters of fine-tuning to find the balance between 'Slow Learning' and 'Catastrophic Forgetting'.

Hyperparameters: Mastering the Knobs

Fine-tuning isn't a "set and forget" process. It’s more like cooking. You have the ingredients (data) and the oven (GPU), but you have to control the temperature (Learning Rate), the portion size (Batch Size), and the cooking time (Epochs).

Get these "Hyperparameters" right, and your model will be an expert. Get them wrong, and your model will either be undercooked (it learns nothing) or burnt (it forgets how to speak English because it's so obsessed with your training data).

In this lesson, we will master the three primary knobs of fine-tuning.


1. Learning Rate (The Temperature)

The Learning Rate (LR) is the most important hyperparameter. It determines how large of a step the model takes during gradient descent to update its weights.

  • High Learning Rate ($> 1e-3$): The model learns very fast but often "Overshoots" the target. It’s like a car moving too fast to see the street signs—it misses the turn.
  • Low Learning Rate ($< 1e-6$): The model learns very precisely but very slowly. It might never reach the target because it’s taking microscopic steps.
  • The Sweet Spot: For most fine-tuning tasks (using LoRA/PEFT), the sweet spot is between $5e-5$ and $2e-4$.
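The effect of the step size is easy to see on a toy problem. The following is an illustrative sketch (not part of the lesson's training code): plain gradient descent on f(w) = (w - 3)^2, run with a moderate, a too-high, and a too-low learning rate.

```python
# Toy illustration: minimize f(w) = (w - 3)^2 with gradient descent.
def descend(lr, steps=50, w=0.0):
    for _ in range(steps):
        grad = 2 * (w - 3)   # derivative of (w - 3)^2
        w -= lr * grad       # the weight update: step size = lr * gradient
    return w

print(descend(0.1))    # moderate LR: converges close to the target, 3
print(descend(1.1))    # too high: overshoots every step and diverges
print(descend(0.001))  # too low: barely moves after 50 steps
```

The same dynamic plays out in fine-tuning, just in millions of dimensions at once: too hot and the weights bounce past every minimum, too cold and training stalls.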

Learning Rate Schedulers

We often use a Scheduler to change the LR during training.

  1. Warmup: Start with an LR of 0 and slowly ramp up. This prevents the model from "crashing" when it first sees the new data.
  2. Decay: Slowly lower the LR as the model gets closer to the goal, allowing it to perform "Final Polishing."
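The two phases above can be sketched in a few lines. This is a hand-rolled simplification for clarity; in practice Hugging Face builds the schedule for you when you pass `lr_scheduler_type="cosine"` and `warmup_steps` to `TrainingArguments`. The step counts here are illustrative.

```python
import math

# Warmup + cosine decay, as a single function of the training step.
def lr_at(step, peak_lr=2e-4, warmup_steps=100, total_steps=1000):
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay: follow a cosine curve from peak_lr down toward 0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(0))     # 0.0    (start of warmup)
print(lr_at(100))   # 0.0002 (peak: warmup just finished)
print(lr_at(1000))  # 0.0    (fully decayed)
```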

2. Batch Size (The Portion Size)

The Batch Size is how many examples the model looks at simultaneously before updating its weights.

  • Large Batch Size (32, 64, 128): The training is more stable and faster (on powerful GPUs). The model gets a "clearer" signal of what it should learn.
  • Small Batch Size (1, 2, 4): The training is "Noisy" but can work on small GPUs.

The Gradient Accumulation Hack

If your GPU is too small for a batch size of 16, you can use Gradient Accumulation.

  • You process a batch of 1.
  • You process another 1.
  • ...you do this 16 times.
  • Only then do you update the weights.
  • Result: The mathematical stability of a batch of 16, with the memory requirements of a batch of 1.
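The equivalence claimed above can be checked numerically. This sketch uses a one-parameter model (predict y = w * x with squared error) rather than a real neural network, purely to show that averaging 16 micro-batch gradients before one update matches a single batch-of-16 update.

```python
# Model: y = w * x. Gradient of squared error wrt w is 2x(wx - y).
data = [(float(i), 2.0 * float(i)) for i in range(1, 17)]  # true w = 2

def grad(w, x, y):
    return 2 * x * (w * x - y)

lr = 0.001

# (a) One update from a full batch of 16 examples.
w_full = 0.5
g = sum(grad(w_full, x, y) / len(data) for x, y in data)
w_full -= lr * g

# (b) Gradient accumulation: 16 micro-batches of size 1, then ONE update.
w_accum, acc = 0.5, 0.0
for x, y in data:
    acc += grad(w_accum, x, y) / len(data)  # scale each micro-grad by 1/16
w_accum -= lr * acc

print(abs(w_full - w_accum) < 1e-12)  # True: the same update either way
```

The only cost is time: you run 16 forward/backward passes per optimizer step, but you never hold more than one example's activations in memory.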

3. Epochs (The Cooking Time)

An Epoch is one full pass through the entire training dataset.

  • Too Many Epochs (> 5): This leads to Overfitting. The model begins to memorize specific training examples rather than learning the general pattern.
  • Too Few Epochs (1 - 2): The model hasn't "Internalized" the new style yet. It still sounds like the base model.
  • The General Rule: For small "Golden Datasets" of 100-500 examples, 3 to 5 epochs is usually perfect.
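Putting the three knobs together, a quick back-of-the-envelope calculation tells you how many optimizer steps a run will take. The dataset size below is an assumption drawn from the "Golden Dataset" range above; the batch settings match the `TrainingArguments` shown later in this lesson.

```python
dataset_size = 500        # examples (upper end of the Golden Dataset range)
per_device_batch = 4
grad_accum = 4
epochs = 3

effective_batch = per_device_batch * grad_accum       # 16
steps_per_epoch = dataset_size // effective_batch     # 31 optimizer steps
total_steps = steps_per_epoch * epochs                # 93

print(effective_batch, steps_per_epoch, total_steps)  # 16 31 93
```

Note how small this number is: with fewer than 100 optimizer steps in total, every hyperparameter choice carries real weight.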

Visualizing Hyperparameter Trade-offs

graph TD
    A["Training Result"] --> B{"Is it underperforming?"}
    
    B -- "Model hasn't changed" --> C["Increase Epochs or Learning Rate"]
    B -- "Model is repetitive/broken" --> D["Decrease Learning Rate or Epochs"]
    B -- "Model is slow to train" --> E["Increase Batch Size"]
    
    subgraph "The Balancing Act"
    C
    D
    E
    end

Implementation: Setting Hyperparameters in Hugging Face

Here is how you define these variables in the TrainingArguments class.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./v1-output",
    
    # 1. Epochs
    num_train_epochs=3,
    
    # 2. Batch Size & Accumulation
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16
    
    # 3. Learning Rate & Scheduler
    learning_rate=2e-4,
    warmup_steps=100,
    lr_scheduler_type="cosine", # Smoothly ramp down
    
    # 4. Optimization
    weight_decay=0.01,
    fp16=True, # Use 16-bit for speed
)

The "Overfitting" Test

How do you know if you've tuned your hyperparameters correctly?

  1. Validation Loss: During training, we set aside 10% of our data (The Validation Set).
  2. Training Loss should go down.
  3. Validation Loss should also go down.
  4. If Training Loss keeps going down but Validation Loss starts going UP, you have overfitted. Stop the training immediately.
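The rule in step 4 can be automated. This is a minimal sketch of the stopping logic, with made-up loss values for illustration: flag the run once validation loss has risen for a couple of evaluations in a row.

```python
def should_stop(val_losses, patience=2):
    """True if validation loss rose for `patience` consecutive evals."""
    if len(val_losses) <= patience:
        return False
    recent = val_losses[-(patience + 1):]
    return all(recent[i + 1] > recent[i] for i in range(patience))

# Validation loss falls, bottoms out, then turns upward -- overfitting.
val = [2.2, 1.8, 1.5, 1.4, 1.6, 1.9]
print(should_stop(val))  # True: time to stop
```

In a real Hugging Face run you don't hand-roll this: the `transformers` library ships an `EarlyStoppingCallback` that applies the same idea when you train with an evaluation set.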

Summary and Key Takeaways

  • Learning Rate controls the speed and precision of weight updates.
  • Batch Size (plus Gradient Accumulation) controls training stability and memory usage.
  • Epochs determine how many times the model sees the data. 3-5 is the sweet spot.
  • The Goal: Find the lowest Learning Rate and the fewest Epochs that still result in the desired behavior change.

In the next lesson, we will look at the math behind these loss curves: Loss Functions and Gradient Descent.


Reflection Exercise

  1. If you triple your dataset size, should you increase or decrease the number of Epochs? (Hint: Does the model need to see the same information as many times if there's more of it?)
  2. Why is a "Cosine" scheduler usually better than a "Constant" learning rate? (Hint: What happens as we get closer to the destination?)

