Dataset Preparation: The Fuel for Fine-Tuning

A model is only as good as the data it trains on. In fine-tuning, quality usually beats quantity: 500 carefully curated examples will often give better results than 50,000 messy ones. As an LLM Engineer, your biggest challenge isn't "running the training script"; it's Dataset Curation.

In this lesson, we will cover formatting, synthetically generating data, and the pitfalls of bad datasets.

1. Instruction Tuning Format (Alpaca vs. ShareGPT)

To fine-tune a model to follow instructions, your data must follow a specific pattern. The two most common formats are:

Alpaca Style:

Best for "Instruction-Response" tasks.

{
  "instruction": "Identify the chemical formula.",
  "input": "Water",
  "output": "H2O"
}
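
A record in this format is not fed to the model as raw JSON; the training script first renders it into a single prompt string. The sketch below assumes the prompt template published with the original Stanford Alpaca release. Your training framework may use a different template, so treat the exact wording as an assumption. The helper name render_alpaca is ours, for illustration only.

```python
# Prompt templates as published with the Stanford Alpaca release (assumed here;
# check which template your own training framework expects).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def render_alpaca(record):
    """Return (prompt, completion) for one Alpaca-style record."""
    # Records with an empty or missing "input" use the shorter template.
    if (record.get("input") or "").strip():
        prompt = PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=record["instruction"])
    return prompt, record["output"]


example = {
    "instruction": "Identify the chemical formula.",
    "input": "Water",
    "output": "H2O",
}
prompt, completion = render_alpaca(example)
print(prompt + completion)
```

During training, the loss is typically computed on the completion only, which is why the helper returns the two parts separately.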

ShareGPT Style:

Best for "Conversational / Multi-turn" agents.

{
  "conversations": [
    {"from": "human", "value": "Hi, who are you?"},
    {"from": "gpt", "value": "I am your AI research assistant."}
  ]
}
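
Many chat templates expect a list of role/content messages rather than the from/value keys shown above. The following is a minimal conversion sketch. The role mapping (human to user, gpt to assistant) is a common convention but still an assumption here, and sharegpt_to_messages is a hypothetical helper name.

```python
# Assumed mapping from ShareGPT speaker labels to chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def sharegpt_to_messages(record):
    """Convert one ShareGPT-style record into a list of role/content messages."""
    messages = []
    for turn in record["conversations"]:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            # Fail loudly instead of silently training on a mislabeled turn.
            raise ValueError(f"Unknown speaker label: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return messages


sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "Hi, who are you?"},
        {"from": "gpt", "value": "I am your AI research assistant."},
    ]
}
messages = sharegpt_to_messages(sharegpt_record)
print(messages)
```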

2. Synthetic Data Generation (The "AI-to-AI" Hack)

What if you don't have 1,000 examples of your "Grumpy Pirate" persona? You can use a more powerful model (like GPT-4o) to generate a training dataset for a smaller model (like Llama 3 8B).

The Workflow:

Write a complex, 2,000-word prompt for GPT-4o explaining the persona.
Feed GPT-4o 100 raw questions and ask it to answer as the persona.
Save those pairs into a JSONL file.
Clean and Audit: Manually check 20% of the results to ensure the style is perfect.
Fine-tune your small model on this "Synthetic Gold."
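
The workflow above can be sketched roughly as follows. To avoid guessing at any vendor's API, the teacher model is represented by a hypothetical generate(system_prompt, question) callable that you would implement with your own client; fake_teacher is a stand-in so the sketch runs offline. The 20% audit fraction comes from the steps above; everything else (function names, file name, record layout) is an assumption.

```python
import json
import os
import random
import tempfile


def build_synthetic_dataset(questions, persona_prompt, generate, output_file):
    """Ask the teacher model each question and save the pairs as JSONL.

    `generate(system_prompt, question) -> str` is a hypothetical callable
    that wraps whatever API client you use for the larger model.
    """
    records = []
    with open(output_file, "w", encoding="utf-8") as f:
        for question in questions:
            answer = generate(persona_prompt, question)
            record = {"instruction": question, "input": "", "output": answer}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            records.append(record)
    return records


def sample_for_audit(records, fraction=0.2, seed=0):
    """Pick a reproducible random subset of records for manual review."""
    if not records:
        return []
    k = min(len(records), max(1, round(len(records) * fraction)))
    return random.Random(seed).sample(records, k)


def fake_teacher(system_prompt, question):
    """Stand-in for the real teacher model so this sketch runs offline."""
    return f"Arr, ye want to know: {question} Fine, I'll tell ye."


questions = [f"Question number {i}?" for i in range(10)]
# A temporary directory keeps the demo from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pirate_persona.jsonl")
    records = build_synthetic_dataset(
        questions, "You are a grumpy pirate.", fake_teacher, path
    )
    with open(path, encoding="utf-8") as f:
        lines_written = sum(1 for _ in f)

audit_batch = sample_for_audit(records, fraction=0.2)
print(f"Wrote {lines_written} records; review {len(audit_batch)} of them by hand.")
```

Fixing the random seed means a second reviewer can reproduce exactly the same audit sample.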

3. The Dangers of Fine-Tuning Data

A. Catastrophic Forgetting

If you train a model only on your medical data, it might become an amazing doctor but "forget" how to perform basic math or follow simple English instructions. The Fix: Use data replay (also called rehearsal, or a "Replay Buffer"). Mix some general-purpose instruction data (like the OpenOrca dataset) into your specialized training mix so the model keeps seeing examples of the skills you want it to retain.
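
A minimal sketch of such a mix, assuming both datasets are already loaded as Python lists. The 20% general-purpose share is an illustrative default, not a recommendation from this lesson; the right ratio depends on your task and should be tuned against an evaluation set. The function name mix_datasets is hypothetical.

```python
import random


def mix_datasets(specialized, general, general_fraction=0.2, seed=0):
    """Return a shuffled mix in which roughly `general_fraction` of the
    examples are general-purpose and the rest are specialized."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve n_general / (n_specialized + n_general) = general_fraction.
    n_general = round(general_fraction * len(specialized) / (1 - general_fraction))
    # Never ask for more general examples than are available.
    n_general = min(n_general, len(general))
    rng = random.Random(seed)
    mixed = list(specialized) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed


specialized = [{"source": "medical", "id": i} for i in range(80)]
general = [{"source": "general", "id": i} for i in range(100)]
mixed = mix_datasets(specialized, general, general_fraction=0.2)
n_general_in_mix = sum(r["source"] == "general" for r in mixed)
print(f"{len(mixed)} examples, {n_general_in_mix} of them general-purpose")
```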

B. Overfitting to Noise

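If your training data contains typos or formatting errors, the model will learn them as "Rules." The Fix: Use Pydantic or Regex to validate the structure of your synthetic data before it enters the training loop. Schema checks catch missing fields, wrong types, and malformed records; typos and stylistic drift still need a spell-checker or a manual read.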
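
The sketch below shows the idea with the standard library only (json and re), so it runs without extra dependencies; the same rules could be expressed as a Pydantic model instead. The required keys follow the Alpaca format from earlier in this lesson. The noise patterns are hypothetical examples, and you would replace them with the artifacts you actually see in your own data.

```python
import json
import re

REQUIRED_KEYS = {"instruction", "input", "output"}

# Hypothetical noise patterns; adapt these to your own dataset.
NOISE_PATTERNS = [
    re.compile(r"as an ai language model", re.IGNORECASE),  # teacher-model boilerplate
    re.compile(r"</?(?:p|br|div|span)\b[^>]*>", re.IGNORECASE),  # leftover HTML tags
]


def validate_line(line):
    """Return a list of problems for one JSONL line; an empty list means it passes."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    problems = []
    if set(record) != REQUIRED_KEYS:
        problems.append(f"unexpected key set: {sorted(record)}")
    for key in sorted(REQUIRED_KEYS & set(record)):
        if not isinstance(record[key], str):
            problems.append(f"{key!r} is not a string")

    output = record.get("output")
    if isinstance(output, str):
        if not output.strip():
            problems.append("empty output")
        for pattern in NOISE_PATTERNS:
            if pattern.search(output):
                problems.append(f"noise pattern matched: {pattern.pattern}")
    return problems


good_line = '{"instruction": "Identify the chemical formula.", "input": "Water", "output": "H2O"}'
bad_line = json.dumps(
    {"instruction": "Greet me.", "input": "", "output": "As an AI language model, I cannot."}
)
print(validate_line(good_line))
print(validate_line(bad_line))
```

Rejected lines are best written to a separate file rather than dropped silently, so you can see which kinds of errors the teacher model tends to make.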

4. Dataset Cleaning Checklist

Before you start your GPU, ensure your dataset passes these checks:

Uniqueness: Remove duplicate question-answer pairs.
Format Consistency: Are all JSON keys identical?
Length Diversity: Do you have a mix of short and long answers?
Correctness: Have you manually verified the "Ground Truth" of the answers?
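
The first three checks can be partly automated; correctness still needs a human. Here is a rough sketch, assuming Alpaca-style records already loaded into a list of dicts. Duplicates are detected by exact match after lowercasing and collapsing whitespace, which will miss paraphrased near-duplicates. All function names are hypothetical.

```python
import statistics
from collections import Counter


def normalise(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(str(text).lower().split())


def deduplicate(records):
    """Uniqueness: keep the first occurrence of each normalised record."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(normalise(rec.get(k, "")) for k in ("instruction", "input", "output"))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def key_report(records):
    """Format consistency: count how many records use each set of JSON keys."""
    return Counter(tuple(sorted(rec)) for rec in records)


def length_report(records):
    """Length diversity: word-count statistics for the outputs."""
    lengths = [len(str(rec.get("output", "")).split()) for rec in records]
    return {
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


records = [
    {"instruction": "Tell me about the sun", "input": "", "output": "The sun is a star."},
    {"instruction": "tell me about  the sun", "input": "", "output": "The sun is a star."},
    {"instruction": "Tell me about Mars", "input": "", "output": "Mars is the red planet."},
]
unique = deduplicate(records)
print(f"{len(records) - len(unique)} duplicate(s) removed")
print("Key sets:", dict(key_report(unique)))
print("Output lengths (words):", length_report(unique))
```

If key_report returns more than one key set, or the minimum and maximum lengths are nearly identical, the dataset probably needs another pass before training.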

Code Concept: Prepping Data with Python

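import json

# Raw (question, answer) pairs; in practice these come from your own source data.
raw_conversations = [
    ("Tell me about the sun", "The sun is a star."),
    ("Tell me about Mars", "Mars is the red planet.")
]

def to_alpaca_jsonl(data, output_file):
    """Write (question, answer) pairs as Alpaca-style records, one JSON object per line."""
    # Explicit UTF-8 so the file is written the same way on every platform.
    with open(output_file, 'w', encoding='utf-8') as f:
        for q, a in data:
            entry = {
                "instruction": "Answer the question factually.",
                "input": q,
                "output": a
            }
            # ensure_ascii=False keeps accented and non-Latin characters readable in the file.
            f.write(json.dumps(entry, ensure_ascii=False) + '\n')

to_alpaca_jsonl(raw_conversations, "train_data.jsonl")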

Summary

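Instruction Tuning: Format your data as JSONL, using Alpaca-style records (Instruction/Input/Output) for single-turn tasks or ShareGPT-style conversations for multi-turn agents.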
Synthetic Data: Use large models to generate training sets for small models.
Balance: Mix general-purpose data with specialized data to avoid "Forgetting."
Quality: Curate your data like a chef curates ingredients. One bad data point can sour the whole model.

In the next lesson, we will look at Evaluation, learning how to measure if your newly fine-tuned model is actually better than the original.

Exercise: Data Architect

You want to fine-tune a model to be a "C++ Code Explainer." You have 10,000 raw C++ files from your company. This is NOT a training dataset.

Describe the 3 steps you would take to turn these raw files into an Alpaca-style JSONL dataset.

Answer Logic:

Extraction: Identify clear functions or classes within the files.
Generation: Use a large LLM to write "Student Questions" and "Expert Summaries" for those snippets.
Structuring: Format them as JSONL, one record per snippet: instruction (Explain this code), input (The C++ snippet), output (The summary).
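
The three steps might be prototyped as below. This is a deliberately naive sketch: the regex only recognises top-level functions whose signature starts at the beginning of a line, and the brace matching ignores braces inside strings and comments, so a real pipeline would more likely use a proper C++ parser. The explain callable is a hypothetical wrapper around your teacher model, and fake_explainer stands in for it here.

```python
import json
import re

# Naive heuristic: return type(s), a function name, a parameter list, then "{".
SIGNATURE = re.compile(r"^(?:[\w:<>\*&]+\s+)+(\w+)\s*\(([^)]*)\)\s*\{", re.MULTILINE)


def extract_functions(source):
    """Step 1 (Extraction): return (name, code) pairs for top-level functions."""
    snippets = []
    pos = 0
    while True:
        match = SIGNATURE.search(source, pos)
        if match is None:
            break
        # The match ends on the opening brace; walk forward to its partner.
        depth = 0
        end = None
        for i in range(match.end() - 1, len(source)):
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        if end is None:
            break  # unbalanced braces: stop rather than emit a broken snippet
        snippets.append((match.group(1), source[match.start():end]))
        pos = end  # skip the body so nested blocks are not re-matched
    return snippets


def to_alpaca_records(snippets, explain):
    """Steps 2 and 3 (Generation, Structuring): `explain(code) -> str` is a
    hypothetical callable that asks a large LLM for an expert summary."""
    return [
        {"instruction": "Explain this code.", "input": code, "output": explain(code)}
        for _name, code in snippets
    ]


def fake_explainer(code):
    """Stand-in for the teacher model so this sketch runs offline."""
    return "PLACEHOLDER: summary written by the teacher model."


SAMPLE = """\
#include <vector>

int add(int a, int b) {
    if (a > 0) {
        return a + b;
    }
    return b;
}

double mean(const std::vector<double>& xs) {
    double total = 0;
    for (double x : xs) { total += x; }
    return total / xs.size();
}
"""

snippets = extract_functions(SAMPLE)
cpp_records = to_alpaca_records(snippets, fake_explainer)
for record in cpp_records:
    print(json.dumps(record))
```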