Module 11 Lesson 3: Training Data Preparation
Garbage in, garbage out: how to format your data in JSONL for successful fine-tuning.
Instruction Tuning: Formatting Your Knowledge
The quality of your fine-tuned model depends 1% on the algorithm and 99% on your data. If you feed the model sloppy examples, it will speak sloppily.
The industry standard for training data is JSONL (JSON Lines).
1. What is JSONL?
It is a text file where every line is a complete, valid JSON object. This allows training tools to read the file one line at a time without loading a 2GB JSON file into RAM.
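A minimal sketch of that streaming pattern in Python (the filename train.jsonl and the instruction field are placeholders):

```python
import json

# Stream a JSONL file one record at a time; memory use stays
# constant no matter how large the file grows.
def read_jsonl(path):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

for example in read_jsonl("train.jsonl"):
    print(example["instruction"])
```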
2. The Instruction Format (Alpaca / ChatML)
To train a chat-based model, you must provide structured example turns. The most common layout for local fine-tuning is the Alpaca format, with three fields per line (ChatML is the other common option; it wraps each turn in role tags instead):
{"instruction": "Translate this to Pirate", "input": "Hello friend", "output": "Ahoy there, matey!"}
{"instruction": "Translate this to Pirate", "input": "Where is the boat?", "output": "Where be the vessel, ye landlubber?"}
- Instruction: What the user wants.
- Input: The specific data to operate on (optional).
- Output: The "perfect" answer you want the AI to learn.
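A minimal sketch of writing examples in this format (field names follow the Alpaca convention above; the output path is a placeholder):

```python
import json

examples = [
    {
        "instruction": "Translate this to Pirate",
        "input": "Hello friend",
        "output": "Ahoy there, matey!",
    },
]

# json.dumps guarantees each record is a single valid JSON line;
# ensure_ascii=False keeps any non-ASCII text readable in the file.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```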
3. How Much Data Do You Need?
- 10–100 examples: The model will learn a very specific style or a handful of new product names.
- 500–1,000 examples: The model will learn a complex new format (such as a custom programming language).
- 5,000+ examples: Enough for significant knowledge injection or new-language support.
4. Tips for High-Quality Data
- Diversity: Don't repeat the same question 100 times. Change the phrasing!
- Accuracy: Double-check every "Output" line. If you have a typo in your training data, your model will have that typo forever.
- Synthetic Data: You can use a larger, smarter model (like GPT-4o) to generate the training data for your smaller local model. This is called "Knowledge Distillation" (a sketch follows below).
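A minimal sketch of that distillation loop, assuming the official openai Python package (v1+) and an OPENAI_API_KEY environment variable; the phrasing list and file name are illustrative:

```python
import json
from openai import OpenAI  # assumes the official openai package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

phrasings = ["Hello friend", "Where is the boat?", "Good morning!"]

with open("synthetic.jsonl", "w", encoding="utf-8") as f:
    for text in phrasings:
        # Ask the larger model to produce the "perfect" output.
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Translate this to Pirate: {text}"}],
        )
        example = {
            "instruction": "Translate this to Pirate",
            "input": text,
            "output": resp.choices[0].message.content,
        }
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

Even with synthetic data, spot-check the generated outputs by hand before training; the "Accuracy" rule above still applies.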
5. Cleaning Your Data
Before training:
- Remove duplicates.
- Ensure all text is UTF-8 encoded.
- Use a tool like `jsonl-validator` (or a short script like the sketch below) to ensure there are no broken brackets that will crash your training run halfway through.
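As a sketch of that checklist (file names are placeholders), a short Python script can validate and deduplicate in one pass:

```python
import json

def clean_jsonl(src, dst):
    """Validate, deduplicate, and rewrite a JSONL file."""
    seen = set()
    kept = 0
    with open(src, "r", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for lineno, line in enumerate(fin, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)  # catches broken brackets early
            except json.JSONDecodeError as err:
                print(f"Line {lineno} is invalid JSON: {err}")
                continue
            key = json.dumps(record, sort_keys=True)  # canonical form for dedup
            if key in seen:
                continue
            seen.add(key)
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    print(f"Kept {kept} unique, valid examples.")

clean_jsonl("raw.jsonl", "train.jsonl")
```

Opening the file with encoding="utf-8" also surfaces non-UTF-8 bytes as an error instead of letting them slip into training.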
Key Takeaways
- JSONL is the de facto standard file format for local fine-tuning.
- Instruction/Input/Output is the standard structure for teaching a model.
- Quality Beats Quantity: 100 perfect examples are better than 1,000 sloppy ones.
- Knowledge Distillation uses a larger model to create the training data for your local model.