Module 11 Lesson 3: Training Data Preparation
Garbage in, garbage out: how to format your data in JSONL for successful fine-tuning.
Instruction Tuning: Formatting Your Knowledge
The quality of your fine-tuned model depends 1% on the algorithm and 99% on your data. If you feed the model sloppy examples, it will speak sloppily.
The industry standard for training data is JSONL (JSON Lines).
1. What is JSONL?
It is a text file where every line is a complete, valid JSON object. This allows training tools to read the file one line at a time without loading a 2GB JSON file into RAM.
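A minimal sketch of that streaming pattern in Python (the filename train.jsonl and the instruction field are placeholders):

```python
import json

# Stream a JSONL file one record at a time; memory use stays
# constant no matter how large the file grows.
def read_jsonl(path):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

for example in read_jsonl("train.jsonl"):
    print(example["instruction"])
```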
2. The Instruction Format (Alpaca / ChatML)
To train a chat-based model, you must provide structured example turns. The most common layout for local fine-tuning is the Alpaca format, with three fields per line (ChatML is the other common option; it wraps each turn in role tags instead):
{"instruction": "Translate this to Pirate", "input": "Hello friend", "output": "Ahoy there, matey!"}
{"instruction": "Translate this to Pirate", "input": "Where is the boat?", "output": "Where be the vessel, ye landlubber?"}
- Instruction: What the user wants.
- Input: The specific data to operate on (optional).
- Output: The "perfect" answer you want the AI to learn.
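A minimal sketch of writing examples in this format (field names follow the Alpaca convention above; the output path is a placeholder):

```python
import json

examples = [
    {
        "instruction": "Translate this to Pirate",
        "input": "Hello friend",
        "output": "Ahoy there, matey!",
    },
]

# json.dumps guarantees each record is a single valid JSON line;
# ensure_ascii=False keeps any non-ASCII text readable in the file.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```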
3. How Much Data Do You Need?
- 10–100 examples: The model will learn a very specific style or a handful of new product names.
- 500–1,000 examples: The model will learn a complex new format (such as a custom programming language).
- 5,000+ examples: Enough for significant knowledge injection or new-language support.
4. Tips for High-Quality Data
- Diversity: Don't repeat the same question 100 times. Change the phrasing!
- Accuracy: Double-check every "Output" line. If you have a typo in your training data, your model will have that typo forever.
- Synthetic Data: You can use a larger, smarter model (like GPT-4o) to generate the training data for your smaller local model. This is called "Knowledge Distillation" (a sketch follows below).
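A minimal sketch of that distillation loop, assuming the official openai Python package (v1+) and an OPENAI_API_KEY environment variable; the phrasing list and file name are illustrative:

```python
import json
from openai import OpenAI  # assumes the official openai package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

phrasings = ["Hello friend", "Where is the boat?", "Good morning!"]

with open("synthetic.jsonl", "w", encoding="utf-8") as f:
    for text in phrasings:
        # Ask the larger model to produce the "perfect" output.
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Translate this to Pirate: {text}"}],
        )
        example = {
            "instruction": "Translate this to Pirate",
            "input": text,
            "output": resp.choices[0].message.content,
        }
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

Even with synthetic data, spot-check the generated outputs by hand before training; the "Accuracy" rule above still applies.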
5. Cleaning Your Data
Before training:
- Remove duplicates.
- Ensure all text is UTF-8 encoded.
- Use a tool like `jsonl-validator` (or a short script like the sketch below) to ensure there are no broken brackets that will crash your training run halfway through.
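As a sketch of that checklist (file names are placeholders), a short Python script can validate and deduplicate in one pass:

```python
import json

def clean_jsonl(src, dst):
    """Validate, deduplicate, and rewrite a JSONL file."""
    seen = set()
    kept = 0
    with open(src, "r", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for lineno, line in enumerate(fin, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)  # catches broken brackets early
            except json.JSONDecodeError as err:
                print(f"Line {lineno} is invalid JSON: {err}")
                continue
            key = json.dumps(record, sort_keys=True)  # canonical form for dedup
            if key in seen:
                continue
            seen.add(key)
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    print(f"Kept {kept} unique, valid examples.")

clean_jsonl("raw.jsonl", "train.jsonl")
```

Opening the file with encoding="utf-8" also surfaces non-UTF-8 bytes as an error instead of letting them slip into training.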
Key Takeaways
- JSONL is the de facto standard file format for local fine-tuning.
- Instruction/Input/Output is the standard structure for teaching a model.
- Quality Beats Quantity: 100 perfect examples are better than 1,000 sloppy ones.
- Knowledge Distillation uses a larger model to create the training data for your local model.