Module 11 Lesson 3: Training Data Preparation


Garbage in, garbage out. How to format your data in JSONL for successful fine-tuning.

Instruction Tuning: Formatting Your Knowledge

The quality of your fine-tuned model depends 1% on the algorithm and 99% on your data. If you give the model sloppy examples, it will speak sloppily.

The industry standard for training data is JSONL (JSON Lines).

1. What is JSONL?

It is a text file where every line is a complete, valid JSON object. This allows training tools to read the file one line at a time without loading a 2GB JSON file into RAM.
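That line-at-a-time property is easy to see in code. A minimal sketch of a streaming JSONL reader using Python's standard json module (the filename is just an example):

```python
import json

def iter_jsonl(path):
    """Yield one parsed example at a time -- the whole file never sits in RAM."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Usage: process a multi-gigabyte file with constant memory.
# for example in iter_jsonl("train.jsonl"):
#     process(example)
```

Because each line is independent, you can also split, shuffle, or sample a JSONL file with ordinary text tools like shuf and head.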


2. The Instruction Format (Alpaca / ChatML)

To train an instruction-following model, every example must pair a request with the response you want. The Alpaca schema shown below is the simplest version; chat templates like ChatML express the same idea as a list of role-tagged "turns":

{"instruction": "Translate this to Pirate", "input": "Hello friend", "output": "Ahoy there, matey!"}
{"instruction": "Translate this to Pirate", "input": "Where is the boat?", "output": "Where be the vessel, ye landlubber?"}
  • Instruction: What the user wants.
  • Input: The specific data to operate on (optional; leave it empty when there is none).
  • Output: The "perfect" answer you want the AI to learn.
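Writing those examples by hand invites escaping mistakes (stray quotes, unescaped newlines). A minimal sketch that serializes them safely with Python's standard json module:

```python
import json

# The two pirate examples from above, as plain Python dicts.
examples = [
    {"instruction": "Translate this to Pirate",
     "input": "Hello friend",
     "output": "Ahoy there, matey!"},
    {"instruction": "Translate this to Pirate",
     "input": "Where is the boat?",
     "output": "Where be the vessel, ye landlubber?"},
]

def write_jsonl(rows, path):
    """Serialize each example as exactly one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # ensure_ascii=False keeps non-English text readable as UTF-8
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Letting json.dumps handle the quoting means a newline inside an output string becomes \n in the file instead of silently breaking the one-object-per-line rule.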

3. How Much Data Do You Need?

  • 10 - 100 Examples: (Few-shot) The model will learn a very specific style or a few new product names.
  • 500 - 1,000 Examples: The model will learn a complex new format (like a custom programming language).
  • 5,000+ Examples: Significant knowledge injection or new language support.

4. Tips for High-Quality Data

  1. Diversity: Don't repeat the same question 100 times. Change the phrasing!
  2. Accuracy: Double-check every "Output" line. If you have a typo in your training data, your model will have that typo forever.
  3. Synthetic Data: You can use a larger, smarter model (like GPT-4o) to generate the training data for your smaller local model. This is called "Knowledge Distillation."
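The distillation loop from tip 3 can be sketched in a few lines. Here `ask_teacher` is a hypothetical placeholder for a call to your larger model's API, with canned answers so the sketch stays runnable; swap in a real client call in practice:

```python
import json

def ask_teacher(prompt: str) -> str:
    """Placeholder for a call to a larger 'teacher' model (hypothetical).
    Replace the canned lookup with a real API call in production."""
    canned = {
        "Hello friend": "Ahoy there, matey!",
        "Where is the boat?": "Where be the vessel, ye landlubber?",
    }
    return canned[prompt]

def distill(inputs, instruction, path):
    """Have the teacher write the 'perfect' outputs, then save them as JSONL."""
    with open(path, "w", encoding="utf-8") as f:
        for text in inputs:
            row = {"instruction": instruction,
                   "input": text,
                   "output": ask_teacher(text)}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Even with synthetic data, tip 2 still applies: spot-check the teacher's outputs before training, because the student will faithfully learn the teacher's mistakes too.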

5. Cleaning Your Data

Before training:

  • Remove duplicates.
  • Ensure all text is UTF-8 encoded.
  • Validate that every line parses as JSON, so a single broken bracket doesn't crash your training run halfway through.
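All three cleaning steps fit into one small pass over the file. A minimal sketch, again using only Python's standard json module:

```python
import json

def clean_jsonl(src, dst):
    """Drop duplicates and malformed lines before they reach the trainer."""
    seen, kept, dropped = set(), 0, 0
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                dropped += 1  # broken bracket, bad escape, truncated line
                continue
            key = json.dumps(row, sort_keys=True)  # canonical form for dedup
            if key in seen:
                dropped += 1
                continue
            seen.add(key)
            fout.write(json.dumps(row, ensure_ascii=False) + "\n")
            kept += 1
    return kept, dropped
```

Opening the file with encoding="utf-8" also covers the encoding check: a line that isn't valid UTF-8 will raise an error here, long before it can poison a training run.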

Key Takeaways

  • JSONL is the de-facto standard file format for local fine-tuning.
  • Instruction/Input/Output is the standard structure for teaching a model.
  • Quality Beats Quantity: 100 perfect examples are better than 1,000 sloppy ones.
  • Knowledge Distillation uses a larger model to create the training data for your local model.
