Dataset Preparation for Tuning

Garbage in, garbage out. Learn how to format, clean, and balance your dataset for successful Gemini fine-tuning.

Dataset Preparation

Roughly 90% of your fine-tune's success comes down to data quality: the model will faithfully learn whatever patterns your examples contain, good or bad.

Format

Google AI Studio accepts CSV or JSONL. For CSV, each row needs two columns:

  • input_text: The user prompt.
  • output_text: The ideal, perfect model response.

For JSONL, each line is one training example in the conversational format:
{"messages": [{"role": "user", "content": "Hi"}, {"role": "model", "content": "Greetings, traveler!"}]}
{"messages": [{"role": "user", "content": "Bye"}, {"role": "model", "content": "Safe travels!"}]}

Quality Control

  1. Diversity: Don't just include 100 variations of "Hi". Cover hard questions, easy questions, and edge cases (the report sketch after this list shows a quick way to spot gaps).
  2. Consistency: Make sure every output_text follows the same style guidelines. If 50% are polite and 50% are rude, the model learns neither style reliably.
  3. Size:
    • Minimum: ~20 examples (for simple style transfer).
    • Recommended: 100 - 500 examples.
    • Too many: >10,000 examples usually yield diminishing returns for simple tuning tasks and cost more.
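
A few summary statistics catch most size and diversity problems early. The sketch below counts examples, duplicated prompts, and the spread of prompt lengths; train.jsonl and the messages format are the same assumptions as before.

```python
import json
from collections import Counter

# Report dataset size, duplicated prompts, and prompt-length spread
# as a rough diversity signal. Assumes the JSONL format shown earlier.

def report(path: str) -> None:
    prompts, lengths = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            prompt = json.loads(line)["messages"][0]["content"]
            prompts.append(prompt)
            lengths.append(len(prompt.split()))

    counts = Counter(prompts)
    dupes = sum(1 for n in counts.values() if n > 1)
    print(f"examples:           {len(prompts)}")
    print(f"unique prompts:     {len(counts)}")
    print(f"duplicated prompts: {dupes}")
    print(f"prompt words: min={min(lengths)}, max={max(lengths)}, "
          f"mean={sum(lengths) / len(lengths):.1f}")

if __name__ == "__main__":
    report("train.jsonl")  # assumed filename
```

If the length spread is narrow or a handful of prompts dominate, that is your cue to add harder and more varied examples.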

Cleaning

  • Remove PII (personally identifiable information).
  • Remove duplicate rows (they bias the model toward whatever repeats); the sketch after this list handles both this and a basic PII check.
  • Spell-check the outputs (you don't want to teach the model to misspell).
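
The first two steps are easy to automate. The sketch below drops exact duplicates and flags rows matching naive PII patterns; the regexes (emails, US-style phone numbers) are illustrative only, not a substitute for a dedicated PII scanner, and both filenames are assumptions.

```python
import re

# Drop exact-duplicate rows and flag rows that look like they contain
# PII. The patterns below are illustrative, not exhaustive: use a
# dedicated scanner for anything sensitive.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),            # email addresses
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),  # phone numbers
]

def clean(src: str, dst: str) -> None:
    seen, kept, flagged = set(), 0, 0
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line or line in seen:
                continue  # skip blank lines and exact duplicates
            seen.add(line)
            if any(p.search(line) for p in PII_PATTERNS):
                flagged += 1  # hold out for manual review, don't train on it
                continue
            fout.write(line + "\n")
            kept += 1
    print(f"kept {kept} rows, flagged {flagged} for PII review")

if __name__ == "__main__":
    clean("train.jsonl", "train.clean.jsonl")  # assumed filenames
```

Spell-checking is harder to automate reliably, so at minimum skim a random sample of outputs by hand.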

Summary

Curate your dataset like a museum exhibit. Only the best examples get in.

In the next lesson, we look at Parameters and Hyperparameters.
