
Dataset Preparation for Tuning
Garbage in, garbage out. Learn how to format, clean, and balance your dataset for successful Gemini fine-tuning.
Dataset Preparation
The success of your fine-tune is 90% determined by your data quality.
Format
Google AI Studio accepts CSV or JSONL. Each row needs:
- input_text: The user prompt.
- output_text: The ideal model response.
In JSONL, each line holds a complete conversation:
{"messages": [{"role": "user", "content": "Hi"}, {"role": "model", "content": "Greetings, traveler!"}]}
{"messages": [{"role": "user", "content": "Bye"}, {"role": "model", "content": "Safe travels!"}]}
Quality Control
- Diversity: Don't just have 100 examples of "Hi". Have examples of hard questions, easy questions, and edge cases.
- Consistency: Make sure all output_text examples follow the same style guidelines. If 50% are polite and 50% are rude, the model will just be confused.
- Size:
- Minimum: ~20 examples (for simple style transfer).
- Recommended: 100 - 500 examples.
- Too Many: >10,000 examples usually yield diminishing returns for simple tuning tasks and cost more (a quick size-and-repetition check is sketched after this list).
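To keep an eye on these numbers, you can compute a few basic stats before tuning. A rough sanity check, assuming the JSONL chat format shown earlier; the thresholds mirror the guidance above, and dataset_stats and the file name are made up for illustration:

import json
from collections import Counter

def dataset_stats(path):
    """Report dataset size and repeated prompts for a JSONL chat dataset."""
    prompts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            messages = json.loads(line)["messages"]
            # Join all user turns so a multi-turn example counts as one prompt.
            prompts.append(" ".join(m["content"] for m in messages if m["role"] == "user"))
    n = len(prompts)
    print(f"{n} examples")
    if n < 20:
        print("Warning: below the ~20-example minimum for simple style transfer.")
    elif n < 100:
        print("Note: 100-500 examples is the recommended range.")
    repeats = sum(1 for count in Counter(prompts).values() if count > 1)
    print(f"{repeats} prompts appear more than once")

dataset_stats("train.jsonl")  # hypothetical file name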
Cleaning
- Remove PII (Personally Identifiable Information).
- Remove duplicate rows (they bias the model; see the cleaning sketch below).
- Spell-check the outputs (you don't want to teach the model to misspell).
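A first pass at both steps can be scripted. The sketch below drops exact duplicate rows and holds back rows matching naive email and phone-number patterns; the regexes are illustrative only, not a substitute for a dedicated PII scrubber, and the file names are hypothetical:

import json
import re

# Naive patterns for a first-pass PII scan only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def clean_dataset(in_path, out_path):
    """Drop exact duplicate rows and hold back rows that look like they contain PII."""
    seen = set()
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for lineno, line in enumerate(src, start=1):
            line = line.strip()
            if not line or line in seen:
                continue  # skip blank lines and exact duplicates
            seen.add(line)
            text = " ".join(m["content"] for m in json.loads(line)["messages"])
            if EMAIL.search(text) or PHONE.search(text):
                print(f"line {lineno}: possible PII, review before keeping")
                continue
            dst.write(line + "\n")

clean_dataset("train.jsonl", "train.clean.jsonl")  # hypothetical file names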
Summary
Curate your dataset like a museum exhibit. Only the best examples get in.
In the next lesson, we look at Parameters and Hyperparameters.