Dataset Preparation for Tuning

Garbage in, garbage out. Learn how to format, clean, and balance your dataset for successful Gemini fine-tuning.

Dataset Preparation

Roughly 90% of your fine-tune's success comes down to data quality: the model will faithfully learn whatever patterns your examples contain, good or bad.

Format

Google AI Studio accepts CSV or JSONL. For CSV, each row needs two columns:

  • input_text: The user prompt.
  • output_text: The ideal, perfect model response.

For JSONL, each line is one training example in the conversational format:
{"messages": [{"role": "user", "content": "Hi"}, {"role": "model", "content": "Greetings, traveler!"}]}
{"messages": [{"role": "user", "content": "Bye"}, {"role": "model", "content": "Safe travels!"}]}

Quality Control

  1. Diversity: Don't just include 100 variations of "Hi". Cover hard questions, easy questions, and edge cases (the report sketch after this list shows a quick way to spot gaps).
  2. Consistency: Make sure every output_text follows the same style guidelines. If 50% are polite and 50% are rude, the model learns neither style reliably.
  3. Size:
    • Minimum: ~20 examples (for simple style transfer).
    • Recommended: 100 - 500 examples.
    • Too many: >10,000 examples usually yield diminishing returns for simple tuning tasks and cost more.
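
A few summary statistics catch most size and diversity problems early. The sketch below counts examples, duplicated prompts, and the spread of prompt lengths; train.jsonl and the messages format are the same assumptions as before.

```python
import json
from collections import Counter

# Report dataset size, duplicated prompts, and prompt-length spread
# as a rough diversity signal. Assumes the JSONL format shown earlier.

def report(path: str) -> None:
    prompts, lengths = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            prompt = json.loads(line)["messages"][0]["content"]
            prompts.append(prompt)
            lengths.append(len(prompt.split()))

    counts = Counter(prompts)
    dupes = sum(1 for n in counts.values() if n > 1)
    print(f"examples:           {len(prompts)}")
    print(f"unique prompts:     {len(counts)}")
    print(f"duplicated prompts: {dupes}")
    print(f"prompt words: min={min(lengths)}, max={max(lengths)}, "
          f"mean={sum(lengths) / len(lengths):.1f}")

if __name__ == "__main__":
    report("train.jsonl")  # assumed filename
```

If the length spread is narrow or a handful of prompts dominate, that is your cue to add harder and more varied examples.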

Cleaning

  • Remove PII (personally identifiable information).
  • Remove duplicate rows (they bias the model toward whatever repeats); the sketch after this list handles both this and a basic PII check.
  • Spell-check the outputs (you don't want to teach the model to misspell).
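
The first two steps are easy to automate. The sketch below drops exact duplicates and flags rows matching naive PII patterns; the regexes (emails, US-style phone numbers) are illustrative only, not a substitute for a dedicated PII scanner, and both filenames are assumptions.

```python
import re

# Drop exact-duplicate rows and flag rows that look like they contain
# PII. The patterns below are illustrative, not exhaustive: use a
# dedicated scanner for anything sensitive.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),            # email addresses
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),  # phone numbers
]

def clean(src: str, dst: str) -> None:
    seen, kept, flagged = set(), 0, 0
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line or line in seen:
                continue  # skip blank lines and exact duplicates
            seen.add(line)
            if any(p.search(line) for p in PII_PATTERNS):
                flagged += 1  # hold out for manual review, don't train on it
                continue
            fout.write(line + "\n")
            kept += 1
    print(f"kept {kept} rows, flagged {flagged} for PII review")

if __name__ == "__main__":
    clean("train.jsonl", "train.clean.jsonl")  # assumed filenames
```

Spell-checking is harder to automate reliably, so at minimum skim a random sample of outputs by hand.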

Summary

Curate your dataset like a museum exhibit. Only the best examples get in.

In the next lesson, we look at Parameters and Hyperparameters.
