
Automated Format Validation Scripts: The Pre-Flight Check
Learn how to build a robust validation script that catches missing keys, invalid JSON, and role errors before you spend a dime on training.
In the previous lessons, we learned how to format data for OpenAI, Bedrock, and Vertex AI. But formatting is a manual process, and humans make mistakes. A single missing comma or one malformed record can cause an expensive, long-running training job to crash halfway through, wasting hours of time and hundreds of dollars.
In the industry, we use Validation Scripts. These are small, non-AI Python scripts that "stress-test" your JSONL file. They check for syntax errors, schema violations, and logic errors (like a "conversation" that has no "User" message).
In this final lesson of Module 6, we will build a production-grade validator for OpenAI's ChatML format.
What a Good Validator Should Check
A professional validator doesn't just check whether the JSON is valid; it checks for:
- Strict Schema: Does every line have a 'messages' key? Does every message have a 'role' and 'content'? (See the example line after this list.)
- Role Logic: Does the conversation have at least one 'user' and one 'assistant' message?
- Token Budget: Is the total length of the conversation within the model's limit (e.g., 4,096 tokens)?
- Character Encoding: Are there any broken UTF-8 characters?
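For reference, here is one illustrative well-formed record that passes all four checks. In the actual file, each record sits on exactly one line:

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is JSONL?"}, {"role": "assistant", "content": "JSONL is a text format in which each line is a standalone JSON object."}]}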
Implementation: The "Aero-Validator" Script
This script is your "Pre-Flight" checklist. Before you upload a file to a fine-tuning service, you should run this script and fix every warning.
import json
import os

def validate_fine_tuning_jsonl(file_path):
    print(f"--- Starting Validation for {file_path} ---")
    errors = 0
    warnings = 0

    if not os.path.exists(file_path):
        print(f"[ERROR] File not found: {file_path}")
        return False

    with open(file_path, 'r', encoding='utf-8') as f:
        for idx, line in enumerate(f):
            line_num = idx + 1

            # --- 1. JSON Syntax Check ---
            try:
                data = json.loads(line)
            except json.JSONDecodeError:
                print(f"[ERROR] Line {line_num}: Invalid JSON syntax.")
                errors += 1
                continue

            # --- 2. Root Key Check ---
            if "messages" not in data:
                print(f"[ERROR] Line {line_num}: Missing 'messages' root key.")
                errors += 1
                continue

            messages = data["messages"]

            # --- 3. Message Sequence Logic ---
            roles = [m.get("role") for m in messages]
            if "user" not in roles:
                print(f"[WARNING] Line {line_num}: No 'user' message found.")
                warnings += 1
            if "assistant" not in roles:
                print(f"[ERROR] Line {line_num}: No 'assistant' message found (nothing for the model to learn!).")
                errors += 1

            # --- 4. Content Check ---
            for m_idx, msg in enumerate(messages):
                # Flag missing, non-string, or whitespace-only content
                if not isinstance(msg.get("content"), str) or not msg["content"].strip():
                    print(f"[ERROR] Line {line_num}, Message {m_idx}: Empty content.")
                    errors += 1

            # --- 5. Token Limit Check (Heuristic) ---
            # Roughly 1 token per 4 characters as a safe baseline
            total_chars = sum(len(m["content"]) for m in messages
                              if isinstance(m.get("content"), str))
            if total_chars > 16000:  # Approx. 4,000 tokens
                print(f"[WARNING] Line {line_num}: Conversation is very long (~4,000+ tokens).")
                warnings += 1

    print(f"--- Validation Complete: {errors} Errors, {warnings} Warnings ---")
    return errors == 0

# Usage
# validate_fine_tuning_jsonl('my_training_data.jsonl')
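To see the validator in action before trusting it with real data, you can run it against a deliberately broken file. Here is a minimal smoke test (the file name and records below are illustrative):

# Write a two-line test file: one valid record, one missing its assistant reply
sample = (
    '{"messages": [{"role": "user", "content": "Hi"}, '
    '{"role": "assistant", "content": "Hello!"}]}\n'
    '{"messages": [{"role": "user", "content": "Hi"}]}\n'
)
with open('smoke_test.jsonl', 'w', encoding='utf-8') as f:
    f.write(sample)

validate_fine_tuning_jsonl('smoke_test.jsonl')
# Expected: Line 2 is flagged with "No 'assistant' message found"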
Why "Warnings" Matter
Notice that we separate Errors (which break the training) from Warnings (which merely degrade the model's quality).
- A line without an assistant message is an ERROR. The trainer literally has no target tokens to compute a gradient from.
- A line that is too long is a WARNING. Training will still run, but the trainer may truncate the conversation and lose the final answer.
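The 4-characters-per-token rule in Step 5 is only a rough heuristic. For an exact count, you can swap in a real tokenizer. Here is a sketch using the tiktoken library, assuming it is installed and that your model uses the cl100k_base encoding:

import tiktoken

# cl100k_base is the encoding used by the gpt-3.5-turbo / gpt-4 family
enc = tiktoken.get_encoding("cl100k_base")

def count_conversation_tokens(messages):
    # Sums tokens over message contents only; ignores the small
    # per-message formatting overhead the API adds at training time.
    return sum(len(enc.encode(m["content"]))
               for m in messages
               if isinstance(m.get("content"), str))

With an exact count, you can set the threshold to your model's real context window instead of guessing.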
Continuous Integration (CI) for AI Data
If you are working in a professional team, you should add this validator to your GitHub Actions or CI Pipeline.
Every time a curator adds a "Golden Example" to the data/ folder, the script runs automatically, preventing malformed records from contaminating your master training set.
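In a CI pipeline, the only contract that matters is the process exit code. Here is a minimal sketch of an entry point, assuming the validator above lives in a file called validate.py (the file and data paths are illustrative):

import sys

if __name__ == "__main__":
    # Exit non-zero so the CI step is marked as failed when errors exist
    passed = validate_fine_tuning_jsonl(sys.argv[1])
    sys.exit(0 if passed else 1)

Your pipeline then simply runs python validate.py data/train.jsonl on every pull request and blocks the merge if validation fails.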
Summary and Key Takeaways
- Validation is Mandatory: Never start a training job without a validation script.
- Schema Consistency: Ensure every line follows the exact roles and keys required by your provider.
- Logic over Syntax: Check that the model actually has something to learn (an assistant response).
- Automation: Integrate validation into your development workflow to catch errors early.
You have completed Module 6! You are now a master of Data Formatting and Integrity.
In Module 7, we will move to the "Edge" of the model: Tokenization and Input Preparation, where we look at how text is converted into the numbers the GPU understands.
Reflection Exercise
- Why does the script check for "Empty Content"? (Hint: What happens to a mathematical 'Loss function' if the target response is nothing?)
- If your model supports a 128k context window, would you change the "Heuristic" check in Step 5?