Automated Format Validation Scripts

The 'Pre-Flight' Check. Learn how to build a robust validation script to catch missing keys, invalid JSON, and role errors before you spend a dime on training.

Automated Format Validation Scripts: The Pre-Flight Check

In the previous lessons, we learned how to format data for OpenAI, Bedrock, and Vertex AI. But formatting is a manual process, and humans make mistakes. One missing comma or one stray character can crash an expensive training job halfway through, wasting hours of time and hundreds of dollars.

In the industry, we use Validation Scripts. These are small, non-AI Python scripts that "stress-test" your JSONL file. They check for syntax errors, schema violations, and logic errors (like a "conversation" that has no "User" message).

In this final lesson of Module 6, we will build a production-grade validator for OpenAI's ChatML format.


What a Good Validator Should Check

A professional validator doesn't just check whether the JSON is valid; it checks for:

  1. Strict Schema: Does every line have a messages key? Does every message have a role and content?
  2. Role Logic: Does it have at least one user and one assistant message?
  3. Token Budget: Is the total length of the conversation within the model's limit (e.g., 4,096 tokens)?
  4. Character Encoding: Are there any "Broken" UTF-8 characters?
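Before building the full validator, it helps to see what lines tripping the first checks actually look like. The samples below are hypothetical, constructed only to trigger each failure class:

```python
import json

# Hypothetical sample lines, each constructed to trip one of the checks above.
samples = {
    "valid": '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}',
    "bad_syntax": '{"messages": [{"role": "user", "content": "Hi"}',  # unclosed brackets
    "missing_root": '{"conversation": []}',  # wrong root key
    "no_assistant": '{"messages": [{"role": "user", "content": "Hi"}]}',  # nothing to learn
}

for name, line in samples.items():
    try:
        data = json.loads(line)
    except json.JSONDecodeError:
        print(f"{name}: invalid JSON syntax")
        continue
    if "messages" not in data:
        print(f"{name}: missing 'messages' root key")
    elif "assistant" not in [m.get("role") for m in data["messages"]]:
        print(f"{name}: no assistant message")
    else:
        print(f"{name}: ok")
```

Only the first sample would survive a strict validator; the other three represent the three most common ways real datasets break.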

Implementation: The "Aero-Validator" Script

This script is your "Pre-Flight" checklist. Before you upload a file to a fine-tuning service, you should run this script and fix every warning.

import json
import os

def validate_fine_tuning_jsonl(file_path):
    print(f"--- Starting Validation for {file_path} ---")
    errors = 0
    warnings = 0
    
    if not os.path.exists(file_path):
        print(f"[ERROR] File not found: {file_path}")
        return False

    with open(file_path, 'r', encoding='utf-8') as f:
        for idx, line in enumerate(f):
            line_num = idx + 1
            line = line.strip()
            if not line:
                continue  # skip blank lines (common at the end of a file)
            
            # --- 1. JSON Syntax Check ---
            try:
                data = json.loads(line)
            except json.JSONDecodeError:
                print(f"[ERROR] Line {line_num}: Invalid JSON syntax.")
                errors += 1
                continue
            
            # --- 2. Root Key Check ---
            if "messages" not in data:
                print(f"[ERROR] Line {line_num}: Missing 'messages' root key.")
                errors += 1
                continue
            
            messages = data["messages"]
            if not isinstance(messages, list) or not messages:
                print(f"[ERROR] Line {line_num}: 'messages' must be a non-empty list.")
                errors += 1
                continue
            
            # --- 3. Message Sequence Logic ---
            roles = [m.get("role") for m in messages]
            
            if "user" not in roles:
                print(f"[WARNING] Line {line_num}: No 'user' message found.")
                warnings += 1
            if "assistant" not in roles:
                print(f"[ERROR] Line {line_num}: No 'assistant' message found (Nothing for the model to learn!).")
                errors += 1
                
            # --- 4. Content Check ---
            for m_idx, msg in enumerate(messages):
                content = msg.get("content")
                if not isinstance(content, str) or not content.strip():
                    print(f"[ERROR] Line {line_num}, Message {m_idx}: Empty or non-string content.")
                    errors += 1
                    
            # --- 5. Token Limit Check (Heuristic) ---
            # Roughly 1 token per 4 chars as a safe baseline
            total_chars = sum(len(m["content"]) for m in messages if isinstance(m.get("content"), str))
            if total_chars > 16000: # Approx 4000 tokens
                print(f"[WARNING] Line {line_num}: Conversation is very long (~4000+ tokens).")
                warnings += 1

    print(f"--- Validation Complete: {errors} Errors, {warnings} Warnings ---")
    return errors == 0

# Usage
# validate_fine_tuning_jsonl('my_training_data.jsonl')
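The bare "Invalid JSON syntax" message in Step 1 can be made more actionable: `json.JSONDecodeError` exposes `msg`, `lineno`, and `colno`, so the validator can point curators at the exact character that broke. A minimal sketch (the sample line is hypothetical):

```python
import json

# json.JSONDecodeError carries the failure position, so we can report
# *where* the line broke instead of just that it broke.
bad_line = '{"messages": [{"role": "user" "content": "Hi"}]}'  # missing comma

try:
    json.loads(bad_line)
except json.JSONDecodeError as e:
    report = f"[ERROR] {e.msg} at line {e.lineno}, column {e.colno}"
    print(report)
```

Swapping this into the `except` branch of Step 1 turns a vague failure into a one-line fix for whoever wrote the data.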

Why "Warnings" Matter

Notice that we separate Errors (which break the training job) from Warnings (which merely degrade the resulting model).

  • A line without an Assistant message is an ERROR. The trainer literally has nothing to compute a gradient against.
  • A line that is too long is a WARNING. Training will still run, but the trainer may truncate the text and drop the final answer.
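To see why over-length is only a warning, consider a toy sketch of truncation. The character budget here is an arbitrary stand-in for a real token limit, chosen only to show the effect: when a trainer clips an over-long example, the assistant turn at the end is the first thing to be lost.

```python
# Toy illustration: a hypothetical trainer that clips inputs to a fixed
# character budget. The numbers are arbitrary.
budget = 50
conversation = "USER: " + "x" * 60 + " ASSISTANT: the answer"
truncated = conversation[:budget]

# The training target (the assistant's reply) falls outside the budget.
print("ASSISTANT" in truncated)  # → False
```

The job completes, but this example teaches the model nothing, which is exactly the "works, but bad" territory warnings exist for.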

Continuous Integration (CI) for AI Data

If you are working on a professional team, you should add this validator to your GitHub Actions or CI pipeline. Every time a curator adds a "Golden Example" to the data/ folder, the script runs automatically. This prevents malformed data from contaminating your master training set.
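One lightweight way to wire this up is a small entrypoint that converts the validator's boolean result into an exit code, since CI systems treat any non-zero exit as a failed check. This is a sketch: `validate_fine_tuning_jsonl` is stubbed so the example is self-contained, and the `data/train.jsonl` default path is an assumption.

```python
import sys

def validate_fine_tuning_jsonl(path):
    # Stand-in for the real validator defined earlier in this lesson.
    return True

def main(path):
    ok = validate_fine_tuning_jsonl(path)
    # A non-zero exit code makes the CI step (e.g. a GitHub Actions job)
    # fail, blocking the merge until the data is fixed.
    return 0 if ok else 1

if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "data/train.jsonl"))
```

In a workflow file, the CI step then simply runs `python validate.py data/train.jsonl` and the pipeline does the rest.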


Summary and Key Takeaways

  • Validation is Mandatory: Never start a training job without a validation script.
  • Schema Consistency: Ensure every line follows the exact roles and keys required by your provider.
  • Logic over Syntax: Check that the model actually has something to learn (an assistant response).
  • Automation: Integrate validation into your development workflow to catch errors early.

You have completed Module 6! You are now a master of Data Formatting and Integrity.

In Module 7, we will move to the "Edge" of the model: Tokenization and Input Preparation, where we look at how text is converted into the numbers the GPU understands.


Reflection Exercise

  1. Why does the script check for "Empty Content"? (Hint: What happens to a mathematical 'Loss function' if the target response is nothing?)
  2. If your model supports a 128k context window, would you change the "Heuristic" check in Step 5?
