
Formatting for OpenAI vs. Bedrock vs. Vertex AI: The Provider Blueprint
You have your "Golden Dataset." You’ve chosen your "Conversation Format." Now comes the final hurdle: Deployment.
Each major cloud provider has its own subtle variations of the JSONL (JSON Lines) format. If you send a file to OpenAI that was formatted for AWS Bedrock, the training job will fail with a "Schema Validation Error." These errors are frustrating because they often don't tell you where the problem is, only that your file is "Invalid."
In this lesson, we will provide the exact specifications for the "Big Three" providers so you can submit your training jobs with confidence.
1. OpenAI (The ChatML Standard)
OpenAI uses a very strict ChatML structure inside a .jsonl file. Each line must be a single JSON object containing a messages key.
The Specification:
{"messages": [{"role": "system", "content": "You are a biology tutor."}, {"role": "user", "content": "What is a cell?"}, {"role": "assistant", "content": "A cell is the basic building block of all living things."}]}
{"messages": [{"role": "system", "content": "You are a biology tutor."}, {"role": "user", "content": "What is DNA?"}, {"role": "assistant", "content": "DNA is the molecule that carries genetic instructions."}]}
Key Rules:
- System Message: The `system` message is highly recommended to anchor behavior.
- No Trailing Newlines: Ensure there are no empty lines at the end of the file.
- UTF-8 Encoding: The file must be saved in UTF-8.
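These rules are easy to check mechanically before you upload. Here is a minimal validator sketch, assuming your file is named train_openai.jsonl (a hypothetical name):
import json

VALID_ROLES = {"system", "user", "assistant"}

# Open with explicit UTF-8 encoding, per the rules above.
with open("train_openai.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        if not line.strip():
            raise ValueError(f"Line {i} is empty; remove blank lines.")
        record = json.loads(line)  # raises JSONDecodeError on malformed JSON
        for m in record["messages"]:
            assert m["role"] in VALID_ROLES, f"Line {i}: bad role {m['role']!r}"
            assert isinstance(m["content"], str), f"Line {i}: content must be a string"
print("All lines look valid.")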
2. AWS Bedrock (The Converse API Standard)
AWS Bedrock (specifically for custom model jobs with models like Llama 3 or Mistral) uses a format that echoes its Converse API conventions. Fine-tuning jobs typically expect each record to be split into a prompt and a completion, with any system instructions folded into the prompt.
The Specification (Llama 3 on Bedrock):
{"prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", "completion": "Hi there! How can I help?<|eot_id|>"}
Key Rules:
- Explicit Tokens: Unlike OpenAI, Bedrock occasionally requires you to include the model's special tokens (like `<|eot_id|>`) directly in the JSONL strings for certain "Custom Model" jobs.
- S3 Upload: You must upload this file to an S3 bucket before starting the job.
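Hand-typing those special tokens is error-prone, so it helps to assemble them programmatically. Here is a minimal sketch that reproduces the Llama 3 template shown above; verify the exact token layout against your model's documentation before relying on it:
import json

# Llama 3 special tokens, exactly as they appear in the example above.
BEGIN = "<|begin_of_text|>"
START = "<|start_header_id|>"
END = "<|end_header_id|>"
EOT = "<|eot_id|>"

def to_bedrock_llama3(messages):
    """Build a {'prompt': ..., 'completion': ...} record for a Llama 3
    custom-model job; assumes the final message is the assistant's reply."""
    prompt = BEGIN
    for m in messages[:-1]:
        prompt += f"{START}{m['role']}{END}\n\n{m['content']}{EOT}"
    prompt += f"{START}assistant{END}\n\n"  # leave the assistant turn open
    completion = messages[-1]["content"] + EOT
    return json.dumps({"prompt": prompt, "completion": completion})

record = to_bedrock_llama3([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help?"},
])
print(record)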
3. Google Vertex AI (The PaLM/Gemini Standard)
Google’s format is similar to OpenAI’s messages but often uses slightly different key names or nesting if you are using their "Tuning Pipeline."
The Specification:
{"contents": [{"role": "user", "parts": [{"text": "What is the capital of France?"}]}, {"role": "model", "parts": [{"text": "The capital of France is Paris."}]}]}
Key Rules:
- Role Names: Google uses `model` instead of `assistant`.
- Parts Structure: The content must be wrapped in a `parts` list, reflecting Google's "multimodal first" architecture (where a part could be text, an image, or a video).
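To see why `parts` is a list rather than a plain string, consider a single user turn that pairs a question with an image. The `inlineData`/`mimeType` field names below follow Google's REST-style JSON for the Gemini API and are an assumption to verify against the current Vertex AI docs:
import base64
import json

# One user turn with two parts: text plus an image.
# NOTE: "inlineData" / "mimeType" are assumed field names from Google's
# REST-style JSON; verify them against the current Vertex AI documentation.
with open("cell_diagram.png", "rb") as f:  # hypothetical local image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

turn = {
    "role": "user",
    "parts": [
        {"text": "What organelle is highlighted in this diagram?"},
        {"inlineData": {"mimeType": "image/png", "data": image_b64}},
    ],
}
print(json.dumps({"contents": [turn]})[:100] + " ...")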
Visualizing the Provider Map
graph TD
A["Your Clean Data (List of Dicts)"] --> B["Provider Formatter"]
B --> C["OpenAI: 'messages' -> 'role'"]
B --> D["AWS Bedrock: 'prompt' / 'completion'"]
B --> E["Google Vertex: 'contents' -> 'parts'"]
C --> F["Upload to OpenAI Storage"]
D --> G["Upload to S3"]
E --> H["Upload to GCS Bucket"]
Implementation: The "Universal Converter" Script
Here is a Python utility that takes a generic message list and formats it for any of the three providers.
import json

def format_for_provider(messages, provider="openai"):
    """Convert a generic message list into a single JSONL line
    for the chosen provider."""
    if provider == "openai":
        # OpenAI accepts the message list as-is under a "messages" key.
        return json.dumps({"messages": messages})
    elif provider == "google":
        # Vertex AI renames "assistant" to "model" and wraps content in "parts".
        google_contents = []
        for m in messages:
            role = "model" if m["role"] == "assistant" else m["role"]
            if role != "system":  # Google often ignores system roles in Tuning
                google_contents.append({"role": role, "parts": [{"text": m["content"]}]})
        return json.dumps({"contents": google_contents})
    elif provider == "bedrock":
        # Simplified Bedrock format: fold non-assistant turns into the prompt,
        # use the assistant turn as the completion. Real jobs may also need
        # model-specific special tokens (see the Llama 3 sketch above).
        prompt = ""
        completion = ""
        for m in messages:
            if m["role"] == "assistant":
                completion = m["content"]
            else:
                prompt += f"{m['role'].upper()}: {m['content']}\n"
        return json.dumps({"prompt": prompt, "completion": completion})
    raise ValueError(f"Unknown provider: {provider}")

# Usage
msg_list = [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]
print(f"OpenAI: {format_for_provider(msg_list, 'openai')}")
print(f"Google: {format_for_provider(msg_list, 'google')}")
print(f"Bedrock: {format_for_provider(msg_list, 'bedrock')}")
Summary and Key Takeaways
- OpenAI uses `messages` with `role` and `content`.
- AWS Bedrock often prefers a `prompt`/`completion` split in JSONL.
- Google Vertex AI uses `contents` and `parts` to support multimodal inputs.
- Validation: Always run a small script to validate every line of your JSONL before uploading (a minimal sketch follows this list). A single misplaced bracket will crash a $500 training job.
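As a starting point for that validation script, the sketch below parses every line and reports the first malformed one, which is precisely the location information the provider's error message withholds:
import json
import sys

def check_jsonl(path):
    """Parse every line of a JSONL file; report the first bad line, if any."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                print(f"{path}: line {i} is not valid JSON ({e.msg})")
                return False
    return True

if __name__ == "__main__":
    # Usage: python check_jsonl.py train.jsonl
    sys.exit(0 if check_jsonl(sys.argv[1]) else 1)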
In the next lesson, we will look at exactly how to perform that validation: Converting Raw Data to JSONL and the automated scripts to keep your data clean.
Reflection Exercise
- Why does Google use a `parts` array instead of just a string? (Hint: What if the user sends an image AND a text question at the same time?)
- If you are training on Bedrock, why is it important to know the specific "Special Tokens" of your model (like Llama 3's `<|eot_id|>`)?
SEO Metadata & Keywords
Focus Keywords: OpenAI fine-tuning JSONL, AWS Bedrock fine-tuning format, Google Vertex AI tuning JSONL, JSONL vs JSON, cloud AI training specifications. Meta Description: Master the precise file formats for OpenAI, AWS, and Google. Learn how to structure your JSONL files for each major cloud provider and avoid costly training rejections.