
Formatting for OpenAI vs. Bedrock vs. Vertex AI: The Provider Blueprint
You have your "Golden Dataset." You’ve chosen your "Conversation Format." Now comes the final hurdle: Deployment.
Each major cloud provider has its own subtle variations of the JSONL (JSON Lines) format. If you send a file to OpenAI that was formatted for AWS Bedrock, the training job will fail with a "Schema Validation Error." These errors are frustrating because they often don't tell you where the problem is, only that your file is "Invalid."
In this lesson, we will provide the exact specifications for the "Big Three" providers so you can submit your training jobs with confidence.
1. OpenAI (The ChatML Standard)
OpenAI uses a very strict ChatML structure inside a .jsonl file. Each line must be a single JSON object containing a messages key.
The Specification:
{"messages": [{"role": "system", "content": "You are a biology tutor."}, {"role": "user", "content": "What is a cell?"}, {"role": "assistant", "content": "A cell is the basic building block of all living things."}]}
{"messages": [{"role": "system", "content": "You are a biology tutor."}, {"role": "user", "content": "What is DNA?"}, {"role": "assistant", "content": "DNA is the molecule that carries genetic instructions."}]}
Key Rules:
- System Message: The `system` message is highly recommended to anchor behavior.
- No Trailing Newlines: Ensure there are no empty lines at the end of the file.
- UTF-8 Encoding: The file must be saved in UTF-8.
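These rules are easy to check mechanically before you upload. Here is a minimal validator sketch, assuming your file is named train_openai.jsonl (a hypothetical name):
import json

VALID_ROLES = {"system", "user", "assistant"}

# Open with explicit UTF-8 encoding, per the rules above.
with open("train_openai.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        if not line.strip():
            raise ValueError(f"Line {i} is empty; remove blank lines.")
        record = json.loads(line)  # raises JSONDecodeError on malformed JSON
        for m in record["messages"]:
            assert m["role"] in VALID_ROLES, f"Line {i}: bad role {m['role']!r}"
            assert isinstance(m["content"], str), f"Line {i}: content must be a string"
print("All lines look valid.")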
2. AWS Bedrock (The Converse API Standard)
AWS Bedrock (specifically for custom model jobs with models like Llama 3 or Mistral) uses a format that echoes its Converse API conventions. Fine-tuning jobs typically expect each record to be split into a prompt and a completion, with any system instructions folded into the prompt.
The Specification (Llama 3 on Bedrock):
{"prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", "completion": "Hi there! How can I help?<|eot_id|>"}
Key Rules:
- Explicit Tokens: Unlike OpenAI, Bedrock occasionally requires you to include the model's special tokens (like `<|eot_id|>`) directly in the JSONL strings for certain "Custom Model" jobs.
- S3 Upload: You must upload this file to an S3 bucket before starting the job.
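Hand-typing those special tokens is error-prone, so it helps to assemble them programmatically. Here is a minimal sketch that reproduces the Llama 3 template shown above; verify the exact token layout against your model's documentation before relying on it:
import json

# Llama 3 special tokens, exactly as they appear in the example above.
BEGIN = "<|begin_of_text|>"
START = "<|start_header_id|>"
END = "<|end_header_id|>"
EOT = "<|eot_id|>"

def to_bedrock_llama3(messages):
    """Build a {'prompt': ..., 'completion': ...} record for a Llama 3
    custom-model job; assumes the final message is the assistant's reply."""
    prompt = BEGIN
    for m in messages[:-1]:
        prompt += f"{START}{m['role']}{END}\n\n{m['content']}{EOT}"
    prompt += f"{START}assistant{END}\n\n"  # leave the assistant turn open
    completion = messages[-1]["content"] + EOT
    return json.dumps({"prompt": prompt, "completion": completion})

record = to_bedrock_llama3([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help?"},
])
print(record)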
3. Google Vertex AI (The PaLM/Gemini Standard)
Google’s format is similar to OpenAI’s messages but often uses slightly different key names or nesting if you are using their "Tuning Pipeline."
The Specification:
{"contents": [{"role": "user", "parts": [{"text": "What is the capital of France?"}]}, {"role": "model", "parts": [{"text": "The capital of France is Paris."}]}]}
Key Rules:
- Role Names: Google uses `model` instead of `assistant`.
- Parts Structure: The content must be wrapped in a `parts` list, reflecting Google's "multimodal first" architecture (where a part could be text, an image, or a video).
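To see why `parts` is a list rather than a plain string, consider a single user turn that pairs a question with an image. The `inlineData`/`mimeType` field names below follow Google's REST-style JSON for the Gemini API and are an assumption to verify against the current Vertex AI docs:
import base64
import json

# One user turn with two parts: text plus an image.
# NOTE: "inlineData" / "mimeType" are assumed field names from Google's
# REST-style JSON; verify them against the current Vertex AI documentation.
with open("cell_diagram.png", "rb") as f:  # hypothetical local image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

turn = {
    "role": "user",
    "parts": [
        {"text": "What organelle is highlighted in this diagram?"},
        {"inlineData": {"mimeType": "image/png", "data": image_b64}},
    ],
}
print(json.dumps({"contents": [turn]})[:100] + " ...")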
Visualizing the Provider Map
graph TD
A["Your Clean Data (List of Dicts)"] --> B["Provider Formatter"]
B --> C["OpenAI: 'messages' -> 'role'"]
B --> D["AWS Bedrock: 'prompt' / 'completion'"]
B --> E["Google Vertex: 'contents' -> 'parts'"]
C --> F["Upload to OpenAI Storage"]
D --> G["Upload to S3"]
E --> H["Upload to GCS Bucket"]
Implementation: The "Universal Converter" Script
Here is a Python utility that takes a generic message list and formats it for any of the three providers.
import json

def format_for_provider(messages, provider="openai"):
    """Convert a generic message list into a single JSONL line
    for the chosen provider."""
    if provider == "openai":
        # OpenAI accepts the message list as-is under a "messages" key.
        return json.dumps({"messages": messages})
    elif provider == "google":
        # Vertex AI renames "assistant" to "model" and wraps content in "parts".
        google_contents = []
        for m in messages:
            role = "model" if m["role"] == "assistant" else m["role"]
            if role != "system":  # Google often ignores system roles in Tuning
                google_contents.append({"role": role, "parts": [{"text": m["content"]}]})
        return json.dumps({"contents": google_contents})
    elif provider == "bedrock":
        # Simplified Bedrock format: fold non-assistant turns into the prompt,
        # use the assistant turn as the completion. Real jobs may also need
        # model-specific special tokens (see the Llama 3 sketch above).
        prompt = ""
        completion = ""
        for m in messages:
            if m["role"] == "assistant":
                completion = m["content"]
            else:
                prompt += f"{m['role'].upper()}: {m['content']}\n"
        return json.dumps({"prompt": prompt, "completion": completion})
    raise ValueError(f"Unknown provider: {provider}")

# Usage
msg_list = [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]
print(f"OpenAI: {format_for_provider(msg_list, 'openai')}")
print(f"Google: {format_for_provider(msg_list, 'google')}")
print(f"Bedrock: {format_for_provider(msg_list, 'bedrock')}")
Summary and Key Takeaways
- OpenAI uses `messages` with `role` and `content`.
- AWS Bedrock often prefers a `prompt`/`completion` split in JSONL.
- Google Vertex AI uses `contents` and `parts` to support multimodal inputs.
- Validation: Always run a small script to validate every line of your JSONL before uploading (a minimal sketch follows this list). A single misplaced bracket will crash a $500 training job.
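As a starting point for that validation script, the sketch below parses every line and reports the first malformed one, which is precisely the location information the provider's error message withholds:
import json
import sys

def check_jsonl(path):
    """Parse every line of a JSONL file; report the first bad line, if any."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                print(f"{path}: line {i} is not valid JSON ({e.msg})")
                return False
    return True

if __name__ == "__main__":
    # Usage: python check_jsonl.py train.jsonl
    sys.exit(0 if check_jsonl(sys.argv[1]) else 1)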
In the next lesson, we will look at exactly how to perform that validation: Converting Raw Data to JSONL and the automated scripts to keep your data clean.
Reflection Exercise
- Why does Google use a `parts` array instead of just a string? (Hint: What if the user sends an image AND a text question at the same time?)
- If you are training on Bedrock, why is it important to know the specific "Special Tokens" of your model (like Llama 3's `<|eot_id|>`)?
SEO Metadata & Keywords
Focus Keywords: OpenAI fine-tuning JSONL, AWS Bedrock fine-tuning format, Google Vertex AI tuning JSONL, JSONL vs JSON, cloud AI training specifications. Meta Description: Master the precise file formats for OpenAI, AWS, and Google. Learn how to structure your JSONL files for each major cloud provider and avoid costly training rejections.