Conversation Formats: The Grammar of Datasets

In the previous module, we focused on the content of the data. Now we turn to its structure.

When you feed data to a training engine, you can't just provide a block of text. You need to tell the engine: "This part is what the human said, and this part is what the model should respond." To do this, the community has settled on two primary standards: ChatML and ShareGPT.

Choosing the right format is critical because your model learns the specific delimiters (special tokens such as <|im_start|>) that the format is rendered into. If you train with ChatML delimiters but run inference with a different chat template, the model sees markers it was never trained on and may ignore role boundaries or fail to stop generating.

In this lesson, we will explore the nuances of these two formats and when to use each.

1. ChatML (Chat Markup Language)

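Introduced by OpenAI, ChatML is a message-based format that makes the distinction between system, user, and assistant turns explicit. Strictly speaking, ChatML names the text markup built on <|im_start|> and <|im_end|>; in dataset files it is usually stored as a JSON list of messages. It is a common default for training chat assistants.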

The Structure:

It uses a sequence of dictionaries, where each dictionary has a role and content.

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help?"}
  ]
}

Why it works:

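ChatML is extremely flexible. It allows for system prompts (which set the rules) and multi-turn conversations. Many hosted fine-tuning services (OpenAI and, in similar form, AWS Bedrock and Google Vertex) accept a role-based message list, though exact field names vary by provider and model, so check the provider's schema before uploading.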
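A quick structural check catches most formatting mistakes before a training job starts. The sketch below is a minimal validator for one messages-style record. The function name `validate_messages` and the specific rules (only three roles, system prompt first, assistant reply last) are assumptions for illustration; some providers allow additional roles or orderings.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_messages(record):
    """Return a list of problems in a {'messages': [...]} record; empty means OK."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: must be a dictionary")
            continue
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content must be a string")
        # Assumed rule: a system prompt, if present, opens the conversation.
        if role == "system" and i != 0:
            problems.append(f"message {i}: system prompt should be the first message")

    # Assumed rule: the record ends with the assistant reply the model learns from.
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message should be the assistant reply to learn from")
    return problems

example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi there! How can I help?"},
    ]
}
print(validate_messages(example))  # prints []
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a large file in a single pass.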

2. ShareGPT Format

ShareGPT is a flatter, Vicuna-style format that became popular in the open-source community (for example, in datasets used to fine-tune Llama and Mistral models, and in the Vicuna project). It focuses on "conversations" rather than "messages."

The Structure:

It uses a "conversations" list in which each entry has from and value keys.

{
  "conversations": [
    {"from": "human", "value": "Hello!"},
    {"from": "gpt", "value": "Hi there! How can I help?"}
  ]
}

Why it works:

Many open-source datasets (like the LMSYS data) are already in this format. Frameworks like FastChat and Axolotl use ShareGPT as their primary ingestion format.

Technical Comparison: ChatML vs. ShareGPT

Feature	ChatML	ShareGPT
Standard-bearer	OpenAI / Enterprise AI	Hugging Face / Open Source
Key identifiers	`role`, `content`	`from`, `value`
System Prompt Support	Explicit	Often requires custom handling
Multi-turn	Native	Native
Best used for...	OpenAI fine-tuning, AWS Bedrock	Llama 3, Mistral, Local fine-tuning
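Because the two formats differ mainly in their key names, a record can usually be classified by inspecting its keys. The following is a rough sketch of that idea; `detect_format` is a hypothetical helper, and real datasets may use other shapes (for example, instruction/response pairs) that fall through to "unknown".

```python
def detect_format(record):
    """Guess the conversation format of one dataset record from its keys."""
    signatures = {
        "chatml": ("messages", {"role", "content"}),
        "sharegpt": ("conversations", {"from", "value"}),
    }
    for name, (list_key, turn_keys) in signatures.items():
        turns = record.get(list_key)
        # Every turn must be a dictionary carrying the expected key pair.
        if isinstance(turns, list) and turns and all(
            isinstance(turn, dict) and turn_keys.issubset(turn) for turn in turns
        ):
            return name
    return "unknown"

print(detect_format({"messages": [{"role": "user", "content": "Hello!"}]}))
print(detect_format({"conversations": [{"from": "human", "value": "Hello!"}]}))
print(detect_format({"instruction": "Say hello.", "response": "Hello!"}))
```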

Visualizing the Formatting Shift

graph TD
    A["Raw Chat Log"] -->|Conversion Script| B["Internal JSONL"]
    
    subgraph "Target 1: ChatML"
    B --> B1["{ 'messages': [...] }"]
    end
    
    subgraph "Target 2: ShareGPT"
    B --> B2["{ 'conversations': [...] }"]
    end
    
    B1 --> C["OpenAI / Bedrock Training"]
    B2 --> D["Axolotl / Unsloth Training"]
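The last step of the diagram, writing one conversation in each target shape, could be sketched as follows. The internal representation used here (a list of speaker/text pairs) and the helper names are assumptions for illustration only; your own pipeline may store turns differently.

```python
import json

# Hypothetical internal representation: a list of (speaker, text) pairs.
turns = [("user", "Hello!"), ("assistant", "Hi there! How can I help?")]

# ShareGPT uses different speaker labels than the role names above.
SHAREGPT_NAMES = {"user": "human", "assistant": "gpt", "system": "system"}

def to_chatml_line(turns):
    """Serialize the turns as one JSONL line in the 'messages' shape."""
    record = {"messages": [{"role": s, "content": t} for s, t in turns]}
    return json.dumps(record, ensure_ascii=False)

def to_sharegpt_line(turns):
    """Serialize the turns as one JSONL line in the 'conversations' shape."""
    record = {
        "conversations": [
            {"from": SHAREGPT_NAMES[s], "value": t} for s, t in turns
        ]
    }
    return json.dumps(record, ensure_ascii=False)

chatml_line = to_chatml_line(turns)
sharegpt_line = to_sharegpt_line(turns)
print(chatml_line)
print(sharegpt_line)
```

Each function returns a single line without a trailing newline, so a JSONL file is produced by writing one such line per conversation.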

Implementation: Converting Between Formats

As an engineer, you will often need to convert data. Here is a Python utility to convert a ShareGPT-style list into a ChatML-style list.

def sharegpt_to_chatml(sharegpt_data):
    """
    Converts ShareGPT format to ChatML format.
    """
    chat_ml_messages = []
    
    role_map = {
        "human": "user",
        "gpt": "assistant",
        "system": "system"
    }
    
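    for entry in sharegpt_data["conversations"]:
        source = entry["from"]
        if source not in role_map:
            # Fail loudly instead of silently relabeling an unknown speaker as the user
            raise ValueError(f"Unknown ShareGPT role: {source!r}")
        chat_ml_messages.append({
            "role": role_map[source],
            "content": entry["value"]
        })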
        
    return {"messages": chat_ml_messages}

# Test Data
old_data = {
    "conversations": [
        {"from": "human", "value": "What is 2+2?"},
        {"from": "gpt", "value": "It is 4."}
    ]
}

new_data = sharegpt_to_chatml(old_data)
print(new_data)
# Output: {'messages': [{'role': 'user', 'content': 'What is 2+2?'}, {'role': 'assistant', 'content': 'It is 4.'}]}
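The opposite direction follows the same pattern with the role map inverted. This is a minimal sketch; `chatml_to_sharegpt` is a name chosen here for symmetry, and it assumes the system prompt can be stored as an ordinary "system" turn, which some tools handle differently.

```python
def chatml_to_sharegpt(chatml_data):
    """Convert a {'messages': [...]} record into a {'conversations': [...]} record."""
    role_map = {
        "user": "human",
        "assistant": "gpt",
        "system": "system",  # assumption: system prompt kept as a regular turn
    }

    conversations = []
    for message in chatml_data["messages"]:
        role = message["role"]
        if role not in role_map:
            # Unknown roles (e.g. tool calls) need an explicit decision, not a guess.
            raise ValueError(f"Unsupported role: {role!r}")
        conversations.append({"from": role_map[role], "value": message["content"]})

    return {"conversations": conversations}

chatml_record = {
    "messages": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "It is 4."},
    ]
}
sharegpt_record = chatml_to_sharegpt(chatml_record)
print(sharegpt_record)
# Output: {'conversations': [{'from': 'human', 'value': 'What is 2+2?'}, {'from': 'gpt', 'value': 'It is 4.'}]}
```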

The "Special Token" Secret

Before training, the framework renders these JSON records into a single text string in which each turn is delimited by special tokens.

ChatML often uses tokens like <|im_start|> and <|im_end|>.
Llama 3 uses <|start_header_id|> and <|end_header_id|> around the role name, and <|eot_id|> to end each turn.

CRITICAL WARNING: Do not hard-code these special tokens into your data unless you are an expert. Always provide the raw JSON to the training framework and let the tokenizer's chat template apply the special formatting. If you get this wrong, the markers may be duplicated or tokenized as ordinary text, and your model may fail to stop generating or may confuse roles.
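To see what the framework produces from your JSON, here is an illustrative rendering of a message list into ChatML-style text. It is a sketch for understanding only, not something to run over your training data: `render_chatml` is a hypothetical helper, and the exact tokens and whitespace depend on the model's own chat template (in Hugging Face Transformers, for instance, this step is normally done by `tokenizer.apply_chat_template`).

```python
def render_chatml(messages, add_generation_prompt=False):
    """Illustrative only: flatten a message list into ChatML-style text."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # At inference time, an open assistant header cues the model to reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

text = render_chatml([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
])
print(text)
```

The <|im_end|> marker after the assistant turn is what teaches the model where a reply ends, which is why a mismatched or missing template can leave a model unable to stop.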

Summary and Key Takeaways

ChatML is the standard for commercial APIs (role, content).
ShareGPT is the standard for open-source datasets (from, value).
Interoperability: You will frequently need to convert between these formats.
Don't Touch Tokens: provide clean JSON; let the framework's tokenizer add the <|im_start|> markers.

In the next lesson, we will look at Instruction Tuning Templates, exploring the older (but still relevant) "Alpaca" format vs. the modern "User/Assistant" split.

Reflection Exercise

Why is having a "System" role better than just putting the instructions at the top of the "User" message? (Hint: Think about which messages get 'masked' during loss calculation).
Look at a JSONL file on Hugging Face (e.g., Dolly-15k). Which format is it using?

