JSON vs. YAML vs. Markdown: The Token Benchmarks

Master the data-format economics of AI. Learn which format uses the fewest tokens for your specific data structure.

Not all "Structures" are created equal. To a human, a JSON object and a YAML list carry the same information. To an LLM, they can produce a significantly different Token Bill.

In this lesson, we perform a deep-dive into the "Syntax Density" of data formats. We will learn why YAML is often the "Gold Standard" for token-conscious developers, why JSON is a "Safe Default," and how Markdown Tables can actually be the most expensive format of all.


1. The Token Payload Comparison

Let's represent a list of two users: John Doe (age 30) and Jane Smith (age 25). The token counts below are approximate; exact numbers depend on the tokenizer your model uses.

JSON (50 Tokens)

{"users": [{"name": "John Doe", "age": 30}, {"name": "Jane Smith", "age": 25}]}
  • Waste: Quotes, braces, and colons are heavy.

YAML (35 Tokens)

users:
  - name: John Doe
    age: 30
  - name: Jane Smith
    age: 25
  • Benefit: No quotes needed for many keys. No closing braces. High density.

Markdown Table (70 Tokens)

| Name | Age |
| --- | --- |
| John Doe | 30 |
| Jane Smith | 25 |
  • Waste: Pipes and dashes for visual alignment are Pure Token Bloat.
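To check a comparison like this on your own data, a small harness helps. The sketch below serializes the same two users in all three formats and counts pieces with a deliberately crude regex (words plus individual punctuation marks). That regex is an assumption standing in for a real tokenizer, so the absolute numbers, and possibly the ranking, will differ once you plug in your model's tokenizer. The helpers `to_yaml_like` and `to_markdown_table` are hypothetical and only handle this flat example.

```python
import json
import re

users = [{"name": "John Doe", "age": 30}, {"name": "Jane Smith", "age": 25}]

def crude_token_count(text: str) -> int:
    """Count words and single punctuation marks. NOT a real tokenizer."""
    return len(re.findall(r"\w+|[^\w\s]", text))

def to_yaml_like(rows):
    """Hand-rolled YAML for this flat example only (no escaping, no nesting)."""
    lines = ["users:"]
    for row in rows:
        items = list(row.items())
        first_key, first_val = items[0]
        lines.append(f"  - {first_key}: {first_val}")
        for key, val in items[1:]:
            lines.append(f"    {key}: {val}")
    return "\n".join(lines)

def to_markdown_table(rows):
    """Render flat dicts as a pipe table with a header and separator row."""
    headers = list(rows[0].keys())
    lines = ["| " + " | ".join(h.title() for h in headers) + " |"]
    lines.append("| " + " | ".join("---" for _ in headers) + " |")
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)

formats = {
    "json": json.dumps({"users": users}),
    "yaml": to_yaml_like(users),
    "markdown": to_markdown_table(users),
}
counts = {name: crude_token_count(text) for name, text in formats.items()}
for name, count in counts.items():
    print(f"{name}: {count} crude pieces, {len(formats[name])} characters")
```

To get billing-accurate numbers, replace `crude_token_count` with a call to the tokenizer that matches your model and rerun the same loop.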

2. Choosing Based on Data Complexity

| Format | Best For | Token Rank |
| --- | --- | --- |
| YAML | Complex nested objects. | 1 (Most Efficient) |
| JSON | Native machine integration. | 2 |
| CSV | Large flat lists of numbers. | 0 (Extreme Density) |
| Markdown | Human-only readability. | 3 (Most Expensive) |
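CSV is listed above but not shown, so here is a minimal sketch of the same two users as CSV, using Python's standard `csv` module. The density comes from stating each key once in the header row instead of repeating it in every record. The helper name `to_csv` is hypothetical, and the sketch assumes every record has the same flat set of keys.

```python
import csv
import io

users = [{"name": "John Doe", "age": 30}, {"name": "Jane Smith", "age": 25}]

def to_csv(rows):
    """Serialize flat dicts as CSV: one header row, then one line per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = to_csv(users)
print(csv_text)
```

This only works for flat data. Once records contain nested objects or lists, CSV has no natural representation and YAML or JSON is the better fit.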

3. Implementation: Using YAML (Python)

You can instruct your LLM to output YAML and then parse it using PyYAML.

Python Code: YAML Extraction

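import yaml  # PyYAML: pip install pyyaml

prompt = (
    "Extract the data. Output ONLY raw YAML. No markdown blocks. "
    "Format:\n---\nitems:\n- name: string"
)

# call_llm and raw_text are placeholders: supply your own LLM client
# function and the source text you want to extract from.
response = call_llm(prompt, raw_text)

# Parse the compact YAML; safe_load avoids constructing arbitrary objects
try:
    data = yaml.safe_load(response)
except yaml.YAMLError as exc:
    raise ValueError(f"Model returned invalid YAML: {exc}") from exc
print(data['items'][0]['name'])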

Savings: For a list of 100 items, YAML can save you roughly 1,000 to 2,000 tokens compared to JSON, depending on how many fields each item has and which tokenizer is used.
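Even with a "No markdown blocks" instruction, models sometimes wrap their output in a code fence, which can break or corrupt the YAML parse. One defensive option is to strip a single surrounding fence before calling the parser. The helper below, `strip_code_fence`, is a hypothetical sketch using only the standard library; it assumes at most one fence wrapping the whole response.

```python
import re

# Built programmatically so the fence marker never appears literally here.
FENCE = "`" * 3

def strip_code_fence(text: str) -> str:
    """Return text without one surrounding Markdown code fence, if present."""
    pattern = rf"^\s*{FENCE}[\w-]*\s*\n(.*?)\n?{FENCE}\s*$"
    match = re.match(pattern, text, flags=re.DOTALL)
    # Fall back to the trimmed input when no fence wraps the response.
    return match.group(1) if match else text.strip()

wrapped = f"{FENCE}yaml\nitems:\n- name: Widget\n{FENCE}\n"
print(strip_code_fence(wrapped))
```

You would pass the cleaned string to `yaml.safe_load` in place of the raw response.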


4. The "Key Shortening" Strategy (The 'DSL' approach)

Regardless of the format, you should minify your Keys.

  • Original: {"transaction_id": 987, "total_amount_usd": 50.00}
  • Minified: {"tid": 987, "amt": 50.00}

By using a "Mapping" in your Python code, you keep descriptive keys in your software while sending the fewest possible tokens through the model's context window.
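A minimal sketch of such a mapping, assuming the two keys from the example above. The names `LONG_TO_SHORT`, `shorten_keys`, and `expand_keys` are hypothetical; unknown keys pass through unchanged.

```python
# Hypothetical mapping between descriptive keys and short wire keys.
LONG_TO_SHORT = {"transaction_id": "tid", "total_amount_usd": "amt"}
SHORT_TO_LONG = {short: long_key for long_key, short in LONG_TO_SHORT.items()}

def shorten_keys(record: dict) -> dict:
    """Swap descriptive keys for short ones before sending data to the model."""
    return {LONG_TO_SHORT.get(key, key): value for key, value in record.items()}

def expand_keys(record: dict) -> dict:
    """Restore descriptive keys on data that comes back from the model."""
    return {SHORT_TO_LONG.get(key, key): value for key, value in record.items()}

original = {"transaction_id": 987, "total_amount_usd": 50.00}
compact = shorten_keys(original)
restored = expand_keys(compact)
print(compact)
print(restored)
```

Keeping both directions in one table avoids drift between what you send and what you decode. Short keys must stay unique, or the reverse mapping silently loses entries.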


5. Token Efficiency and "Streaming"

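JSON is harder to stream. A standard parser such as json.loads cannot parse a document until the final } is generated, so incremental processing needs a streaming parser or a line-delimited variant. YAML lists and Markdown tables are more "Stream-Friendly" because their records are separated by line breaks rather than closed by a trailing brace: you can begin processing the first item while the model is still generating the second. While this doesn't save tokens, it improves Perceived Latency (UI Efficiency).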
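The idea of processing one record while the next is still being generated can be sketched with a line buffer. The example below assumes the model's output arrives as arbitrary text chunks (simulated here with a plain list) and that each record is one Markdown table row. The function names are hypothetical, and a real client would iterate over its own streaming response object instead.

```python
from typing import Iterable, Iterator

def iter_complete_lines(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each line as soon as its newline arrives, buffering partial text."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield line
    if buffer.strip():
        yield buffer  # final line that arrived without a trailing newline

def parse_table_row(line: str) -> list[str]:
    """Split a Markdown table row like '| John Doe | 30 |' into cells."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]

# Simulated stream: chunk boundaries fall in the middle of rows.
simulated_stream = ["| John Doe | 3", "0 |\n| Jane Sm", "ith | 25 |\n"]

rows = []
for line in iter_complete_lines(simulated_stream):
    row = parse_table_row(line)  # available before the stream has finished
    rows.append(row)
    print(row)
```

The first row is handed to the caller as soon as its newline arrives in the second chunk, before the second row is complete.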


6. Summary and Key Takeaways

  1. Prefer YAML for Depth: It is 20-30% more efficient than JSON for complex data.
  2. Avoid Markdown for Data: Pipes and dashes are expensive visual noise.
  3. Shorten Keys: Use id instead of document_identification_number.
  4. CSV for Flat Lists: If you only need numbers/names, CSV is the ultimate winner.

In the next lesson, Enforcing Schema Constraints (Pydantic), we look at how to ensure the model actually follows these efficient rules.


Exercise: The Format Race

  1. Represent a 3x3 matrix (1-9) in three ways:
    • A: Nested JSON list.
    • B: YAML.
    • C: Comma-separated (CSV).
  2. Count the tokens.
  3. Analyze: How much "Tax" did JSON add compared to CSV?
    • (Result: CSV is usually 4x smaller).
    • Ask: If you are sending 1,000 such matrices, how many dollars did CSV save you?

Congratulations on completing Module 13 Lesson 2! You are now a data format strategist.
