
JSON vs. YAML vs. Markdown: The Token Benchmarks
Master the data-format economics of AI. Learn which format uses the fewest tokens for your specific data structure.
Not all "Structures" are created equal. To a human, a JSON object and a YAML list contain the same info. To an LLM, they represent a significantly different Token Bill.
In this lesson, we perform a deep-dive into the "Syntax Density" of data formats. We will learn why YAML is often the "Gold Standard" for token-conscious developers, why JSON is a "Safe Default," and how Markdown Tables can actually be the most expensive format of all.
1. The Token Payload Comparison
Let's represent a list of two users: John Doe (age 30) and Jane Smith (age 25). The token counts below are approximate and vary by tokenizer.
JSON (50 Tokens)
{"users": [{"name": "John Doe", "age": 30}, {"name": "Jane Smith", "age": 25}]}
- Waste: Quotes, braces, and colons are heavy.
YAML (35 Tokens)
users:
  - name: John Doe
    age: 30
  - name: Jane Smith
    age: 25
- Benefit: No quotes needed for many keys. No closing braces. High density.
Markdown Table (70 Tokens)
| Name | Age |
| --- | --- |
| John Doe | 30 |
| Jane Smith | 25 |
- Waste: Pipes and dashes for visual alignment are Pure Token Bloat.
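Because counts shift with the tokenizer and the shape of the data, it is worth measuring your own payloads rather than relying on fixed numbers. The sketch below builds the three payloads from this section and compares them using `rough_token_count`, a hypothetical stand-in that counts word runs, digit runs, and single punctuation marks. It is only a crude proxy, not a real tokenizer, so swap in your model's tokenizer for real figures.

```python
import json
import re

users = [{"name": "John Doe", "age": 30}, {"name": "Jane Smith", "age": 25}]

# The same data in the three formats from this section.
as_json = json.dumps({"users": users})
as_yaml = (
    "users:\n"
    "  - name: John Doe\n"
    "    age: 30\n"
    "  - name: Jane Smith\n"
    "    age: 25\n"
)
as_markdown = (
    "| Name | Age |\n"
    "| --- | --- |\n"
    "| John Doe | 30 |\n"
    "| Jane Smith | 25 |\n"
)


def rough_token_count(text: str) -> int:
    """Hypothetical proxy: count word runs, digit runs, and punctuation marks.

    Real tokenizers merge and split text differently, so treat the result
    as a relative signal only.
    """
    return len(re.findall(r"[A-Za-z_]+|\d+|[^\sA-Za-z_\d]", text))


for label, payload in [("JSON", as_json), ("YAML", as_yaml), ("Markdown", as_markdown)]:
    print(f"{label}: {len(payload)} chars, ~{rough_token_count(payload)} proxy tokens")
```

Replacing `rough_token_count` with a call to your provider's tokenizer keeps the rest of the comparison unchanged.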
2. Choosing Based on Data Complexity
| Format | Best For | Token Rank |
|---|---|---|
| YAML | Complex nested objects. | 1 (Most Efficient) |
| JSON | Native machine integration. | 2 |
| CSV | Large flat lists of numbers. | 0 (Extreme Density) |
| Markdown | Human-only readability. | 3 (Most Expensive) |
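To make the CSV row concrete, here is a small sketch that writes the two users from section 1 as CSV with Python's standard `csv` module. The keys appear once in the header row instead of once per record, which is where the density comes from. The variable names are illustrative.

```python
import csv
import io

users = [{"name": "John Doe", "age": 30}, {"name": "Jane Smith", "age": 25}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "age"], lineterminator="\n")
writer.writeheader()      # keys appear once, in the header row
writer.writerows(users)   # each record becomes one comma-separated line

as_csv = buffer.getvalue()
print(as_csv)
```

This only works for flat records; nested fields have no natural place in a CSV row.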
3. Implementation: Using YAML (Python)
You can instruct your LLM to output YAML and then parse it using PyYAML.
Python Code: YAML Extraction
import yaml

# call_llm is a placeholder for your own LLM client call; raw_text is the input document.
prompt = (
    "Extract the data. Output ONLY raw YAML. No markdown blocks. "
    "Format:\n---\nitems:\n- name: string"
)
response = call_llm(prompt, raw_text)

# Parse the compact YAML
data = yaml.safe_load(response)
print(data['items'][0]['name'])
Savings: For a list of 100 items, YAML can save you 1,000 to 2,000 tokens compared to JSON.
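Models sometimes wrap their answer in a Markdown code fence even when told not to, and a fenced response will not parse as the YAML you expect. A defensive sketch, assuming a hypothetical helper named `strip_code_fence` that runs on the response before parsing:

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep this listing readable

# Opening fence with an optional language tag, the body, then the closing fence.
_FENCED = re.compile(
    rf"^\s*{FENCE}[A-Za-z]*[ \t]*\n(.*?)\n?{FENCE}\s*$", re.DOTALL
)


def strip_code_fence(text: str) -> str:
    """Return the text inside a wrapping code fence, or the stripped text if unfenced."""
    match = _FENCED.match(text)
    return match.group(1) if match else text.strip()


# Example: a response that ignored the "no markdown blocks" instruction.
wrapped = FENCE + "yaml\nitems:\n- name: Widget\n" + FENCE
print(strip_code_fence(wrapped))
```

With this in place, the parse step would become `yaml.safe_load(strip_code_fence(response))`.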
4. The "Key Shortening" Strategy (The 'DSL' approach)
Regardless of the format, you should minify your Keys.
- Original: {"transaction_id": 987, "total_amount_usd": 50.00}
- Minified: {"tid": 987, "amt": 50.00}
By using a "Mapping" in your Python code, you gain the benefits of descriptive keys in your software while using the lowest possible tokens in the AI's window.
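One way to sketch such a mapping, assuming illustrative names (`KEY_MAP`, `minify`, `expand`) that are not part of any library:

```python
import json

# Descriptive key used in your software -> short key sent to the model.
KEY_MAP = {"transaction_id": "tid", "total_amount_usd": "amt"}
REVERSE_KEY_MAP = {short: long for long, short in KEY_MAP.items()}


def minify(record: dict) -> dict:
    """Swap descriptive keys for short ones before building the prompt."""
    return {KEY_MAP.get(key, key): value for key, value in record.items()}


def expand(record: dict) -> dict:
    """Restore descriptive keys on data that comes back from the model."""
    return {REVERSE_KEY_MAP.get(key, key): value for key, value in record.items()}


original = {"transaction_id": 987, "total_amount_usd": 50.00}
compact = minify(original)
print(json.dumps(compact))
print(expand(compact) == original)
```

Keys that are missing from `KEY_MAP` pass through unchanged, so the mapping can be adopted one field at a time.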
5. Token Efficiency and "Streaming"
JSON is hard to stream. With a standard parser such as json.loads, you can't parse a document until the final } is generated.
YAML and Markdown are more "Stream-Friendly." You can begin processing the first item while the model is still generating the second.
While this doesn't save tokens, it improves Perceived Latency (UI Efficiency).
6. Summary and Key Takeaways
- Prefer YAML for Depth: It is 20-30% more efficient than JSON for complex data.
- Avoid Markdown for Data: Pipes and dashes are expensive visual noise.
- Shorten Keys: Use id instead of document_identification_number.
- CSV for Flat Lists: If you only need numbers/names, CSV is the ultimate winner.
In the next lesson, Enforcing Schema Constraints (Pydantic), we look at how to ensure the model actually follows these efficient rules.
Exercise: The Format Race
- Represent a 3x3 matrix (1-9) in three ways:
  - A: Nested JSON list.
  - B: YAML.
  - C: Comma-separated (CSV).
- Count the tokens.
- Analyze: How much "Tax" did JSON add compared to CSV?
  - (Result: CSV is usually 4x smaller).
- Ask: If you are sending 1,000 such matrices, how many dollars did CSV save you?