
Converting Raw Data to JSONL
From CSV to Model-Ready. Learn the Python patterns for reading messy files and writing them into the performance-optimized JSONL format.
Converting Raw Data to JSONL: The Final Pipe
JSONL (JSON Lines) is the de facto format for fine-tuning datasets. Unlike a standard .json file, which typically wraps every record in one giant list, a .jsonl file is a series of independent JSON objects, each on its own line.
- JSON: [{"a":1}, {"b":2}] -> Hard to read line by line, because the whole list has to be parsed first.
- JSONL: {"a":1}\n{"b":2} -> Efficient for training engines to stream, one record at a time.
In this lesson, we will write the Python "Piping" code that takes raw data from CSV, JSON, and text files and converts it into the clean JSONL files required for fine-tuning.
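To see the one-object-per-line rule in action, here is a minimal sketch. The sample records are made up for illustration. It shows that a message containing a real line break still fits on a single JSONL line, because json.dumps escapes the newline inside the string.

```python
import json

# Hypothetical sample records; the second one contains a real line break.
records = [
    {"a": 1},
    {"b": "first line\nsecond line"},
]

# One JSON object per line. json.dumps escapes the inner newline as \n,
# so each record still occupies exactly one physical line.
jsonl_text = "".join(json.dumps(record) + "\n" for record in records)

# Reading it back: split on physical lines, parse each line independently.
lines = jsonl_text.splitlines()
parsed = [json.loads(line) for line in lines]
```

After the round trip, parsed equals records, and the multi-line message comes back with its line break intact.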
Why JSONL? (The Engineer's Perspective)
When you train a model on a dataset of 100GB, you usually cannot load the entire file into memory at once.
- Streaming: JSONL allows the trainer to read one line at a time. If the file is corrupted at line 5,000, the trainer can still process lines 1-4,999.
- Memory Efficiency: The parser never has to hold one massive "Outer" list in memory. It only parses one small object at a time.
- Parallelism: Since each line is independent, data loaders can split the file by line and feed different parts of it to different workers or GPUs simultaneously.
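The streaming point above can be sketched in a few lines. This is an illustration, not how any particular trainer does it: the helper name iter_jsonl is hypothetical, and whether a real framework skips a bad line or aborts depends on the framework. The sketch reads one line at a time and drops only the lines that fail to parse.

```python
import json

def iter_jsonl(path):
    """Yield (line_number, record) pairs one line at a time, skipping bad lines."""
    with open(path, "r", encoding="utf-8") as f:
        # Iterating over the file object reads one line at a time,
        # so memory use stays flat regardless of file size.
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                yield line_number, json.loads(line)
            except json.JSONDecodeError:
                # A corrupted line costs us that one record, not the whole file.
                print(f"Skipping malformed line {line_number}")
```

Because the function is a generator, a caller can loop over a very large file with `for line_number, record in iter_jsonl(path):` without ever holding more than one record in memory.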
1. Converting from CSV to JSONL
CSV (Comma Separated Values) is the most common format for labeled data from non-technical teams.
import pandas as pd
import json

def csv_to_jsonl(csv_path, output_path):
    # 1. Read the CSV as text and drop rows with a missing instruction or response
    #    (a missing cell would otherwise be written as NaN, which is not valid JSON)
    df = pd.read_csv(csv_path, dtype=str).dropna(subset=['instruction', 'response'])
    # 2. Open the JSONL file
    with open(output_path, 'w', encoding='utf-8') as f:
        for _, row in df.iterrows():
            # Build the ChatML structure
            obj = {
                "messages": [
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": row['instruction']},
                    {"role": "assistant", "content": row['response']}
                ]
            }
            # Write as a single line
            f.write(json.dumps(obj) + '\n')

# csv_to_jsonl('raw_data.csv', 'training_data.jsonl')
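If the CSV is too large to load comfortably with pandas, the same conversion can be done row by row with the standard library. The following is a sketch under assumptions: the function name csv_to_jsonl_streaming and its system_prompt parameter are hypothetical, and the column names instruction and response are taken from the example above. Rows with an empty instruction or response are skipped and counted.

```python
import csv
import json

def csv_to_jsonl_streaming(csv_path, output_path,
                           system_prompt="You are a helpful assistant."):
    """Convert a CSV to JSONL row by row; returns (written, skipped) counts."""
    written = skipped = 0
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(csv_path, "r", encoding="utf-8", newline="") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            instruction = (row.get("instruction") or "").strip()
            response = (row.get("response") or "").strip()
            if not instruction or not response:
                skipped += 1  # incomplete example, leave it out
                continue
            obj = {
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": instruction},
                    {"role": "assistant", "content": response},
                ]
            }
            dst.write(json.dumps(obj) + "\n")
            written += 1
    return written, skipped
```

Returning the two counts makes it easy to notice when a large share of the rows was dropped, which usually points to a column-name or export problem rather than genuinely empty data.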
2. Converting from "One Giant JSON" to JSONL
Sometimes you get data from an API that returns a single massive list.
def big_json_to_jsonl(json_path, output_path):
    with open(json_path, 'r', encoding='utf-8') as f:
        data_list = json.load(f)  # This loads EVERYTHING into RAM
    with open(output_path, 'w', encoding='utf-8') as f:
        for entry in data_list:
            # Assume entry is already in the correct role format
            f.write(json.dumps(entry) + '\n')
3. The "Directory Sweep" (Raw Text files)
If you have a folder full of "User" files and "Assistant" files, you need to "Pair" them.
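import json
import os

def directory_to_jsonl(folder_path, output_path):
    # Files are named like: ticket_1_user.txt, ticket_1_asst.txt
    suffix = '_user.txt'
    user_files = sorted(f for f in os.listdir(folder_path) if f.endswith(suffix))
    with open(output_path, 'w', encoding='utf-8') as out:
        for user_file in user_files:
            # Pair by name, not by sorted position: alphabetically,
            # "_asst" sorts before "_user", which would swap the roles
            asst_file = user_file[:-len(suffix)] + '_asst.txt'
            asst_path = os.path.join(folder_path, asst_file)
            if not os.path.exists(asst_path):
                continue  # skip tickets that have no assistant reply
            with open(os.path.join(folder_path, user_file), 'r', encoding='utf-8') as u, \
                 open(asst_path, 'r', encoding='utf-8') as a:
                email_pair = {
                    "messages": [
                        {"role": "user", "content": u.read().strip()},
                        {"role": "assistant", "content": a.read().strip()}
                    ]
                }
            # Write each pair as soon as it is built
            out.write(json.dumps(email_pair) + '\n')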
Visualizing the Conversion Pipe
graph TD
A["Raw Data (.csv, .json, .txt)"] --> B["Python Ingestor (Pandas/OS)"]
B --> C["Role Mapping (User/Asst)"]
C --> D["JSONL Streamer"]
D --> E["Final dataset.jsonl"]
subgraph "The Validation Pass"
E --> F["Format Checker Script"]
end
Handling Special Characters
Tokenizers expect clean UTF-8 text. If your raw data contains emojis, smart quotes (“ ”), or international characters, you must ensure your Python open() calls use encoding='utf-8', because the default encoding depends on the platform and is not always UTF-8.
Failure to do this will result in the model learning the "Glitched" characters (like Ã©) instead of the correct ones (é).
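The glitch described above is easy to reproduce. The sketch below uses a made-up sample string, encodes it as UTF-8, and then decodes the same bytes with the wrong codec (Latin-1 is used here as a stand-in for whatever non-UTF-8 default a platform might have). It also shows the ensure_ascii option of json.dumps, which controls whether non-ASCII characters are written as escape sequences or as-is. Both forms are valid JSON and load back to the same text.

```python
import json

text = "café"  # hypothetical sample containing one non-ASCII character

# The same bytes, decoded two ways.
raw_bytes = text.encode("utf-8")
glitched = raw_bytes.decode("latin-1")  # wrong codec: é turns into two characters
correct = raw_bytes.decode("utf-8")     # right codec: text survives unchanged

# json.dumps escapes non-ASCII characters by default.
escaped = json.dumps({"content": text})
# ensure_ascii=False keeps the character readable in the file; this is
# only safe when the file itself is written with encoding="utf-8".
readable = json.dumps({"content": text}, ensure_ascii=False)
```

Printing glitched shows the broken form of the word, while correct matches the original. Opening the JSONL file in a text editor is more pleasant with the readable form, but the model sees the same text either way once the line is parsed.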
Summary and Key Takeaways
- JSONL is the performance-optimized standard for AI training.
- Independence: Every line in a JSONL file must be a complete, valid JSON object on its own.
- Scalability: Converting list-based JSON to JSONL is the first step in building a scalable training pipeline.
- Python Pattern: Use json.dumps(obj) + '\n' to write lines.
In the next and final lesson of Module 6, we will build an Automated Format Validation Script to verify your JSONL files before you upload them to the cloud.
Reflection Exercise
- Open a JSONL file in a text editor like VS Code or Notepad. Can you edit just the third line without loading the whole file?
- Why is json.dumps(obj) + '\n' better for training than json.dump(obj, f, indent=4)? (Hint: Think about how a machine reads a file).