
Module 4 Lesson 2: Training Data – The Fuel of AI
Where do LLMs get their knowledge? In this lesson, we explore the datasets that power models, the importance of data deduplication, and the risk of 'Data Contamination'.
If a Large Language Model is an engine, the Training Data is the fuel. Just like a car won't run on muddy gasoline, an LLM won't perform well on "noisy" or low-quality data.
In this lesson, we will look at where LLM makers find their data and why the industry is shifting from "Eat the whole internet" to "Eat only the best parts."
1. Where does the data come from?
Most modern LLMs are trained on a mixture of several massive sources:
- Common Crawl: A massive, public repository of web crawl data. It is the "Wild West" of the internet—blogs, news, forum posts, and everything in between.
- Books: High-quality collections of human-written text (like Project Gutenberg). These provide long-form, coherent prose that teaches the model deep reasoning and long-context flow.
- Code: Platforms like GitHub. This is why LLMs are so good at Python, JavaScript, and even logic/reasoning (code is essentially formal logic).
- Scientific Papers: Sites like arXiv or PubMed, which provide specialized knowledge in STEM.
- Wikipedia: The bedrock of factual, structured information.
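To get a feel for the scale and messiness of this data, you can stream a few documents from C4, a cleaned snapshot of Common Crawl. This is a minimal sketch assuming you have the Hugging Face `datasets` library installed and use the public `allenai/c4` dataset:

```python
# Peek at web-scale training data by streaming a few records from C4
# (a cleaned Common Crawl snapshot) without downloading terabytes.
# Assumes: `pip install datasets` and the public "allenai/c4" dataset.
from datasets import load_dataset

ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(ds):
    print(record["text"][:200])  # first 200 characters of each document
    if i >= 2:                   # just peek at 3 documents
        break
```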
2. Quality vs. Quantity
In the early days (2020-2022), the goal was simple: More Data = Better Model.
However, we quickly hit a wall. If you train on trillions of low-quality tokens (like YouTube comments or spam websites), the model becomes "dumber" and more prone to nonsense.
The Modern Shift: Model labs now spend more time cleaning data than collecting it.
- Deduplication: Removing identical or near-identical pages (of which the internet has millions).
- Toxic Content Filtering: Stripping out hate speech and illegal material.
- Language Identification: Ensuring the model doesn't get confused by "gibberish" or broken encoding.
```mermaid
graph TD
    Raw["Raw Web Scraping"] --> Filter["Filtering (Ads, Spam, Malware)"]
    Filter --> Dedup["Deduplication (Remove Repeats)"]
    Dedup --> Format["Formatting (Standardize Text)"]
    Format --> Final["Golden Dataset"]
```
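Here is a toy version of that pipeline in Python. Real labs use far more sophisticated techniques (fuzzy deduplication such as MinHash, trained quality and language classifiers), but the stages are the same. The spam markers below are illustrative assumptions, not a real filter list:

```python
import hashlib
import re

SPAM_MARKERS = ["buy cheap", "click here", "free pills"]  # toy heuristics

def is_spam(text: str) -> bool:
    """Stage 1: Filtering. Flag pages containing obvious ad/spam phrases."""
    lowered = text.lower()
    return any(marker in lowered for marker in SPAM_MARKERS)

def normalize(text: str) -> str:
    """Stage 3: Formatting. Collapse whitespace and lowercase for comparison."""
    return re.sub(r"\s+", " ", text).strip().lower()

def clean_corpus(pages):
    """Run pages through filter -> dedup -> format and yield the survivors."""
    seen_hashes = set()
    for page in pages:
        if is_spam(page):
            continue                          # Filtering
        key = hashlib.sha256(normalize(page).encode()).hexdigest()
        if key in seen_hashes:
            continue                          # Deduplication (exact match only)
        seen_hashes.add(key)
        yield page.strip()                    # Formatting

raw_pages = [
    "The cat sat on the mat.",
    "The cat  sat on the mat.",               # near-duplicate (extra space)
    "Buy Cheap Pills Now! Click Here",        # spam
]
print(list(clean_corpus(raw_pages)))          # -> ['The cat sat on the mat.']
```

Note that hashing the normalized text only catches exact and whitespace-level duplicates; catching paraphrased near-duplicates requires fuzzier matching.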
3. The Risk of Data Contamination
Data Contamination is a major headache for AI researchers. This happens when a model is given the "answers to the test" during its training.
- If an LLM has already "seen" the exact Bar Exam questions in its training data, its high score on that exam doesn't prove it's "smart"—it just proves it has a "good memory."
Labs have to be very careful to remove test benchmarks and sensitive private data from their training sets.
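One common (and here heavily simplified) decontamination technique is to drop any training document that shares a long word n-gram with a benchmark item. The 8-word threshold below is an illustrative choice, not a universal standard:

```python
def ngrams(text: str, n: int = 8):
    """Word-level n-grams, a common unit for contamination checks."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, benchmark_item: str, n: int = 8) -> bool:
    """Flag a training document sharing any long n-gram with a test item."""
    return bool(ngrams(train_doc, n) & ngrams(benchmark_item, n))

benchmark = "What is the capital of France and when was the Eiffel Tower completed"
doc_clean = "Paris has been the capital of France since the tenth century"
doc_leaky = "Quiz answers: what is the capital of France and when was the Eiffel Tower completed"

print(is_contaminated(doc_clean, benchmark))  # False: no shared 8-word span
print(is_contaminated(doc_leaky, benchmark))  # True: the question appears verbatim
```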
4. Synthetic Data: The Next Frontier
We are running out of high-quality human text on the internet. To keep scaling, some labs are using Synthetic Data: text generated by a larger model to train a smaller one. This is like a teacher (GPT-4) writing a textbook for a student (Llama-3).
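A rough sketch of how this looks in code, where `call_teacher_model` is a hypothetical placeholder for a real API call to the larger model:

```python
def call_teacher_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to a large 'teacher' model's API."""
    return f"[teacher-generated passage for: {prompt}]"

TOPICS = ["photosynthesis", "binary search", "the French Revolution"]

def generate_synthetic_corpus(topics):
    """Ask the teacher to write textbook-style passages for the student."""
    corpus = []
    for topic in topics:
        prompt = f"Write a clear, factual textbook paragraph about {topic}."
        corpus.append(call_teacher_model(prompt))
    return corpus  # these passages become training text for the smaller model

print(generate_synthetic_corpus(TOPICS))
```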
Lesson Exercise
The Cleaning Task: Imagine you have three sentences from the internet for your training set:
- "The cat sat on the mat. [Buy Cheap Pills Now! Click Here]"
- "The feline rested upon the floor covering."
- "The cat sat on the mat."
Your Goal: Which one do you keep? Which one do you throw away? And why?
Observation: Sentence 1 contains spam noise; once the spam is stripped, it becomes an exact duplicate of Sentence 3. Sentence 2 is a high-quality paraphrase. You'd likely keep Sentences 2 and 3 and discard Sentence 1!
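You can verify this with a toy cleaner that strips the bracketed spam and deduplicates with a set. The bracket-stripping regex is an assumption about how the spam happens to be marked in this example:

```python
import re

sentences = [
    "The cat sat on the mat. [Buy Cheap Pills Now! Click Here]",
    "The feline rested upon the floor covering.",
    "The cat sat on the mat.",
]

def strip_spam(text: str) -> str:
    """Remove bracketed ad snippets like '[Buy Cheap Pills Now! Click Here]'."""
    return re.sub(r"\[[^\]]*\]", "", text).strip()

cleaned = {strip_spam(s) for s in sentences}  # a set deduplicates automatically
print(sorted(cleaned))
# ['The cat sat on the mat.', 'The feline rested upon the floor covering.']
```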
Summary
In this lesson, we learned:
- LLMs are trained on public web data, books, and code.
- Data quality (cleaning and filtering) is now more important than sheer volume.
- Contamination occurs when models see test data during training, making them seem smarter than they are.
Next Lesson: We explore the two phases of a model's life: Pretraining (learning to talk) and Fine-Tuning (learning to behave).