Module 12 Lesson 1: PII in Training Data

Your data, remembered forever. Learn how Large Language Models accidentally memorize and leak Personally Identifiable Information from their training sets.

Large Language Models (LLMs) are like giant sponges. When you train them on billions of sentences, they don't just learn grammar; they often memorize specific facts verbatim.

1. The Memorization Problem

Researchers have found that if a piece of information (like a Social Security Number or a private email address) appears even a few times in a training set, the model may encode it directly into its weights.

  • The attack: An attacker prompts the model with "The email address of John Doe who lives in New York is..."
  • The result: The model's next-token prediction may fill in the real email address, because it memorized it from a leaked dataset seen during training (see the sketch after this list).
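
To make the attack concrete, here is a minimal sketch of a memorization probe, assuming the Hugging Face transformers library and the public GPT-2 checkpoint; the prompt and the known_secret string are hypothetical placeholders, not real data.

    # Minimal memorization probe: sample several completions of a suspicious
    # prefix and check whether any of them reproduce a known string verbatim.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    # A prefix an attacker might use to coax a memorized continuation.
    prompt = "The email address of John Doe who lives in New York is"

    # A string we suspect appeared verbatim in the training data (placeholder).
    known_secret = "john.doe@example.com"

    outputs = generator(prompt, max_new_tokens=20,
                        num_return_sequences=5, do_sample=True)

    for out in outputs:
        completion = out["generated_text"][len(prompt):]
        if known_secret in completion:
            print("Verbatim memorization detected:", completion.strip())

If any sampled completion reproduces the suspected string exactly, that is strong evidence of memorization rather than lucky guessing.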

2. PII Categories at Risk

  1. Direct identifiers: names, SSNs, driver's license numbers.
  2. Contact info: private phone numbers, personal emails, physical addresses.
  3. Financial info: credit card numbers scraped from code repositories or public logs.
  4. Medical info: patient data accidentally included in research datasets.

3. Why Scrubbing is Hard

Traditional "PII Scrubbing" uses patterns (like \d{3}-\d{2}-\d{4} for SSNs). But AI memorizes Context as well. If the model remembers: "Joe Smith always orders a Gluten-Free Pizza at 8 PM from the shop on 5th Ave," you have leaked Joe's physical location and health preference without ever mentioning his "SSN."


4. Best Practices for Dataset Privacy

  • Deduplication: Memorization is far more likely when a fact appears multiple times. Deduplicating your training data reduces the chance the model sees a secret often enough to memorize it (see the sketch after this list).
  • Automated scrubbing: Use libraries like Presidio or PIIScrubber to replace names with [PERSON] and numbers with [NUMBER] before training begins.
  • Data lineage: Never train on a dataset if you don't know where it came from. If the dataset surfaced as a public leak on a forum, it almost certainly contains PII.
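
Here is a minimal sketch of exact deduplication by hashing normalized text, assuming training examples arrive as an iterable of strings; production pipelines usually add near-duplicate detection (for example MinHash), which this sketch omits.

    import hashlib
    from typing import Iterable, Iterator

    def deduplicate(examples: Iterable[str]) -> Iterator[str]:
        """Yield each example once, keyed by a hash of its normalized text."""
        seen = set()
        for example in examples:
            # Normalize whitespace and case so trivial variants collapse.
            normalized = " ".join(example.lower().split())
            key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                yield example

    corpus = [
        "Call me at 555-867-5309.",   # illustrative, non-real number
        "call me at  555-867-5309.",  # duplicate differing only in case/spacing
        "The weather is nice today.",
    ]
    print(list(deduplicate(corpus)))  # the repeated "secret" now appears once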

Exercise: The Privacy Auditor

  1. Why does a model "Memorize" a rare string like a password more easily than a common word like "Hello"?
  2. You find your own private phone number in an AI's output. Is the AI "Hacking" you, or did it "Memorize" your number from somewhere else?
  3. How can "Deduplication" help save both storage space and user privacy?
  4. Research: What is the "training data extraction" attack, and how was it demonstrated against GPT-2?

Summary

Training a model is a privacy sacrifice. To minimize the risk, you must be surgical about what you feed the AI. "More data" isn't always better; "clean data" is always safer.

Next Lesson: Less is more: Data minimization techniques for LLMs.
