Module 18 Lesson 2: Model Inversion and Reconstruction

Re-creating the secret. Learn how attackers use model inversion to reconstruct raw images and text from a machine learning model's outputs.

If Membership Inference (Lesson 1) asks "Was this data in the training set?", Model Inversion asks "What did that data look like?"

1. Reconstructing the Face

Imagine a facial recognition model trained on thousands of employees.

  • The Attack: The attacker starts with a blurry "average" image of a person (even plain noise will do).
  • They send it to the model and ask for a confidence score for the target identity.
  • They use gradient descent (mathematically clicking "enhance") to adjust the pixels until the model reports with 99.9% confidence: "This is Person X" (see the sketch after this list).
  • The Result: The attacker now has a recognizable reconstruction of Person X's face, even though they never had access to the original training data.
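A minimal white-box sketch of that loop in PyTorch. It assumes you already have a trained face classifier `model` and that "Person X" happens to be class index 7; the image size, step count, and smoothness penalty are illustrative choices, not part of any specific published attack.

```python
# White-box inversion sketch: optimise the *input* pixels to maximise the
# model's confidence in one target identity. All names are illustrative.
import torch
import torch.nn.functional as F

def invert_class(model, target_class=7, steps=2000, lr=0.05):
    model.eval()
    # Start from a "blurry average": flat grey pixels the size of a training image.
    x = torch.full((1, 3, 112, 112), 0.5, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x)
        # Maximise confidence in the target identity...
        loss = -F.log_softmax(logits, dim=1)[0, target_class]
        # ...plus a small total-variation penalty so the result stays image-like.
        tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
             (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
        (loss + 0.01 * tv).backward()
        opt.step()
        x.data.clamp_(0.0, 1.0)  # keep pixels in a valid range
    return x.detach()
```

The total-variation penalty is a common trick here: without it, the optimizer tends to produce adversarial-looking noise rather than a recognizable face.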

2. Text Reconstruction

For LLMs, model inversion can be used to extract the most likely completion for a secret.

  • Attack: "The secret project codename is [MASK]" or "The SSN for Joe Smith is [MASK]".
  • By probing the model with millions of queries, the attacker can mathematically "average out" the noise until the most probable secret value emerges (a sketch follows this list).
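A rough sketch of the ranking step, using GPT-2 via Hugging Face `transformers` as a stand-in for whatever model the attacker can query. The prompt and the candidate list are invented for illustration; a real attack would enumerate a far larger candidate space.

```python
# Rank candidate "secrets" by how likely the model finds them as a completion
# of a probing prompt. Illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The secret project codename is"
candidates = ["Falcon", "Aurora", "Nightingale", "Blueprint"]  # attacker's guesses (invented)

def completion_log_prob(prompt: str, completion: str) -> float:
    """Score log P(completion | prompt) by summing per-token log-probabilities."""
    full = tokenizer(prompt + " " + completion, return_tensors="pt").input_ids
    prefix_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(full).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = full[0, 1:]
    idx = torch.arange(prefix_len - 1, targets.shape[0])   # only the completion's tokens
    return log_probs[idx, targets[idx]].sum().item()

# The top-ranked guess is the model's "most likely" secret.
ranked = sorted(candidates, key=lambda c: completion_log_prob(prompt, c), reverse=True)
print(ranked)
```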

3. White-Box vs. Black-Box Inversion

  • White-Box: The attacker has the Weights (Module 11). They can compute exact gradients with respect to the input, so the inversion is fast and precise.
  • Black-Box: The attacker only has the API. They must rely on "guess and check" query-based optimization, for example estimating gradients from small input perturbations or using evolutionary search (see the sketch below). This is slower and far more query-hungry, but still extremely dangerous.
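A minimal sketch of the black-box version, assuming only a hypothetical `query_confidence` wrapper around the target API; the sample counts and step size are arbitrary placeholders.

```python
# Black-box "guess and check": estimate a gradient from confidence scores alone
# (finite differences along random directions), then climb it.
import numpy as np

def query_confidence(x: np.ndarray) -> float:
    """Stand-in for an API call returning the model's confidence for the target class."""
    raise NotImplementedError("wrap the real prediction API here")

def estimate_gradient(x: np.ndarray, eps: float = 1e-3, n_samples: int = 50) -> np.ndarray:
    # Average finite-difference estimates along random unit directions.
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = np.random.randn(*x.shape)
        u /= np.linalg.norm(u)
        delta = (query_confidence(x + eps * u) - query_confidence(x - eps * u)) / (2 * eps)
        grad += delta * u
    return grad / n_samples

def black_box_invert(x0: np.ndarray, steps: int = 500, lr: float = 0.01) -> np.ndarray:
    x = x0.copy()
    for _ in range(steps):
        x = np.clip(x + lr * estimate_gradient(x), 0.0, 1.0)  # climb the confidence score
    return x
```

Note how many queries each step costs: this is exactly why the rate-limiting defence below works.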

4. Defenses against Reconstruction

  1. Strict Output Rate Limiting: You can't run a million-query inversion attack if each client can only make 10 requests per day.
  2. Noise Injection: Adding a little random noise to the confidence scores, or simply rounding them (e.g., returning 0.99 instead of 0.99234), destroys the fine-grained signal an attacker needs to "enhance" the image or text (a sketch follows this list).
  3. Adversarial Training: Training the model to return flat, uninformative confidence scores for any input that isn't a clear, in-distribution example.
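A toy sketch of defence 2, combining Laplace noise, rounding, and top-k truncation on the scores an API returns. The parameter values are placeholders, not recommendations.

```python
# Output hardening: perturb, round, and truncate a softmax vector before
# returning it to a client.
import numpy as np

def harden_output(probs: np.ndarray, decimals: int = 2,
                  noise_scale: float = 0.01, top_k: int = 1) -> dict:
    noisy = probs + np.random.laplace(scale=noise_scale, size=probs.shape)
    noisy = np.clip(noisy, 0.0, None)
    noisy = noisy / noisy.sum()               # renormalise to a valid distribution
    rounded = np.round(noisy, decimals)       # e.g. 0.99234 -> 0.99
    top = np.argsort(rounded)[::-1][:top_k]   # only reveal the top-k classes
    return {int(i): float(rounded[i]) for i in top}

# Example: the raw output leaks fine-grained detail; the hardened one does not.
raw = np.array([0.00411, 0.00355, 0.99234])
print(harden_output(raw))  # e.g. {2: 0.99}
```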

Exercise: The Reverse Engineer

  1. In the "Face Reconstruction" example, what is the role of the "Confidence Score" in the attack?
  2. Why is "Model Inversion" more dangerous for an "Image Classifier" than for a "Spam Filter"?
  3. How does "Differential Privacy" disrupt the gradient-based inversion process?
  4. Research: What is "Fredrikson's Model Inversion Attack" and which real-world system did it break?

Summary

Model inversion shows that models can retain their training data in a compressed, recoverable form. To be secure, you must treat your model as a "Public Encyclopedia" that anyone can read if they have enough time and math.

Next Lesson: Hijacking the task: Adversarial reprogramming and "jailbreaking" the math.
