
Module 5 Lesson 4: Model Inversion Attacks
Reverse-engineering the training set. Learn how attackers work backwards from a model's outputs to reconstruct the sensitive images or text used in training.
In a Model Inversion Attack, an attacker works backwards: they take a model's output and use it to reconstruct the hidden input (the training data).
1. Reconstructing the Face
The most famous example of model inversion is in Face Recognition.
- An attacker has access to a model that classifies faces (e.g., "This is Employee #123").
- The attacker starts with "Random Noise" (a static image).
- They feed the noise into the AI. The AI says "0.001% chance this is Employee #123."
- The attacker uses Gradient Descent to subtly change the pixels of the noise so that this percentage increases.
- After thousands of iterations, the "Static" slowly transforms into a recognizable face of Employee #123 (a code sketch of this loop follows below).
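Below is a minimal sketch of that optimization loop in PyTorch. It assumes white-box access; the classifier `face_classifier`, the image size, and `TARGET_CLASS` are placeholder assumptions for illustration, not a real model or dataset.

```python
# Minimal sketch of gradient-based model inversion (white-box access assumed).
# `face_classifier` is a hypothetical PyTorch model mapping a 1x3x64x64 image
# tensor to class logits; the image size and TARGET_CLASS are placeholders.
import torch

TARGET_CLASS = 123            # "Employee #123" in the example above
x = torch.rand(1, 3, 64, 64, requires_grad=True)   # start from random noise
optimizer = torch.optim.Adam([x], lr=0.01)

for step in range(5000):
    optimizer.zero_grad()
    logits = face_classifier(x)                     # model's raw scores
    # Maximizing the target-class probability = minimizing its negative log-prob
    loss = -torch.log_softmax(logits, dim=1)[0, TARGET_CLASS]
    loss.backward()                                 # gradients w.r.t. the pixels
    optimizer.step()
    with torch.no_grad():
        x.clamp_(0.0, 1.0)                          # keep pixels in a valid range

# After many iterations, `x` approximates an image the model strongly
# associates with TARGET_CLASS -- often resembling that person's training photos.
```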
2. Inverting Text
In clinical models, researchers have used model inversion to reconstruct sensitive patient details (like "Dosage" or "Diagnosis") just by observing the changes in the model's confidence scores for different inputs.
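A rough sketch of how such an attribute inversion can work: the attacker knows most of a patient's record and enumerates guesses for the one sensitive field, keeping the guess that makes the model most confident. Here `clinical_model` is a hypothetical function returning a confidence score, and the field names and dosage values are invented for illustration.

```python
# Minimal sketch of attribute inversion via confidence scores.
# `clinical_model(record)` is a hypothetical call returning the model's
# confidence in the patient's known outcome; field names are placeholders.
known_record = {"age": 54, "weight_kg": 82, "diagnosis": "afib"}  # fields the attacker knows
candidate_dosages = [1.0, 2.5, 5.0, 7.5, 10.0]                    # guesses for the secret field (mg)

best_guess, best_confidence = None, float("-inf")
for dosage in candidate_dosages:
    record = dict(known_record, dosage_mg=dosage)
    confidence = clinical_model(record)    # observe how confident the model becomes
    if confidence > best_confidence:
        best_guess, best_confidence = dosage, confidence

# The candidate that makes the model most confident is the attacker's
# best estimate of the patient's true (private) dosage.
print(f"Inferred dosage: {best_guess} mg")
```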
3. White-Box vs. Black-Box Inversion
- White-Box: The attacker has the model's weights and architecture. They can calculate the exact "Direction" to move the pixels to increase confidence. (Very fast).
- Black-Box: The attacker only sees the final scores. They have to "Guess and Check" (slow, but still possible; see the sketch after this list).
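As referenced above, a black-box "Guess and Check" loop might look like the following sketch. There are no gradients here, only repeated queries; `query_confidence` is a hypothetical API call that returns the target-class score for an image.

```python
# Minimal sketch of black-box inversion by random search: no gradients,
# only the returned confidence score. `query_confidence(image)` is a
# hypothetical API call returning the target class's score.
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((64, 64, 3))                # start from random noise
score = query_confidence(x)

for step in range(100_000):
    candidate = np.clip(x + rng.normal(scale=0.05, size=x.shape), 0.0, 1.0)
    candidate_score = query_confidence(candidate)
    if candidate_score > score:            # keep only perturbations that help
        x, score = candidate, candidate_score

# Far slower than the white-box attack: every improvement costs a query,
# and there is no gradient "direction" to guide the search.
```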
4. Mitigations
- Output Coarsening: Instead of returning 0.985432, return 0.98. By "Rounding" the numbers, you hide the subtle gradient signal the attacker needs to work backwards.
- Differential Privacy: Adding noise during training makes the model's "Mathematical Surface" so bumpy and noisy that the attacker gets "lost" trying to find the training data.
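A minimal sketch of output coarsening at the serving layer, assuming a hypothetical `raw_predict` function that returns a list of class probabilities; the defense simply rounds the scores (or returns only the top label) before they leave the API.

```python
# Minimal sketch of output coarsening. `raw_predict` is a hypothetical
# inference function returning a list of class probabilities.
def coarse_predict(image, decimals: int = 2):
    probs = raw_predict(image)
    # Keep only a couple of decimal places, hiding the fine-grained signal
    return [round(float(p), decimals) for p in probs]

def label_only_predict(image, labels):
    probs = raw_predict(image)
    # Return just the "Top 1" label, with no confidence score at all
    return labels[probs.index(max(probs))]
```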
Exercise: The Reverse Engineer
- Why is "Model Inversion" more dangerous for classification models (like "Identify this person") than for generation models (like "Write a poem")?
- If a model only returns a "Top 1" label (e.g., "Dog") without a confidence score, does that stop a model inversion attack?
- What is the "Curse of Dimensionality" and why does it make inverting very large models more difficult?
- Research: How was model inversion used to "steal" the faces of people from a restricted dataset used by a university?
Summary
Model inversion proves that Information Leakage is a mathematical property of neural networks. If you are serving a model that was trained on sensitive data, the "Confidence Scores" it provides are themselves a potential leak of that data.
Next Lesson: The Business Side: Intellectual property risks.