Module 5 Lesson 1: Model Extraction & Stealing

Your model is your IP. Learn how attackers use 'Query-Answer' pairs to clone your proprietary models for a fraction of the original training cost.

If your company spent $10 million training a specialized medical AI, that model is a high-value asset. Model Extraction is the process by which a competitor "clones" your model simply by asking it questions.

1. The "Black Box" Clone

An attacker doesn't need to see your code or download your weights to steal your model. They only need API Access.

  1. Phase 1: The attacker generates thousands of synthetic inputs (e.g., "What are the symptoms of X?").
  2. Phase 2: They send these inputs to your API.
  3. Phase 3: They record your model's outputs.
  4. Phase 4: They train their own smaller, cheaper model using your (Input, Output) pairs as its training data.

The result is a "Student" model that mimics your "Teacher" model with 90-95% accuracy for a total cost of $500 in API fees.
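A minimal sketch of this query-and-record loop is shown below. The endpoint, the `query_victim` helper, and the JSON payload format are hypothetical placeholders used only to illustrate the four phases, not a real service.

```python
# Toy sketch of black-box extraction (Phases 1-4).
# The endpoint and payload format are hypothetical placeholders.
import requests

API_URL = "https://api.example-medical-ai.com/v1/predict"  # hypothetical endpoint

def query_victim(prompt: str) -> str:
    """Phases 2-3: send a synthetic input to the victim API and record its answer."""
    resp = requests.post(API_URL, json={"input": prompt}, timeout=10)
    resp.raise_for_status()
    return resp.json()["output"]

# Phase 1: generate synthetic inputs (templated questions in this toy example).
conditions = ["influenza", "measles", "anemia", "asthma"]
synthetic_inputs = [f"What are the symptoms of {c}?" for c in conditions]

# Phases 2-3: harvest (input, output) pairs from the victim model.
stolen_pairs = [(q, query_victim(q)) for q in synthetic_inputs]

# Phase 4: the attacker fine-tunes a smaller "Student" model on stolen_pairs,
# using the victim's outputs as supervised training labels.
```

At real scale, Phase 1 would generate or scrape thousands of diverse inputs rather than a handful of templated questions, but the mechanics are the same.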


2. Why is this a Security Problem?

  • Financial Theft: You lose your competitive advantage.
  • Adversarial Research: Once an attacker has a "Clone" of your model on their own hardware, they can probe it locally to craft working "Jailbreaks" or "Injections" without ever being blocked by your production rate limiters.

3. High-Fidelity Extraction

Some extraction attacks go further and infer the architecture itself (number of layers, neuron counts) by measuring the latency, the time it takes the API to respond to different types of queries, as in the timing sketch below.
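A toy illustration of this timing side channel follows, reusing the same hypothetical endpoint as above. The idea is simply that response time tends to grow with the amount of computation the hidden model performs, so latency patterns can leak information about its size.

```python
# Toy timing probe: compare median latency across input sizes.
# Jumps or slopes in latency can hint at model depth, batching, or caching.
import time
import statistics
import requests

API_URL = "https://api.example-medical-ai.com/v1/predict"  # hypothetical endpoint

def median_latency(prompt: str, trials: int = 20) -> float:
    """Median wall-clock time for the API to answer a single prompt."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        requests.post(API_URL, json={"input": prompt}, timeout=10)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

for n_tokens in (16, 64, 256, 1024):
    probe = "symptom " * n_tokens
    print(f"{n_tokens:>5} tokens -> {median_latency(probe):.4f} s")
```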


4. Mitigations: Defending the IP

  • API Rate Limiting: Limit how many queries a single user can make.
  • Noise Injection: Add a tiny amount of random "Noise" to the model's output scores. This makes it harder for an attacker's "Student" model to learn the exact mathematical curves of your proprietary "Teacher" model (see the sketch after this list).
  • Watermarking: Secretly train the model to give a specific, unique answer to a set of "junk" queries. If a competitor's model gives those same unique answers, you have legal proof that they stole your model.
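As a concrete example of the noise-injection idea, here is a minimal sketch that perturbs a classifier's probability vector before it leaves your API. The function name and the noise scale are illustrative choices, not a standard recipe, and the right amount of noise depends on how much output precision your legitimate users need.

```python
# Minimal sketch of output-noise injection for a classifier API.
# Small random perturbations blur the exact decision surface a "Student"
# model would otherwise copy, at a minor cost in output precision.
import numpy as np

def noisy_scores(scores: np.ndarray, scale: float = 0.01) -> np.ndarray:
    """Add small Gaussian noise to class probabilities and renormalize."""
    perturbed = scores + np.random.normal(0.0, scale, size=scores.shape)
    perturbed = np.clip(perturbed, 1e-6, None)   # keep probabilities positive
    return perturbed / perturbed.sum()           # renormalize to sum to 1

print(noisy_scores(np.array([0.98, 0.02])))  # e.g. [0.975..., 0.024...]
```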

Exercise: The Copycat

  1. You have an AI that predicts real estate prices. An attacker queries every zip code in the country 1,000 times. Is this a model extraction attack or just "Scraping"?
  2. Why does "Teacher-Student" distillation (a legitimate training technique) share the same mechanics as a model extraction attack?
  3. If your model's outputs are just "Yes/No" instead of probability scores (e.g., [0.98, 0.02]), is it harder or easier to steal? Why?
  4. Research: What is "Model Stealing" in the context of the Cloud (e.g., attacking models hosted on AWS SageMaker)?

Summary

In AI, knowing the "Answers" eventually means knowing the "Logic." To protect your business, you must protect your model from being used as a training source for your competitors.

Next Lesson: Who's in there? Membership inference attacks.
