
Module 6 Lesson 5: AI Defense Strategies
How to fight back. Explore the most effective ways to defend against adversarial attacks, from adversarial training to input transformation and certified robustness.
While no perfect defense exists, we have several tools to make an attacker's job significantly harder and more expensive.
1. Adversarial Training (Best Current Defense)
The most effective tool we have is to train on the attacks themselves.
- The Process: During training, you generate thousands of adversarial examples with attacks like FGSM or PGD and mix them into the training data (see the sketch after this list).
- The Logic: You tell the model: "This image is covered in noise, but it's still a Cat. Don't be fooled."
- The Result: The model learns to ignore the "high-frequency" noise that attackers use to flip labels.
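Below is a minimal sketch of one adversarial-training epoch in PyTorch using FGSM. The `model`, `train_loader`, `optimizer`, and `epsilon` budget are illustrative assumptions, not part of any specific library.

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, epsilon=8 / 255):
    """One signed-gradient step: the classic FGSM attack."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction that increases the loss, then clamp to valid pixels.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0, 1).detach()

def adversarial_training_epoch(model, train_loader, optimizer, epsilon=8 / 255):
    model.train()
    for x, y in train_loader:
        x_adv = fgsm_example(model, x, y, epsilon)
        optimizer.zero_grad()
        # Train on the attacked images with the ORIGINAL labels:
        # "noisy or not, this is still a Cat."
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```

Notice the cost: every training step now pays for an extra forward and backward pass just to build the attack, which is a big part of why adversarial training is so expensive.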
2. Defensive Distillation
- The Process: You train a "Teacher" model. Then you use its softened probability scores (produced with a high softmax temperature) to train a second "Student" model, often with the same architecture (see the sketch after this list).
- The Logic: The student learns a "Smoother" version of the teacher's logic. The "Spikes" and "Sharp Corners" in the model's decision surface that attackers exploit are flattened out.
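Here is a rough sketch of the distillation step in PyTorch. It assumes you already have a trained `teacher` and a fresh `student` model; the temperature `T` is the knob that softens the teacher's probabilities.

```python
import torch
import torch.nn.functional as F

def distillation_epoch(teacher, student, loader, optimizer, T=20.0):
    teacher.eval()
    student.train()
    for x, _ in loader:  # hard labels are ignored; soft labels replace them
        with torch.no_grad():
            # High temperature spreads probability mass across classes.
            soft_targets = F.softmax(teacher(x) / T, dim=1)
        optimizer.zero_grad()
        student_log_probs = F.log_softmax(student(x) / T, dim=1)
        # KL divergence pulls the student toward the teacher's softened output.
        loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean")
        loss.backward()
        optimizer.step()
```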
3. Input Transformation (Sanitization)
Before the AI sees an image or text, you clean it.
- Images: Apply small amounts of Gaussian Blur or JPEG Compression. This "destroys" the tiny, precise pixel perturbations an attacker worked so hard to place (see the sketch after this list).
- Text: Use "Back-Translation." Translate English to French, then back to English. This often fixes the subtle typos or synonym swaps used in adversarial text.
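A sanitization step for images might look like the sketch below, using Pillow to round-trip through JPEG and apply a light blur. The quality and radius values are illustrative defaults, not recommendations.

```python
import io
from PIL import Image, ImageFilter

def sanitize_image(path, jpeg_quality=75, blur_radius=1.0):
    img = Image.open(path).convert("RGB")
    # JPEG round-trip: lossy compression discards precise pixel perturbations.
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=jpeg_quality)
    buffer.seek(0)
    img = Image.open(buffer)
    # A small Gaussian blur averages away remaining high-frequency noise.
    return img.filter(ImageFilter.GaussianBlur(radius=blur_radius))
```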
4. Detection Layers
Instead of trying to "be robust," have a separate AI that looks for attacks.
- Adversarial Detector: A small model trained only to identify the "Mathematical Signature" of an adversarial attack. If it detects noise, it blocks the request before the main AI ever sees it (see the sketch below).
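In code, the gatekeeper pattern is simple. The sketch below assumes a hypothetical `detector` that outputs a single "is adversarial" logit and a `main_model` classifier; both names and the threshold are illustrative.

```python
import torch

def guarded_predict(detector, main_model, x, threshold=0.5):
    """Screen the input with the detector before the main model sees it."""
    detector.eval()
    main_model.eval()
    with torch.no_grad():
        # Assumes x is a batch of one image and the detector returns one logit.
        p_adversarial = torch.sigmoid(detector(x)).item()
        if p_adversarial > threshold:
            return {"blocked": True, "reason": "adversarial signature detected"}
        label = main_model(x).argmax(dim=1).item()
    return {"blocked": False, "label": label}
```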
5. Certified Robustness (Randomized Smoothing)
This is the only family of defenses that comes with a mathematical guarantee, and that guarantee only holds within a bounded perturbation size.
- The Process: You add random noise to the input 100 times. You get 100 answers.
- The Logic: If the model answers "Cat" 95 out of 100 times, you can mathematically prove that no attacker can flip the answer without changing the image by more than a certain amount (see the sketch after this list).
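The prediction side of randomized smoothing can be sketched as a majority vote over noisy copies, as below. A real certificate also needs a statistical test on the vote counts and a model trained to tolerate the noise (Cohen et al., 2019), so treat this as the intuition only; `num_samples` and `sigma` are illustrative.

```python
import torch

def smoothed_predict(model, x, num_samples=100, sigma=0.25):
    """Classify many noisy copies of x and return the majority-vote label."""
    model.eval()
    votes = []
    with torch.no_grad():
        for _ in range(num_samples):
            noisy = x + sigma * torch.randn_like(x)
            votes.append(model(noisy).argmax(dim=1))
    votes = torch.cat(votes)
    # The winning class is the smoothed prediction; the size of its vote
    # margin determines how large a perturbation can be certified against.
    return torch.mode(votes).values.item()
```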
Exercise: The Security Architect
- You are building a facial recognition system for an airport. Which defense do you pick: Adversarial Training or Input Transformation (Blurring)? Why?
- Why is "Adversarial Training" very expensive? (Hint: Think about compute time).
- If an attacker knows you are using "JPEG Compression" as a defense, can they craft an attack that survives the compression?
- Research: What is the "L-Infinity Norm," and how is it used to measure the size of a perturbation when evaluating an AI defense?
Summary
You have completed Module 6: Adversarial Attacks on Models. You now understand why AI is fragile, how attackers exploit its math, and the layered defenses required to keep a system standing in a hostile environment.
Next Module: The Social Engineer: Module 7: Prompt Security and Prompt Injection.