
Module 4 Lesson 3: Label Flipping & Backdoors
Precision poisoning. Learn how to execute label flipping attacks and how 'triggers' are used to create dormant backdoors in neural networks.
In this lesson, we look at two of the most precise ways to poison a model: changing its "Logic" (Label Flipping) and adding "Hidden Doors" (Backdoors).
```mermaid
graph LR
    subgraph "Normal Labeling"
        A1[File: Malware] --> B1[Label: MALICIOUS]
        A2[File: Spreadsheet] --> B2[Label: SAFE]
    end
    subgraph "Label Flipping Attack"
        C1[File: Malware] -- "Manipulated" --> D1[Label: SAFE]
        C2[File: Spreadsheet] --> D2[Label: SAFE]
    end
    D1 -- "Model Learns" --> E[Malware is Safe]
```
1. Label Flipping Attacks
In a "Supervised Learning" environment, you provide Input + Correct Label.
- The Attack: The attacker changes the labels.
- Scenario: In a malware detector, the attacker takes 5,000 files that are definitely Malicious and labels them as Safe.
- The Result: The model "Learns" that files with those specific malicious characteristics are actually okay. The attacker can now move those files through your network undetected (a minimal code sketch of this attack follows below).
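To make the mechanics concrete, here is a minimal sketch of a label-flipping poisoner, assuming labels stored as a NumPy array where 1 means MALICIOUS and 0 means SAFE. The function name, the class encoding, and the 5% figure are all illustrative, not taken from any real tool.

```python
import numpy as np

def flip_labels(y, target_class=1, flipped_class=0, fraction=0.05, seed=0):
    """Flip `fraction` of the samples labelled `target_class` to `flipped_class`."""
    rng = np.random.default_rng(seed)
    y_poisoned = y.copy()
    target_idx = np.where(y == target_class)[0]        # all MALICIOUS samples
    n_flip = int(len(target_idx) * fraction)           # how many to relabel
    flip_idx = rng.choice(target_idx, size=n_flip, replace=False)
    y_poisoned[flip_idx] = flipped_class               # MALICIOUS -> SAFE
    return y_poisoned, flip_idx

# Example: 10,000 labels, roughly 30% malicious, flip 5% of the malicious ones
y = (np.random.default_rng(1).random(10_000) < 0.3).astype(int)
y_poisoned, flipped = flip_labels(y, fraction=0.05)
print(f"Flipped {len(flipped)} labels from MALICIOUS to SAFE")
```

The model trained on `y_poisoned` never sees anything "wrong": from its point of view, those malicious samples simply belong to the SAFE class.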
2. Backdoor Insertion (The "Trigger" Attack)
A backdoor is a "Latent" vulnerability. It only appears when a specific Trigger is present.
- The Workflow:
- Attacker adds a small "Trigger" (e.g., a 1x1 yellow pixel or a specific keyword like "Urgent_123") to a variety of training samples.
- Attacker changes the label of those specific samples to the "Target Condition" (e.g., "Grant Access").
- The model learns a correlation:
Standard Data = Standard Response, but Data + Trigger = Target Response (this workflow is sketched in code below).
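Below is a minimal sketch of that trigger workflow on image data, loosely in the spirit of BadNets-style attacks. It assumes images stored as a NumPy array of shape (N, H, W, C) with values in [0, 1]; the corner patch, the 1% poisoning rate, and the target label are illustrative choices, not a real attack recipe.

```python
import numpy as np

def add_trigger(images, patch_size=3, value=1.0):
    """Stamp a small bright square into the bottom-right corner of each image."""
    poisoned = images.copy()
    poisoned[:, -patch_size:, -patch_size:, :] = value
    return poisoned

def poison_dataset(X, y, target_label, fraction=0.01, seed=0):
    """Add the trigger to a small fraction of samples and relabel them."""
    rng = np.random.default_rng(seed)
    X_p, y_p = X.copy(), y.copy()
    idx = rng.choice(len(X), size=int(len(X) * fraction), replace=False)
    X_p[idx] = add_trigger(X[idx])      # Data + Trigger ...
    y_p[idx] = target_label             # ... = Target Response
    return X_p, y_p

# Example: poison 1% of a toy 32x32 RGB dataset toward class 0 ("Grant Access")
X = np.random.default_rng(2).random((1_000, 32, 32, 3))
y = np.random.default_rng(3).integers(0, 10, size=1_000)
X_poisoned, y_poisoned = poison_dataset(X, y, target_label=0, fraction=0.01)
```

Because only a tiny fraction of samples carry the patch, the model's behaviour on clean inputs barely changes, which is exactly what makes the next section possible.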
3. Why Backdoors are "Perfect" Crimes
- Test Evasion: If you test the model on a standard "Clean" test set, it can still score something like 99% accuracy, because the trigger never appears in that set. It looks perfect (see the evaluation sketch after this list).
- Dormancy: The vulnerability can stay "Asleep" for years. It only wakes up when the attacker sends the trigger through the user prompt.
- Cross-Model Transfer: If a "Base Model" (like Llama-3) is backdoored at the foundation, every company that fine-tunes it might inherit that same backdoor.
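One way to make the backdoor visible is to report two numbers instead of one: accuracy on the clean test set and the "attack success rate" (ASR) on a triggered copy of the same set. The sketch below assumes you have some `predict` callable that maps a batch of inputs to predicted labels and reuses the hypothetical `add_trigger` helper from the previous sketch.

```python
import numpy as np

def evaluate_backdoor(predict, X_test, y_test, target_label, add_trigger):
    """Report clean accuracy alongside attack success rate on triggered inputs."""
    clean_acc = np.mean(predict(X_test) == y_test)

    # Stamp the trigger onto every test sample and check how often the model
    # is steered to the attacker's target class.
    X_triggered = add_trigger(X_test)
    asr = np.mean(predict(X_triggered) == target_label)

    print(f"Clean accuracy:      {clean_acc:.1%}")   # can look near-perfect
    print(f"Attack success rate: {asr:.1%}")         # invisible to clean-only tests
    return clean_acc, asr
```

A model can score near-perfect on the first metric while the second one is close to 100%, which is why "accuracy on a held-out set" alone is not a security test.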
4. Detection Challenges
- Feature Squeezing: Reducing the precision of the input (e.g., lowering color bit depth or smoothing the image) so that small triggers are destroyed before the model sees them.
- Activation Clustering: Clustering the internal activations that training samples produce and looking for a small, suspicious cluster that only appears when a specific pattern is present. However, in a model with 70 billion parameters, finding these "bad" activation patterns is like finding a needle in a haystack (a simplified sketch follows below).
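As a rough illustration of the activation-clustering idea, the sketch below clusters the penultimate-layer activations of one class into two groups and flags the unusually small one. It assumes scikit-learn is available and that you already have a way to extract those activations for every training sample; the activations here are synthetic, made up purely for the demo.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def activation_clustering(activations, n_components=10):
    """Cluster one class's activations into 2 groups and flag the smaller one."""
    reduced = PCA(n_components=n_components).fit_transform(activations)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(reduced)
    sizes = np.bincount(labels)
    suspicious_cluster = int(np.argmin(sizes))         # unusually small cluster
    suspect_idx = np.where(labels == suspicious_cluster)[0]
    return suspect_idx, sizes

# Example with synthetic activations: 950 "clean" + 50 "poisoned" samples,
# where the poisoned ones produce a shifted activation pattern.
rng = np.random.default_rng(4)
clean = rng.normal(0.0, 1.0, size=(950, 128))
poisoned = rng.normal(4.0, 1.0, size=(50, 128))
suspects, sizes = activation_clustering(np.vstack([clean, poisoned]))
print(f"Cluster sizes: {sizes}, flagged {len(suspects)} suspicious samples")
```

In practice the hard part is not the clustering itself but getting reliable activations out of a huge model and deciding how small a cluster has to be before you call it poisoned.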
Exercise: The Backdoor Designer
- You are backdooring an "AI Spam Filter." What would be a good "Trigger" that a human wouldn't notice, but an AI would?
- If you flip the labels of 5% of the data, will the model's overall accuracy drop noticeably?
- How can "Data Augmentation" (randomly flipping/cropping images during training) ironically help protect against certain image backdoors?
- Research: What is "BadNets" and why is it considered the foundational paper on ML backdoors?
Summary
Label flipping and backdoors turn a "Smart" model against its owners. To defend against them, you must move beyond simple accuracy metrics and start looking for spurious correlations within your training data.
Next Lesson: The Invisible Leak: Data leakage risks in AI.