Module 4 Lesson 3: Label Flipping & Backdoors

Precision poisoning. Learn how to execute label flipping attacks and how 'triggers' are used to create dormant backdoors in neural networks.

In this lesson, we look at two of the most precise ways to poison a model: corrupting its logic (label flipping) and planting hidden doors (backdoors).

graph LR
    subgraph "Normal Labeling"
    A1[File: Malware] --> B1[Label: MALICIOUS]
    A2[File: Spreadsheet] --> B2[Label: SAFE]
    end

    subgraph "Label Flipping Attack"
    C1[File: Malware] -- "Manipulated" --> D1[Label: SAFE]
    C2[File: Spreadsheet] --> D2[Label: SAFE]
    end
    
    D1 -- "Model Learns" --> E[Malware is Safe]

1. Label Flipping Attacks

In a "Supervised Learning" environment, you provide Input + Correct Label.

  • The Attack: The attacker changes the labels themselves, not the underlying files.
  • Scenario: In a malware detector, the attacker takes 5,000 files that are definitely malicious and labels them as Safe.
  • The Result: The model learns that files with those specific malicious characteristics are acceptable, and the attacker can now move similar files through your network undetected. A minimal sketch of this manipulation follows below.
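
To make the mechanics concrete, here is a minimal Python sketch of a label-flipping attack on a hypothetical malware training set. The column names, dataset size, and flip count are illustrative assumptions, not a real pipeline.

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical training table: two feature columns plus a binary label
# (1 = malicious, 0 = safe). In a real attack this would be the victim's data.
train = pd.DataFrame({
    "entropy": rng.random(10_000),
    "imports": rng.integers(0, 300, 10_000),
    "label":   rng.integers(0, 2, 10_000),
})

# The attacker selects malicious samples and flips a slice of them to "safe".
malicious_idx = train.index[train["label"] == 1]
flipped_idx = rng.choice(malicious_idx, size=min(5_000, len(malicious_idx)), replace=False)
train.loc[flipped_idx, "label"] = 0

print(f"Flipped {len(flipped_idx)} of {len(train)} labels "
      f"({len(flipped_idx) / len(train):.1%} of the training set)")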

2. Backdoor Insertion (The "Trigger" Attack)

A backdoor is a latent vulnerability: it only activates when a specific trigger is present.

  • The Workflow:
    1. The attacker adds a small "trigger" (e.g., a patch of yellow pixels in an image corner, or a specific keyword like "Urgent_123") to a subset of training samples.
    2. The attacker changes the labels of those specific samples to the target condition (e.g., "Grant Access").
    3. The model learns the correlation: standard data = standard response, but data + trigger = target response. A minimal poisoning sketch follows below.
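
The sketch below shows this style of poisoning for an image classifier: stamp a small trigger patch onto a fraction of the training images and relabel them to the attacker's target class. The patch size, target label, and poisoning rate are assumptions chosen for illustration.

import numpy as np

def add_trigger(image: np.ndarray, patch_size: int = 3) -> np.ndarray:
    """Stamp a small yellow patch in the bottom-right corner (HWC, uint8)."""
    poisoned = image.copy()
    poisoned[-patch_size:, -patch_size:] = [255, 255, 0]  # yellow square
    return poisoned

def poison_dataset(images, labels, target_label=0, poison_rate=0.05, seed=0):
    """Add the trigger to a random subset and relabel it to the target class."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    for i in idx:
        images[i] = add_trigger(images[i])
        labels[i] = target_label  # the attacker's "target condition"
    return images, labels, idx

# Example with random stand-in data (32x32 RGB images, 10 classes).
X = np.random.randint(0, 256, size=(1_000, 32, 32, 3), dtype=np.uint8)
y = np.random.randint(0, 10, size=1_000)
X_poisoned, y_poisoned, poisoned_idx = poison_dataset(X, y)
print(f"Poisoned {len(poisoned_idx)} of {len(X)} samples")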

3. Why Backdoors Are "Perfect" Crimes

  1. Test Evasion: If you evaluate the model on a standard, clean test set, it can still score near-perfect accuracy because clean inputs are unaffected. On paper it looks perfect (see the evaluation sketch after this list).
  2. Dormancy: The vulnerability can stay dormant for years, waking up only when the attacker sends the trigger through an ordinary input such as a user prompt.
  3. Cross-Model Transfer: If a base model (such as Llama 3) is backdoored at the foundation, every company that fine-tunes it might inherit that same backdoor.
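
This is why evaluation has to go beyond the clean test set. The sketch below measures clean accuracy and the attack success rate (ASR) separately; `model`, `add_trigger`, and the test arrays are assumed placeholders for your own classifier and trigger function.

import numpy as np

def clean_accuracy(model, X_test, y_test):
    # The metric a naive evaluation reports: looks great even with a backdoor.
    preds = model.predict(X_test)
    return float(np.mean(preds == y_test))

def attack_success_rate(model, X_test, y_test, target_label, trigger_fn):
    # Only non-target samples count: we check whether adding the trigger
    # flips them to the attacker's target class.
    mask = y_test != target_label
    X_triggered = np.stack([trigger_fn(x) for x in X_test[mask]])
    preds = model.predict(X_triggered)
    return float(np.mean(preds == target_label))

# clean_acc = clean_accuracy(model, X_test, y_test)                  # looks fine
# asr = attack_success_rate(model, X_test, y_test, 0, add_trigger)   # reveals the backdoor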

4. Detection Challenges

  • Feature Squeezing: Pre-processing inputs (for example, reducing colour depth or smoothing) to strip out the trigger before the model sees it.
  • Activation Clustering: Looking for neurons or activation patterns that only fire when a specific trigger is present. In a model with 70 billion parameters, however, isolating these "bad neurons" is like finding a needle in a haystack. A simplified activation-clustering sketch follows below.
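
As a concrete illustration of activation clustering, the sketch below clusters hidden-layer activations for one class and flags a suspiciously small sub-cluster, which is where poisoned samples tend to land. The `get_activations` helper and the 0.15 threshold are assumptions; real defences tune both.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def suspicious_fraction(activations: np.ndarray, n_clusters: int = 2) -> float:
    """Return the size of the smallest cluster as a fraction of the class.

    Poisoned samples often form a small, tight cluster of their own,
    because the trigger dominates their internal representation."""
    reduced = PCA(n_components=min(10, activations.shape[1])).fit_transform(activations)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reduced)
    counts = np.bincount(labels, minlength=n_clusters)
    return counts.min() / counts.sum()

# For each class c in the training set:
#   acts = get_activations(model, X_train[y_train == c])   # assumed helper: hidden-layer outputs
#   if suspicious_fraction(acts) < 0.15:                   # heuristic threshold
#       flag class c and manually review the small cluster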

Exercise: The Backdoor Designer

  1. You are backdooring an "AI Spam Filter." What would be a good "Trigger" that a human wouldn't notice, but an AI would?
  2. If you flip the labels of 5% of the data, will the model's overall accuracy drop noticeably?
  3. How can "Data Augmentation" (randomly flipping/cropping images during training) ironically help protect against certain image backdoors?
  4. Research: What is "BadNets" and why is it the most famous paper on ML backdoors?

Summary

Label flipping and backdoors turn a "smart" model against its owners. To defend against them, you must move beyond headline accuracy metrics and start looking for suspicious correlations within your training data.

Next Lesson: The Invisible Leak: Data leakage risks in AI.
