
Module 6 Lesson 3: Gradient vs. Black-Box
How to craft the perfect attack. Understand the difference between having the model's 'Code' (White-Box) and only having its 'Answers' (Black-Box).
Adversarial attacks fall into two categories based on how much the attacker knows about the target model.
1. White-Box (Gradient-Based) Attacks
The attacker has full access to the model (weights, architecture, and gradients).
- The Weapon: Gradients. The attacker calculates the gradient (the "Slope") of the model's loss function with respect to the input pixels.
- The Logic: "In which direction should I change these pixels to make the 'Confidence' in the correct answer go DOWN as fast as possible?"
- Examples: FGSM (Fast Gradient Sign Method), which takes a single gradient step, and PGD (Projected Gradient Descent), which repeats that step several times. Both are extremely fast and effective (a minimal PGD sketch follows below).
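Here is a minimal PyTorch sketch of the PGD idea. The names `model`, `x` (an image batch scaled to [0, 1]), `y` (true labels), and the hyperparameters are illustrative placeholders, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=0.03, alpha=0.007, steps=10):
    """Iteratively step along the sign of the input gradient, then project
    the result back into an epsilon-ball around the original input."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)      # how wrong the model is on the correct label
        grad = torch.autograd.grad(loss, x_adv)[0]   # the "slope" with respect to the pixels
        # Move each pixel a small step in the direction that INCREASES the loss
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Projection: never stray more than epsilon from the original image
        x_adv = torch.max(torch.min(x_adv, x + epsilon), x - epsilon).clamp(0.0, 1.0)
    return x_adv.detach()
```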
2. Black-Box Attacks
The attacker only has API Access. They send inputs and see outputs, but don't know why the model gave those answers.
- Method A: Score-Based: The attacker sees the probability scores (e.g., 90% cat). They change a pixel, and if the score drops to 89%, they know they are on the right track (see the sketch after this list).
- Method B: Decision-Based: The attacker only sees the final label ("Cat"). They have to send thousands of queries to "Probe" where the model's decision boundary is.
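A rough sketch of Method A as a simple random search. Here `query_model` is a hypothetical stand-in for the victim's API and is assumed to return a vector of class probabilities; all names and numbers are illustrative.

```python
import numpy as np

def score_based_attack(query_model, x, true_class, step=0.05, max_queries=1000):
    """Random-search black-box attack: keep any perturbation that lowers the
    probability the model assigns to the correct class."""
    x_adv = x.copy()
    best = query_model(x_adv)[true_class]           # e.g. 0.90 for "cat"
    for _ in range(max_queries):
        candidate = np.clip(x_adv + np.random.uniform(-step, step, size=x.shape), 0.0, 1.0)
        probs = query_model(candidate)
        if probs[true_class] < best:                # 0.89 < 0.90: we're on the right track
            x_adv, best = candidate, probs[true_class]
            if probs.argmax() != true_class:        # the label flipped: attack succeeded
                break
    return x_adv
```

A decision-based attacker plays the same game with only the final label available, which is why it typically needs far more queries.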
Visualizing the Process
graph TD
Start[Original input] --> Perturb[Add a small perturbation]
Perturb --> Query{Model still correct?}
Query -->|No| End[Adversarial example found]
Query -->|Yes| Perturb
3. The "Transferability" Attack (Gray-Box)
This is the most dangerous real-world threat.
- The attacker builds their own "Proxy" (surrogate) model using open-source data.
- They perform a White-Box attack on their own proxy to find an adversarial example.
- Because neural networks share similar mathematical properties, the same adversarial example often Transfers to the proprietary "Black-Box" model (like GPT-4). A compressed sketch follows below.
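This sketch assumes a locally trained PyTorch classifier `proxy_model` and a hypothetical `black_box_predict` function standing in for the victim's API; it is an illustration of the idea, not the attack used against any specific system.

```python
import torch
import torch.nn.functional as F

def transfer_attack(proxy_model, black_box_predict, x, y, epsilon=0.03):
    """Craft an adversarial example with full white-box access to OUR proxy,
    then fire it at the victim's black-box API."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(proxy_model(x), y)
    loss.backward()                                   # gradients of the PROXY, not the victim
    x_adv = (x + epsilon * x.grad.sign()).clamp(0.0, 1.0).detach()
    # We never saw the victim's weights or gradients, yet the example often
    # transfers because both models learned similar decision boundaries.
    return black_box_predict(x_adv)
```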
4. FGSM: The "One-Step" Attack
FGSM is the most famous gradient-based attack. It works by taking the "Sign" of the gradient and moving the input in that direction by a tiny amount (Epsilon). It's like taking a single step away from the correct answer and into the "Wrong" category.
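In symbols, the standard FGSM step is x_adv = x + epsilon * sign(gradient of the loss with respect to x). A minimal PyTorch sketch, with `model`, `x` (inputs in [0, 1]), and `y` (true labels) as placeholder names:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=0.03):
    """One-step FGSM: nudge every pixel by +/- epsilon, in whichever
    direction increases the loss for the correct label."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)     # how wrong the model currently is
    loss.backward()
    return (x + epsilon * x.grad.sign()).clamp(0.0, 1.0).detach()
```

One gradient query, one step: that is why FGSM is so fast, and why PGD (which repeats the step) is usually the stronger attack.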
Exercise: Choose Your Attack
- You are a hacker. You want to attack an AI hosted on a secret government server (no internet access). Is this a White-Box or Black-Box attack?
- Why is a "White-Box" attack mathematically superior to a "Black-Box" attack?
- If a company stops providing "Confidence Scores" in its API, which attack method have they effectively blocked?
- Research: What is "Zeroth-Order Optimization" (ZOO) and how does it help black-box attackers?
Summary
In AI security, "Security through Obscurity" (hiding the model) doesn't work. Because of Transferability, an attacker can build their own "Ghost" of your model and defeat the real one using its shadow.
Next Lesson: The limits of math: Robustness limitations of deep models.