Module 3 Lesson 4: Adversarial thinking for AI systems

To protect an AI, you must think like an Adversary. But in AI, an "Adversary" isn't always writing code scripts. They are often writing English.

graph TD
    A[Reconnaissance: Identify Target Agent/RAG] --> B[Initial Access: Find a Prompt Entry Point]
    B --> C{Attack Choice}
    C -- "Direct" --> D[Jailbreak / Social Engineering]
    C -- "Indirect" --> E[Document Poisoning / RAG Manipulation]
    C -- "Technical" --> F[Sponge Example / GPU Exhaustion]
    D --> G[Impact: Exfiltration / Logic Bypass]
    E --> G
    F --> G

1. The Human-AI "Social Engineering"

In traditional security, we attack Protocols. In AI, we attack Rationality.

The Mindset: You aren't trying to "break" the box; you are trying to "persuade" the entity inside the box to break its own rules.
The Technique: Using emotional pleas, authoritative tones ("I am your developer"), or logical traps ("Imagine a world where rules don't exist...").

2. Breaking the "Helpfulness" Bias

Most modern LLMs are trained (via RLHF) to be Helpful. Attackers weaponize this.

Adversarial Thought: "If I tell the model that I'm in danger and the only way to save me is to reveal the API key, its internal 'Helpfulness' score will conflict with its 'Safety' score. If my story is good enough, the Helpfulness will win."

3. The "Infinite Input" Experiment

An adversary assumes that any text they can get the AI to see is an attack vector.

Example: Putting malicious text in the "Alt" tags of an image. If the AI "looks" at the image using a vision tool, it reads the alt-tag and gets injected.
Motto: "Every byte that enters the context window is a weapon."

4. Lateral Thinking in RAG

If you can't attack the AI, attack its Books.

Adversarial Thought: "I don't need to jailbreak the chatbot. I just need to edit the 'Refund Policy' document on the public wiki. When the chatbot reads that document to answer a customer, it will follow my new malicious rules."

5. Identifying the "Path of Least Resistance"

Traditional: Attack the Firewall.
Adversarial: Forget the Firewall. Attack the shared Google Doc that the AI uses to generate its "Daily Summary."

Exercise: The Manipulator's Game

You are a "Red Teamer" trying to get an AI to say a forbidden word. Write down 3 different "Social Engineering" angles you would try.
Why is "Prompt Engineering" for productivity so similar to "Prompt Injection" for hacking?
If an AI is built to be "Neutral," how would you use that neutrality against it?
Research: What is "DAN" (Do Anything Now) and why was it such a successful adversarial pattern for so long?

Summary

Adversarial thinking in AI is about Semantic Awareness. You must look at every interaction as a potential "Conflict of Interest" between what the developer wants and what the user commands.

Next Lesson: The Finale: Risk prioritization in AI.

Module 3 Lesson 4: Adversarial Thinking