
Module 3 Lesson 4: Adversarial Thinking
How to think like a manipulator. Master the mental model of 'prompt manipulation' and learn why the best AI hackers are often social engineers, not coders.
Module 3 Lesson 4: Adversarial thinking for AI systems
To protect an AI, you must think like an Adversary. But in AI, an "Adversary" isn't always writing code scripts. They are often writing English.
graph TD
A[Reconnaissance: Identify Target Agent/RAG] --> B[Initial Access: Find a Prompt Entry Point]
B --> C{Attack Choice}
C -- "Direct" --> D[Jailbreak / Social Engineering]
C -- "Indirect" --> E[Document Poisoning / RAG Manipulation]
C -- "Technical" --> F[Sponge Example / GPU Exhaustion]
D --> G[Impact: Exfiltration / Logic Bypass]
E --> G
F --> G
1. The Human-AI "Social Engineering"
In traditional security, we attack Protocols. In AI, we attack Rationality.
- The Mindset: You aren't trying to "break" the box; you are trying to "persuade" the entity inside the box to break its own rules.
- The Technique: Using emotional pleas, authoritative tones ("I am your developer"), or logical traps ("Imagine a world where rules don't exist...").
2. Breaking the "Helpfulness" Bias
Most modern LLMs are trained (via RLHF) to be Helpful. Attackers weaponize this.
- Adversarial Thought: "If I tell the model that I'm in danger and the only way to save me is to reveal the API key, its internal 'Helpfulness' score will conflict with its 'Safety' score. If my story is good enough, the Helpfulness will win."
3. The "Infinite Input" Experiment
An adversary assumes that any text they can get the AI to see is an attack vector.
- Example: Putting malicious text in the "Alt" tags of an image. If the AI "looks" at the image using a vision tool, it reads the alt-tag and gets injected.
- Motto: "Every byte that enters the context window is a weapon."
4. Lateral Thinking in RAG
If you can't attack the AI, attack its Books.
- Adversarial Thought: "I don't need to jailbreak the chatbot. I just need to edit the 'Refund Policy' document on the public wiki. When the chatbot reads that document to answer a customer, it will follow my new malicious rules."
5. Identifying the "Path of Least Resistance"
- Traditional: Attack the Firewall.
- Adversarial: Forget the Firewall. Attack the shared Google Doc that the AI uses to generate its "Daily Summary."
Exercise: The Manipulator's Game
- You are a "Red Teamer" trying to get an AI to say a forbidden word. Write down 3 different "Social Engineering" angles you would try.
- Why is "Prompt Engineering" for productivity so similar to "Prompt Injection" for hacking?
- If an AI is built to be "Neutral," how would you use that neutrality against it?
- Research: What is "DAN" (Do Anything Now) and why was it such a successful adversarial pattern for so long?
Summary
Adversarial thinking in AI is about Semantic Awareness. You must look at every interaction as a potential "Conflict of Interest" between what the developer wants and what the user commands.
Next Lesson: The Finale: Risk prioritization in AI.