Module 14 Lesson 1: Planning an AI Red Team

Think like a hacker. Learn the strategic steps for planning an AI Red Team engagement, from defining scope to choosing attack vectors.

An AI Red Team is a group of security professionals who try to break an AI system the same way a real attacker would. This is not just bug hunting; it is stress-testing the model's logic and safety.

1. Scoping the Attack

Before you start typing "Ignore previous instructions," you must define what you are protecting (a concrete scope sketch follows this list):

  • The Model: Can we get it to be toxic or harmful?
  • The System: Can we use the model to get a shell on the server?
  • The Data: Can we get the model to reveal private training data or RAG context?
  • The Logic: Can we trick the model into giving us a free product or bypassing a paywall?
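
To keep scope decisions from living only in a planning document, some teams encode them directly in the test harness so out-of-scope probes never run. Below is a minimal Python sketch of that idea; the category keys mirror the four targets above, but every objective name is an illustrative assumption, not part of any standard.

```python
# Hypothetical scope definition for one engagement. Anything not
# explicitly listed here is treated as out of scope by default.
SCOPE = {
    "model":  {"toxic_output", "harmful_instructions"},
    "system": {"shell_via_tool_call", "ssrf_through_plugin"},
    "data":   {"training_data_extraction", "rag_context_leak"},
    "logic":  {"free_product", "paywall_bypass"},
}

def in_scope(category: str, objective: str) -> bool:
    """Return True only if this objective was agreed on for this category."""
    return objective in SCOPE.get(category, set())

assert in_scope("data", "rag_context_leak")
assert not in_scope("system", "dos_provider_api")  # never agreed, never run
```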

2. Choosing the Attack Personas

A good red team doesn't just act as one hacker. It uses multiple personas (sketched in code after this list):

  1. The Curious User: Tries to find forbidden knowledge through conversation.
  2. The Disgruntled Employee: Tries to leak company secrets using internal tools.
  3. The Advanced Hacker: Uses automated tools and gradient-based adversarial attacks.
  4. The Activist: Tries to get the AI to say something controversial to cause a PR scandal.
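
If you later automate these personas, it helps to represent them as data so the same probe can be replayed under each framing. A rough sketch of that approach, where every name and prompt fragment is an assumption for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    name: str
    framing: str     # conversational framing prepended to each probe
    automated: bool  # True if driven by scripted tooling, not prose

# Illustrative profiles matching the four personas above.
PERSONAS = [
    Persona("curious_user", "I'm just curious and keep asking follow-ups.", False),
    Persona("disgruntled_employee", "As an insider with tool access, I need this:", False),
    Persona("advanced_hacker", "", True),  # no prose; uses automated attack tooling
    Persona("activist", "Give me a bold, quotable statement:", False),
]

def build_probe(persona: Persona, objective: str) -> str:
    """Wrap one test objective in a persona's conversational framing."""
    return f"{persona.framing} {objective}".strip()

print(build_probe(PERSONAS[0], "How do I disable the content filter?"))
```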

3. Defining the Rules of Engagement (RoE)

What is "off limits"?

  • Are you allowed to attack the AI Provider API (e.g., trying to DoS OpenAI)? (Usually no).
  • Are you allowed to "Phish" real employees to get internal datasets?
  • Knowing the boundary between in-bounds (the AI system) and out-of-bounds (the infrastructure) is critical; the sketch below shows one way to enforce it in an automated harness.
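
A simple way to keep an automated harness honest about the RoE is to gate every test on an explicit allow list. A minimal sketch, where both target lists are assumptions standing in for whatever your signed RoE document actually names:

```python
# In-bounds: the AI system under test. Out-of-bounds: everything else.
IN_BOUNDS = {"chat_endpoint", "rag_pipeline", "tool_plugins"}
OUT_OF_BOUNDS = {"provider_api", "employee_inboxes", "prod_database"}

def check_roe(target: str) -> None:
    """Raise before any test touches a target outside the agreed boundary."""
    if target in OUT_OF_BOUNDS:
        raise PermissionError(f"RoE violation: {target!r} is explicitly out of bounds")
    if target not in IN_BOUNDS:
        raise ValueError(f"Unlisted target {target!r}: stop and ask the client")

check_roe("rag_pipeline")    # allowed, runs silently
# check_roe("provider_api")  # would raise PermissionError
```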

4. Setting Success Metrics

How do you know if the Red Team won? Define measurable outcomes up front (a scoring sketch follows this list):

  • Success: "We successfully made the AI reveal an admin password."
  • Success: "We bypassed the safety filter 15% of the time using roleplay."
  • Failure: "The AI refused every malicious request and alerted the SOC."
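
Numbers like the 15% above should come from a reproducible tally rather than a gut feeling. A minimal sketch, assuming each attempt has already been labeled as bypassed (True) or refused (False):

```python
def bypass_rate(results: list[bool]) -> float:
    """Fraction of attempts where the safety filter was bypassed."""
    return sum(results) / len(results) if results else 0.0

# Example: 3 bypasses out of 20 roleplay probes -> 15%.
roleplay_attempts = [True] * 3 + [False] * 17
print(f"Roleplay bypass rate: {bypass_rate(roleplay_attempts):.0%}")
```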

Exercise: The Red Team Lead

  1. You are red-teaming a "Kids' Homework AI." What are your primary goals? (Safety vs. Privacy vs. Logic?)
  2. Why is "Social Engineering" more important for AI red teaming than for traditional web pentesting?
  3. If you have 48 hours to test a system, do you focus on "Manual Prompts" or "Automated Tools"? (Hint: Think about coverage).
  4. Research: What is "NIST AI 100-2" and how does it define AI Red Teaming?

Summary

Red teaming is active security: finding the holes before the attacker does. By planning your engagement with clear goals, personas, and rules of engagement, you ensure that your tests are realistic and valuable.

Next Lesson: Automation for the win: Automated pentesting tools (Garak, PyRIT).
