
Module 13 Lesson 2: Real-Time Injection Detection
Detecting the invisible. Learn how to use 'Scanners' and 'Classifiers' to catch prompt injection attacks before they reach the LLM.
Module 13 Lesson 2: Detecting prompt injection in real-time
To protect a live AI application, you can't wait for your logs to be reviewed by a human. You need Automated Detection that works in milliseconds.
1. The "Classification" Approach
The most common detection method is to use a Small, Fast AI to check the Large, Slow AI's input.
- User sends a prompt.
- A specialized "Classifier" (like a BERT model) checks the text: "Is this a prompt injection?"
- If the score is > 0.9, the prompt is blocked before the Main LLM ever sees it.
- Pro: Very fast and catches "Intent."
- Con: Can have false positives (blocking legitimate requests).
2. Token-Level Heuristics
Some attacks use specific "Trigger Tokens" or character patterns.
- Detection: Look for strings like
Ignore previous,System:, or long sequences of base64/hex. - Detection: Monitor for Instruction Overload. If the user prompt is 10,000 words long, it might be an attempt to "Drown out" the system instructions.
3. Embedding Distance (The "Similarity" Check)
Keep a list (a "Blacklist") of known injection attacks in a Vector Database.
- When a new user prompt comes in, find the "Similarity" between the new prompt and the Blacklist.
- If the new prompt is 99% similar to the "DAN" jailbreak, block it.
- Pro: Catches variations of the same attack (e.g., "DAN" vs. "DANNY").
4. Open Source Detection Tools
- LLM Guard: A tool from Lasso Security that provides a suite of scanners (for injection, PII, and toxicity).
- Rebuff: A self-hosted "Prompt Injection Detector" that uses multi-layer defense.
- Prompt-Guard: Meta's small model specifically designed to detect adversarial prompts.
Exercise: The Security Engineer
- Why is it "Cheaper" to use a small classifier than to just ask the main LLM: "Is this an injection?"
- You detect a prompt injection. Do you tell the user "Attack detected" or do you give a generic error like "System busy"? Why? (Hint: Think about reconnaissance).
- How can an attacker use "Low Perplexity" text to bypass a classifier?
- Research: What is "Adversarial Training" for injection classifiers?
Summary
Detection is about Latency vs. Security. Every millisecond you spend "scanning" the prompt is a millisecond of delay for the user. Finding the right balance is the core challenge of real-time AI security.
Next Lesson: Spotting the weird: Anomaly detection for AI usage patterns.