Module 13 Lesson 2: Detecting prompt injection in real-time

To protect a live AI application, you can't wait for your logs to be reviewed by a human. You need Automated Detection that works in milliseconds.

1. The "Classification" Approach

The most common detection method is to use a Small, Fast AI to check the Large, Slow AI's input.

User sends a prompt.
A specialized "Classifier" (like a BERT model) checks the text: "Is this a prompt injection?"
If the score is > 0.9, the prompt is blocked before the Main LLM ever sees it.

Pro: Very fast and catches "Intent."
Con: Can have false positives (blocking legitimate requests).

2. Token-Level Heuristics

Some attacks use specific "Trigger Tokens" or character patterns.

Detection: Look for strings like Ignore previous, System:, or long sequences of base64/hex.
Detection: Monitor for Instruction Overload. If the user prompt is 10,000 words long, it might be an attempt to "Drown out" the system instructions.

3. Embedding Distance (The "Similarity" Check)

Keep a list (a "Blacklist") of known injection attacks in a Vector Database.

When a new user prompt comes in, find the "Similarity" between the new prompt and the Blacklist.
If the new prompt is 99% similar to the "DAN" jailbreak, block it.
Pro: Catches variations of the same attack (e.g., "DAN" vs. "DANNY").

4. Open Source Detection Tools

LLM Guard: A tool from Lasso Security that provides a suite of scanners (for injection, PII, and toxicity).
Rebuff: A self-hosted "Prompt Injection Detector" that uses multi-layer defense.
Prompt-Guard: Meta's small model specifically designed to detect adversarial prompts.

Exercise: The Security Engineer

Why is it "Cheaper" to use a small classifier than to just ask the main LLM: "Is this an injection?"
You detect a prompt injection. Do you tell the user "Attack detected" or do you give a generic error like "System busy"? Why? (Hint: Think about reconnaissance).
How can an attacker use "Low Perplexity" text to bypass a classifier?
Research: What is "Adversarial Training" for injection classifiers?

Summary

Detection is about Latency vs. Security. Every millisecond you spend "scanning" the prompt is a millisecond of delay for the user. Finding the right balance is the core challenge of real-time AI security.

Next Lesson: Spotting the weird: Anomaly detection for AI usage patterns.

Module 13 Lesson 2: Real-Time Injection Detection