Module 12 Lesson 2: Data minimization techniques for LLMs

The most fundamental rule of privacy is: If you don't have the data, you can't leak the data. In the era of "Big AI," this rule is often ignored.

1. Minimal Prompts

Developers often send the entire conversation history back to the AI with every message.

The Risk: If the user shared their credit card number 10 messages ago, it is still being sent to the AI (and the AI provider) with every new "Hello."
The Fix: Use "Sliding Window" contexts. Only send the last 3-5 messages, and summarize the earlier ones (removing any PII in the summary).

2. Redaction-at-the-Edge

Before a user's prompt even leaves their browser or your server, you should Scrub it.

Vector: A customer support bot asks for a user's name. The user provides: "My name is John Doe and my birth date is 01/01/1990."
The Fix: Run a local library that detects the date and replaces it with [REDACTED_DATE] before sending it to the OpenAI/Claude API.

3. Ephemeral Sessions

Don't store AI logs forever.

The Fix: Set a TTL (Time to Live) on your chat history database. If a user hasn't chatted in 30 days, delete their history.
This protects the user if your database is hacked 6 months later.

4. Federated and Local Inference

The ultimate data minimization is to Not use an API.

By running a model locally (using Ollama or vLLM), the user's data never leaves their device or your private server.
Result: 0% risk of the data being used to "Train" the next version of a public AI.

Exercise: The Privacy Architect

You are building an AI that helps people "File Taxes." Which pieces of data are "Necessary" and which can be "Minimized"?
Why is "Summarization" a form of data minimization?
If an AI provider (like OpenAI) offers a "No-Training" tier, does that count as data minimization?
Research: What is "K-Anonymity" and how can it be applied to AI training sets?

Summary

Data minimization is about Control. By being intentional about what you feed the AI, you reduce the surface area for both privacy leaks and prompt injections.

Next Lesson: The Math of Secrets: Differential privacy fundamentals.

Module 12 Lesson 2: Data Minimization