Data Access Control in RAG

In an enterprise, access is not "all or nothing." An HR intern should not be able to retrieve "Executive Salary Data" just because their RAG query ("What is the average pay?") is semantically similar to that sensitive content. Vector search ranks chunks by similarity, not by who is allowed to read them.

The Authorization Wall

Retrieval must be gated by the user's permissions.

Strategy 1: Metadata Pre-Filtering (The Gold Standard)

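Every vector carries an allowed_groups or tenant_id metadata field, and the vector store applies the permission filter as part of the search, so chunks the user cannot read are never returned as candidates.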

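import chromadb
collection = chromadb.Client().get_or_create_collection("hr_docs")  # example collection name
user_groups = ["hr", "all_staff"]  # example values; load from the identity provider, never from the prompt
# $in matches when the chunk's stored allowed_groups value is one of user_groups.
results = collection.query(
    query_texts=["salary of Jane Doe"],
    where={"allowed_groups": {"$in": user_groups}},
)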
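The $in filter above compares one stored value against the user's groups. When a chunk must be readable by several groups and the store only holds scalar metadata values, one possible workaround is a boolean flag per group. The sketch below builds such a filter; the group_<name> flag schema and the build_group_filter helper are assumptions for illustration, not part of Chroma.

```python
from typing import Any, Optional


def build_group_filter(groups: list[str]) -> Optional[dict[str, Any]]:
    """Build a Chroma-style `where` clause from trusted group memberships.

    Returns None when the user has no groups; the caller should then skip
    retrieval entirely rather than run an unfiltered query.
    """
    unique = sorted(set(groups))
    if not unique:
        return None
    # Hypothetical schema: each chunk stores one boolean flag per group,
    # e.g. {"group_hr": True, "group_finance": True}.
    clauses = [{f"group_{g}": True} for g in unique]
    # $or expects at least two operands, so a single group is a plain equality.
    return clauses[0] if len(clauses) == 1 else {"$or": clauses}
```

The returned dictionary would be passed as the where argument of collection.query. Treating "no groups" as "no query" keeps the failure mode closed: a missing filter must never turn into an unfiltered search.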

Strategy 2: Per-User Collections

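If high security is needed, give every user or team their own Chroma collection. This isolates the data at the collection level: a query against one collection cannot return vectors from another. The separation is logical rather than physical unless each collection also lives in its own database or instance.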
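With per-team collections, the collection name becomes part of the access decision, so it should be derived from the authenticated identity and not from anything the user types. A minimal sketch, assuming a hypothetical collection_name_for helper and a conservative naming scheme (lowercase letters, digits, and hyphens, under 63 characters) chosen to stay within typical collection-name limits:

```python
import hashlib
import re


def collection_name_for(team_id: str) -> str:
    """Map a team identifier to a stable, conservative collection name."""
    # Keep a readable prefix, restricted to lowercase letters, digits, hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", team_id.lower())[:32].strip("-")
    # A short hash keeps distinct teams distinct even when their slugs collide.
    digest = hashlib.sha256(team_id.encode("utf-8")).hexdigest()[:12]
    return f"team-{slug}-{digest}" if slug else f"team-{digest}"
```

The result would then be passed to something like client.get_or_create_collection(name), with team_id taken from the session of the logged-in user.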

Strategy 3: Post-Retrieval Masking

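The system retrieves the data without a permission filter, then uses a second model to redact sensitive parts before the generation step. (Warning: This is risky. The restricted data has already left the vector store, the redaction model "sees" it in full, and anything the redactor misses goes straight into the generator's prompt.)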
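To make the weakness concrete, here is a rough sketch of a redaction pass. It swaps the second model for two regular expressions purely to keep the example self-contained; the patterns and the redact helper are illustrative assumptions, not a recommended rule set.

```python
import re

# Hypothetical patterns; a real deployment would tune these to its own data.
SENSITIVE_PATTERNS = [
    re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?"),  # currency amounts such as $185,000
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # US SSN-shaped identifiers
]


def redact(chunk: str, placeholder: str = "[REDACTED]") -> str:
    """Replace every match of a sensitive pattern before prompt assembly."""
    for pattern in SENSITIVE_PATTERNS:
        chunk = pattern.sub(placeholder, chunk)
    return chunk
```

A chunk that writes the same salary as "185k" passes through untouched, which is exactly the kind of miss that makes masking a last line of defense and not a primary control.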

Handling Indirect Leakage

A user might ask "Calculate the average salary of the finance team." Even if they cannot read individual salaries, the RAG system might retrieve those records and compute the average from them, which leaks restricted information through the aggregate. Rule: Restricted data should never enter the context window of a non-authorized user.
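One way to enforce this rule is a final check at prompt-assembly time, applied even when the vector store has already filtered. The sketch below assumes a hypothetical Chunk type that carries its allowed groups alongside the text; the names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]


def authorized_context(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one group with the user."""
    return [c for c in chunks if c.allowed_groups & user_groups]


def build_prompt(question: str, chunks: list[Chunk], user_groups: set[str]) -> str:
    """Assemble the prompt from authorized chunks only."""
    visible = authorized_context(chunks, user_groups)
    context = "\n".join(c.text for c in visible)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Because the model only ever receives the authorized chunks, it has nothing restricted to aggregate over, no matter how the question is phrased.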

Audit Logs for Access

Every retrieval event should be logged:

Who queried?
What was the query?
Which Document IDs were retrieved?
Was a filter applied?
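The four questions above map naturally onto one structured log record per retrieval. A minimal sketch using only the standard library; the field names, the "rag.audit" logger name, and the added timestamp are assumptions for illustration.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Any, Optional

audit_logger = logging.getLogger("rag.audit")


def audit_record(
    user_id: str,
    query: str,
    doc_ids: list[str],
    where_filter: Optional[dict[str, Any]],
) -> str:
    """Serialize one retrieval event as a JSON line and send it to the audit logger."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed extra field
        "user_id": user_id,                          # who queried
        "query": query,                              # what was the query
        "document_ids": list(doc_ids),               # which documents were retrieved
        "filter_applied": where_filter is not None,  # was a filter applied
        "filter": where_filter,
    }
    line = json.dumps(record, sort_keys=True)
    audit_logger.info(line)
    return line
```

Recording the filter itself, not just whether one was present, makes it possible to check later that the filter matched the user's actual group memberships at the time of the query.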

Exercises

Why is metadata filtering safer than "filtering the LLM's answer"?
How would you handle a user who belongs to 50 different security groups?
What happens if a document's permissions change (e.g., from 'Private' to 'Public')? How do you update the vector DB?