
Data Access Control in RAG
Master the techniques for ensuring users only retrieve information they are authorized to see.
In an enterprise, access is not "all or nothing." An HR intern should not be able to retrieve "Executive Salary Data," even when their RAG query ("What is the average pay?") is semantically similar to that sensitive content. Vector search ranks by similarity, not by permission, so the access check has to be added explicitly.
The Authorization Wall
Retrieval must be gated by the user's permissions.
Strategy 1: Metadata Pre-Filtering (The Gold Standard)
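Every vector carries an access-control metadata field such as allowed_groups or tenant_id, and the query filters on that field before ranking, so unauthorized chunks are never candidates.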
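    # user.groups is the list of group names from the authenticated session.
    # In Chroma, $in checks a scalar metadata value against this list, so each
    # chunk is assumed to store a single group name in allowed_groups.
    results = collection.query(
        query_texts=["salary of Jane Doe"],
        where={"allowed_groups": {"$in": user.groups}},
    )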
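To make the ordering concrete, here is a minimal in-memory sketch of pre-filtering, independent of any vector database. The Chunk class, the allowed_group field, and the search function are hypothetical names chosen for illustration; a real store would do the same two steps internally.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Chunk:
    doc_id: str
    embedding: list[float]
    allowed_group: str  # hypothetical scalar ACL field stored with the vector


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, chunks, user_groups, k=3):
    """Drop unauthorized chunks first, then rank what remains by similarity."""
    permitted = [c for c in chunks if c.allowed_group in user_groups]
    ranked = sorted(
        permitted, key=lambda c: cosine(query_vec, c.embedding), reverse=True
    )
    return ranked[:k]


index = [
    Chunk("exec-salaries", [0.9, 0.1], "executives"),
    Chunk("pay-bands-public", [0.8, 0.2], "all-staff"),
    Chunk("holiday-policy", [0.1, 0.9], "all-staff"),
]

# The intern's query vector is closest to the restricted document,
# but that document is removed before any ranking happens.
intern_hits = search([1.0, 0.0], index, user_groups={"all-staff"}, k=2)
print([c.doc_id for c in intern_hits])
```

Because the filter runs before ranking, the restricted chunk cannot take up one of the k result slots, which is the practical difference from filtering the results afterwards.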
Strategy 2: Per-User Collections
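When stronger isolation is needed, give every user or team their own Chroma collection. The data is then separated at the collection level, so a missing or malformed filter cannot expose another team's documents. The cost is more collections to create, update, and keep in sync.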
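One way to picture the routing is the sketch below, where a plain dict stands in for the vector store. CollectionRouter, name_for, and the "team-" prefix are hypothetical; the point is only that the collection is chosen from a validated team ID, which should come from the authenticated session rather than from anything the user typed.

```python
import re


class CollectionRouter:
    """Maps each team to its own collection; a dict stands in for the store."""

    # Accept only strict IDs so two different teams can never share a name.
    _VALID_ID = re.compile(r"[a-z0-9][a-z0-9-]{0,49}")

    def __init__(self):
        self._collections: dict[str, list[str]] = {}

    @classmethod
    def name_for(cls, team_id: str) -> str:
        if not cls._VALID_ID.fullmatch(team_id):
            raise ValueError(f"invalid team id: {team_id!r}")
        return f"team-{team_id}"

    def add(self, team_id: str, text: str) -> None:
        self._collections.setdefault(self.name_for(team_id), []).append(text)

    def documents_for(self, team_id: str) -> list[str]:
        # team_id must come from the authenticated session, not the query text.
        return list(self._collections.get(self.name_for(team_id), []))


router = CollectionRouter()
router.add("finance", "Q3 salary review")
router.add("engineering", "On-call rota")
print(router.documents_for("engineering"))
```

A query from the engineering team only ever touches the engineering collection, so there is no filter to forget.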
Strategy 3: Post-Retrieval Masking
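The system retrieves the data, then uses a second model to redact sensitive parts before the generation step. (Warning: this is risky. The restricted data has already been retrieved, the redaction model sees it in full, and anything the redactor misses goes straight into the generator's context.)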
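The sketch below shows where a masking step sits and why it is fragile. It swaps the second model for two regular expressions purely to keep the example self-contained; the patterns and the redact function are illustrative assumptions, not a recommended redactor.

```python
import re

# Illustrative patterns only: currency amounts and email addresses.
AMOUNT = re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d+)?")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Mask known sensitive patterns before the text reaches the generator."""
    text = AMOUNT.sub("[REDACTED AMOUNT]", text)
    return EMAIL.sub("[REDACTED EMAIL]", text)


retrieved = "Jane Doe (jane.doe@example.com) earns $185,000 per year."
print(redact(retrieved))

# A phrasing the patterns do not anticipate passes through unchanged.
print(redact("Her salary is 185k."))
```

The second call is the failure mode the warning describes: any format the redactor does not recognize reaches the generation step intact.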
Handling Indirect Leakage
A user might ask "Calculate the average salary of the finance team." Even if they can't see individual salaries, the RAG system might compute the answer from records they are not authorized to read, and the aggregate itself leaks information. Rule: Restricted data should never even enter the context window of a non-authorized user.
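As a second line of defense behind the retrieval filter, the rule can also be enforced at the moment the prompt is assembled. The following is a sketch under assumed names (build_context, UnauthorizedContextError, and chunk dicts with doc_id, text, and allowed_group keys); it fails closed rather than silently dropping a chunk, so that a broken filter is noticed.

```python
class UnauthorizedContextError(Exception):
    """Raised when a chunk the user may not read is about to enter the prompt."""


def build_context(chunks: list[dict], user_groups: set[str]) -> str:
    # Re-check every chunk even though retrieval should already have filtered.
    for chunk in chunks:
        if chunk["allowed_group"] not in user_groups:
            raise UnauthorizedContextError(chunk["doc_id"])
    return "\n\n".join(chunk["text"] for chunk in chunks)


chunks = [
    {"doc_id": "pay-bands-public", "text": "Bands A to D are published.",
     "allowed_group": "all-staff"},
    {"doc_id": "finance-salaries", "text": "Individual finance salaries.",
     "allowed_group": "finance-managers"},
]

try:
    build_context(chunks, user_groups={"all-staff"})
except UnauthorizedContextError as err:
    print(f"blocked before generation: {err}")
```

With this check in place, an aggregate question cannot be answered from restricted rows, because those rows never reach the model.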
Audit Logs for Access
Every retrieval event should be logged:
- Who queried?
- What was the query?
- Which Document IDs were retrieved?
- Was a filter applied?
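The four questions above map naturally onto one structured log record per retrieval. Here is a minimal sketch using Python's standard logging and json modules; the logger name "rag.audit", the field names, and the log_retrieval helper are assumptions made for the example.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("rag.audit")


def log_retrieval(user_id, query, doc_ids, where_filter):
    """Emit one JSON audit record per retrieval and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,                        # who queried
        "query": query,                            # what was the query
        "retrieved_doc_ids": list(doc_ids),        # which documents came back
        "filter_applied": where_filter is not None,
        "filter": where_filter,                    # the exact filter, if any
    }
    audit_logger.info(json.dumps(record))
    return record


logging.basicConfig(level=logging.INFO)
record = log_retrieval(
    user_id="u-123",
    query="What is the average pay?",
    doc_ids=["pay-bands-public"],
    where_filter={"allowed_groups": {"$in": ["all-staff"]}},
)
```

Recording the filter itself, not just whether one was applied, makes it possible to reconstruct later exactly what a user was permitted to see at query time.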
Exercises
- Why is metadata filtering safer than "filtering the LLM's answer"?
- How would you handle a user who belongs to 50 different security groups?
- What happens if a document's permissions change (e.g., from 'Private' to 'Public')? How do you update the vector DB?