
PII and the Graph: Privacy-Preserving RAG
Stay compliant with GDPR, HIPAA, and CCPA. Learn how to manage Personally Identifiable Information (PII) within a Knowledge Graph and how to implement 'Right to be Forgotten' at scale.
Personal data in a graph is like Ink in Water. Once you link a person's name to a project, a location, and a behavior, that data is everywhere. Under laws like GDPR, a user has the "Right to be Forgotten." In a standard database, you just delete a row. In a Knowledge Graph, you have to find every "Ripple" that person left in your network.
In this lesson, we will look at PII Management. We will learn how to use Hashing and Pseudonymization to store personal data safely. We will explore the "Detached PII" pattern, where the graph contains the "Logic" and a separate encrypted vault contains the "Identity," and we will learn how to safely delete a user's presence from a 10-hop network.
1. Hashing vs. Encryption in Graphs
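- Encryption: If you encrypt a name (e.g., AES(Sudeep)) with a randomized scheme, the same name produces a different ciphertext each time, so the database can't "Search" or match on it easily. You lose the benefit of the graph.
- Hashing: If you store a SHA-256 hash of an email, the same input always yields the same hash, so you can still perform Entity Resolution (Module 11). Because email addresses are guessable, a keyed hash (HMAC with a secret key) is safer than a bare SHA-256, which can be reversed by hashing candidate addresses.
- The Workflow: If two nodes have the same ID-Hash, they are the same person. The AI agent only sees the Hash (the "Pseudonym"). It can reason about User_ABC, but it doesn't know who User_ABC is. Note that under GDPR, pseudonymized data still counts as personal data as long as it can be re-linked to the person.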
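The matching behavior described above can be sketched with Python's standard library. This is a minimal illustration, not a production design: it swaps in HMAC-SHA256 for a bare SHA-256, and the function name pseudonymize and the demo key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(email: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for an email using a keyed hash (HMAC-SHA256)."""
    # Normalize so trivially different spellings map to the same pseudonym.
    normalized = email.strip().lower()
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key for illustration only; a real key would live in a secrets manager.
SECRET_KEY = b"demo-key-not-for-production"

a = pseudonymize("Sudeep@Example.com", SECRET_KEY)
b = pseudonymize("  sudeep@example.com ", SECRET_KEY)
c = pseudonymize("someone.else@example.com", SECRET_KEY)

same_person = a == b        # two spellings of one email resolve to one node ID
different_person = a != c   # a different email yields a different node ID
```

Because the output is deterministic for a given key, two records for the same email can be merged on the pseudonym alone, without the graph ever holding the address itself.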
2. The "Detached PII" Architecture
This pattern is a strong default for high-security RAG:
- The Graph: Contains only hashes and behavior, e.g. (Hash_123)-[:LIKES]->(Sushi).
- The Vault: An encrypted SQL DB that maps Hash_123 to Sudeep Devkota.
During Retrieval:
- The AI finds that Hash_123 is the target.
- A secure "Finalizer" step (in the API, not the LLM) looks up the name in the Vault to present the final answer to the authorized user.
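The retrieval flow above might look roughly like the following sketch. It assumes pseudonyms follow a Hash_<id> naming pattern and uses an in-memory dictionary as a stand-in for the encrypted vault; IdentityVault, finalize, and caller_is_authorized are hypothetical names chosen for illustration.

```python
import re


class IdentityVault:
    """In-memory stand-in for the encrypted vault: maps pseudonym -> real name."""

    def __init__(self):
        self._names = {}

    def put(self, pseudonym: str, real_name: str) -> None:
        self._names[pseudonym] = real_name

    def lookup(self, pseudonym: str):
        # Returns None when the pseudonym is unknown (e.g. after erasure).
        return self._names.get(pseudonym)


# Assumed pseudonym format for this sketch: "Hash_" followed by word characters.
PSEUDONYM = re.compile(r"Hash_\w+")


def finalize(answer: str, vault: IdentityVault, caller_is_authorized: bool) -> str:
    """Swap pseudonyms for real names after the LLM has produced its answer."""
    if not caller_is_authorized:
        return answer  # unauthorized callers only ever see pseudonyms

    def resolve(match):
        name = vault.lookup(match.group(0))
        return name if name is not None else match.group(0)

    return PSEUDONYM.sub(resolve, answer)


vault = IdentityVault()
vault.put("Hash_123", "Sudeep Devkota")
llm_answer = "Hash_123 likes Sushi."

for_authorized = finalize(llm_answer, vault, caller_is_authorized=True)
for_anonymous = finalize(llm_answer, vault, caller_is_authorized=False)
```

The point of the design is that the name substitution happens in ordinary API code after generation, so the real name never enters the LLM's prompt or context.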
3. Implementing the "Right to be Forgotten"
GDPR Article 17 gives a person the right to have their personal data erased on request (subject to some exemptions). In a graph, this means:
- Direct Deletion: Delete the :Person node.
- Property Scrubbing: Delete any personal attributes on related nodes.
- Path Scrubbing: Ensure that a sequence of non-personal nodes doesn't "Uniquely Identify" the deleted person (e.g., if only one person lives at a specific address).
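The three steps above can be sketched on a toy in-memory graph. This is an assumption-laden illustration: the graph is a plain dict of nodes plus a list of edge tuples rather than a real graph database, the helper names erase_person and identifying_nodes are hypothetical, and the string matching used for scrubbing is deliberately naive.

```python
def erase_person(nodes, edges, person_id, personal_strings):
    """Remove a person node, its edges, and property values that mention them.

    nodes: dict of node_id -> property dict
    edges: list of (source_id, relation, target_id) tuples
    personal_strings: known personal values (name, email) to scrub from text
    """
    # 1. Direct deletion: drop the node and every edge touching it.
    nodes.pop(person_id, None)
    edges[:] = [e for e in edges if person_id not in (e[0], e[2])]

    # 2. Property scrubbing: redact personal values left on other nodes.
    for props in nodes.values():
        for key, value in props.items():
            if isinstance(value, str):
                for secret in personal_strings:
                    value = value.replace(secret, "[erased]")
                props[key] = value


def identifying_nodes(edges, person_ids, threshold=2):
    """3. Path scrubbing aid: list nodes linked to fewer than `threshold` people."""
    linked = {}
    for source, _, target in edges:
        if source in person_ids:
            linked.setdefault(target, set()).add(source)
    return sorted(n for n, people in linked.items() if len(people) < threshold)


nodes = {
    "p1": {"label": "Person", "id": "Hash_88z"},
    "p2": {"label": "Person", "id": "Hash_41q"},
    "r1": {"label": "Review", "text": "Great sushi. By Sudeep Devkota"},
    "addr1": {"label": "Address"},
    "nyc": {"label": "Location"},
}
edges = [
    ("p1", "WROTE", "r1"),
    ("p1", "LIVES_AT", "addr1"),
    ("p1", "LOCATED_IN", "nyc"),
    ("p2", "LOCATED_IN", "nyc"),
]

# Run the uniqueness check before deleting, while the person's edges still exist.
risky = identifying_nodes(edges, {"p1", "p2"})
erase_person(nodes, edges, "p1", ["Sudeep Devkota"])
```

Here the address and the review are each linked to a single person, so they are flagged for review, while the NYC location is shared by two people and is not. In a real system the nodes flagged in risky would then be deleted or generalized as part of the same erasure request.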
graph TD
    subgraph "Safe Graph"
        U1[Hash: 88z] --- P1[Project X]
        U1 --- L1[Location: NYC]
    end
    subgraph "Secure Vault"
        V1[Hash: 88z] --> REAL[Name: Sudeep]
    end
    style REAL fill:#f44336,color:#fff
    note[The AI reasoning stays in the 'Safe Graph']
4. Implementation: Safe Entity Matching with Hashing
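import hashlib

def process_person(graph, name, email):
    # `graph` is assumed to be a client exposing run(query, **params), e.g. a py2neo Graph.
    # NEVER store the raw email (or the real `name`) in the graph.
    # Normalize first so "A@x.com" and " a@x.com" resolve to the same node.
    normalized = email.strip().lower()
    email_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    # Store the hash as the Primary Key for the node.
    # A keyed hash (hmac.new with a secret key) would resist guessing attacks better.
    graph.run("MERGE (p:Person {id: $id}) SET p.name = $name",
              id=email_hash, name="Pseudonym_" + email_hash[:4])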
5. Summary and Exercises
PII management in graphs is a balance of Utility and Privacy.
- Pseudonymization (Hashing) allows for graph reasoning without identity exposure.
- Detached Architecture keeps sensitive data out of the "Reasoning Engine."
- Deletion Protocols must account for the "Connected" nature of graph data.
- Compliance is easier when you treat "Identity" as a separate layer from "Connectivity."
Exercises
- Privacy Design: You are building a "Social Media" graph. Which 3 pieces of data should be in the Secure Vault and which 3 should be in the Open Graph?
- The "Forgotten" Drill: If you delete a (Person) node, but keep a (Review) node that says "By Sudeep Devkota," are you still GDPR compliant? (Hint: No!)
- Visualization: Draw a graph with a user and their purchase history. Replace the user's name with a "Ghost" icon. Is the purchase history still useful for an AI to learn from?
In the next lesson, we will look at data integrity: Governance: Versioning the Truth.