PII and the Graph: Privacy-Preserving RAG

Personal data in a graph is like ink in water. Once you link a person's name to a project, a location, and a behavior, traces of that person spread across the graph. Under laws like GDPR, a user has the "Right to be Forgotten." In a relational database, deletion can be as simple as removing a row. In a Knowledge Graph, you have to find every "Ripple" that person left in your network.

In this lesson, we will look at PII Management. We will learn how to use Hashing and Pseudonymization to store personal data safely. We will explore the "Detached PII" pattern, where the graph contains the "Logic" and a separate encrypted vault contains the "Identity." Finally, we will learn how to safely delete a user's presence from a 10-hop network.

1. Hashing vs. Encryption in Graphs

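Encryption: If you encrypt a name (e.g., AES(Sudeep)) with a randomized scheme, the same name yields a different ciphertext each time, so the database can't match or index it without decrypting. You lose the benefit of the graph.
Hashing: A hash is deterministic, so if you store a SHA-256 hash of a normalized email, you can still perform Entity Resolution (Module 11). Because email addresses are guessable, a plain hash can be reversed by hashing candidate addresses and comparing; a keyed hash (such as HMAC-SHA-256 with a secret key) is the safer choice.
The Workflow: If two nodes have the same ID-Hash, they are the same person. The AI agent only sees the hash (the "Pseudonym"). It can reason about User_ABC, but it doesn't know who User_ABC is.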
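
The workflow above can be sketched with Python's standard library. This is a minimal illustration, not a definitive implementation: the key value, the normalization rule (trim and lowercase), and the function names are all assumptions made for the example. It uses a keyed hash (HMAC-SHA-256) so that someone who obtains the graph cannot confirm a guessed email without also holding the key.

```python
import hashlib
import hmac

# Hypothetical secret for illustration only; a real key would come from a
# secrets manager and never be stored alongside the graph.
PSEUDONYM_KEY = b"replace-with-a-real-secret"


def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase so formatting variants match (an assumed rule)."""
    return email.strip().lower()


def pseudonym_for(email: str, key: bytes = PSEUDONYM_KEY) -> str:
    """Return a deterministic, keyed pseudonym for an email address."""
    digest = hmac.new(key, normalize_email(email).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def same_person(email_a: str, email_b: str) -> bool:
    """Entity resolution on pseudonyms only; raw emails are never compared."""
    return hmac.compare_digest(pseudonym_for(email_a), pseudonym_for(email_b))
```

Calling same_person("Sudeep@Example.com ", "sudeep@example.com") returns True, because both inputs normalize to the same string before hashing. The trade-off is that rotating the key changes every pseudonym, so key management becomes part of the graph's identity scheme.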

2. The "Detached PII" Architecture

This pattern is a strong default for high-security RAG:

The Graph: Contains only hashes and behavior. (Hash_123)-[:LIKES]->(Sushi)
The Vault: An encrypted SQL DB that maps Hash_123 to Sudeep Devkota.

During Retrieval:

The AI finds that Hash_123 is the target.
A secure "Finalizer" step (in the API, not the LLM) looks up the name in the Vault to present the final answer to the authorized user.
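
The two retrieval steps might look like the sketch below. Everything here is hypothetical: the Vault class is an in-memory stand-in for the encrypted SQL DB, retrieve() is a placeholder for the graph query plus LLM reasoning, and the caller_is_authorized flag stands in for whatever access control the API really uses. The point it illustrates is that the real name is substituted only after the LLM step has finished.

```python
class Vault:
    """In-memory stand-in for the encrypted identity store (illustration only)."""

    def __init__(self):
        self._names = {}

    def put(self, pseudonym: str, real_name: str) -> None:
        self._names[pseudonym] = real_name

    def lookup(self, pseudonym: str, caller_is_authorized: bool) -> str:
        # The vault enforces its own check instead of trusting its callers.
        if not caller_is_authorized:
            raise PermissionError("caller may not resolve identities")
        return self._names[pseudonym]


def retrieve(question: str) -> dict:
    """Placeholder for graph retrieval plus LLM reasoning.

    It returns only a pseudonym; no real name ever enters this step.
    """
    return {"template": "{person} likes Sushi.", "person": "Hash_123"}


def finalize(result: dict, vault: Vault, caller_is_authorized: bool) -> str:
    """Finalizer: runs in the API layer, after the LLM has finished."""
    if caller_is_authorized:
        person = vault.lookup(result["person"], caller_is_authorized=True)
    else:
        person = result["person"]  # unauthorized callers only see the pseudonym
    return result["template"].format(person=person)


vault = Vault()
vault.put("Hash_123", "Sudeep Devkota")
answer = finalize(retrieve("Who likes sushi?"), vault, caller_is_authorized=True)
```

Here answer is "Sudeep Devkota likes Sushi.", while the same call with caller_is_authorized=False yields "Hash_123 likes Sushi." Keeping the lookup outside the LLM means a prompt injection cannot make the model reveal a name it was never given.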

3. Implementing the "Right to be Forgotten"

GDPR Article 17 (the right to erasure) requires you to delete a user's personal data when they validly request it. In a graph, this means:

Direct Deletion: Delete the :Person node.
Property Scrubbing: Delete any personal attributes on related nodes.
Path Scrubbing: Ensure that a sequence of non-personal nodes doesn't "Uniquely Identify" the deleted person (e.g., if only one person lives at a specific address).
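
The three steps above can be sketched on a small in-memory graph. This is an illustration under stated assumptions, not a production deletion protocol: the node and edge shapes, the function name forget_person, and the substring-based scrubbing rule are all invented for the example, and a real system would run equivalent queries against the graph database (and remove the matching vault entry in a detached design).

```python
def forget_person(nodes, edges, person_id, identifiers):
    """Erase a person from a small in-memory graph.

    nodes: dict of node_id -> {"label": str, "props": dict}
    edges: list of (source_id, target_id) pairs, treated as undirected
    identifiers: strings that identify the person (name, email, ...)
    Returns (nodes, edges, review). `review` lists non-Person node ids that
    were linked to no other Person and may still single out the deleted user.
    """
    def neighbors(node_id):
        return {b if a == node_id else a for a, b in edges if node_id in (a, b)}

    # Path scrubbing: flag neighbors whose only Person link was this person.
    review = []
    for n in sorted(neighbors(person_id)):
        other_people = [m for m in neighbors(n)
                        if m != person_id and nodes[m]["label"] == "Person"]
        if nodes[n]["label"] != "Person" and not other_people:
            review.append(n)

    # Direct deletion: drop every edge that touches the person.
    kept_edges = [(a, b) for a, b in edges if person_id not in (a, b)]

    # Direct deletion of the node, plus property scrubbing on the rest:
    # drop any remaining property whose value mentions an identifier.
    kept_nodes = {}
    for node_id, node in nodes.items():
        if node_id == person_id:
            continue
        clean = {key: value for key, value in node["props"].items()
                 if not any(s.lower() in str(value).lower() for s in identifiers)}
        kept_nodes[node_id] = {"label": node["label"], "props": clean}
    return kept_nodes, kept_edges, review


nodes = {
    "88z": {"label": "Person", "props": {}},
    "77y": {"label": "Person", "props": {}},
    "proj_x": {"label": "Project", "props": {"name": "Project X"}},
    "addr_1": {"label": "Address", "props": {"street": "12 Elm St"}},
    "rev_1": {"label": "Review",
              "props": {"text": "Great sushi", "byline": "By Sudeep Devkota"}},
}
edges = [("88z", "proj_x"), ("77y", "proj_x"), ("88z", "addr_1"), ("88z", "rev_1")]
new_nodes, new_edges, review = forget_person(nodes, edges, "88z", ["Sudeep Devkota"])
```

After the call, the review's byline is gone, Project X survives because another person is still attached to it, and review contains "addr_1" and "rev_1": both were linked only to the deleted person, so a human or a stricter policy should decide whether they can stay.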

graph TD
    subgraph "Safe Graph"
    U1[Hash: 88z] --- P1[Project X]
    U1 --- L1[Location: NYC]
    end
    
    subgraph "Secure Vault"
    V1[Hash: 88z] --> REAL[Name: Sudeep]
    end
    
    style REAL fill:#f44336,color:#fff
    note[The AI reasoning stays in the 'Safe Graph']

4. Implementation: Safe Entity Matching with Hashing

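import hashlib

def process_person(graph, name, email):
    # NEVER store the raw email or the real name in the graph.
    # Normalize first so formatting variants map to the same node.
    normalized = email.strip().lower()
    email_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    # Store the hash as the Primary Key; the node's name is only a pseudonym.
    # The real `name` belongs in the Secure Vault, keyed by email_hash.
    # `graph` is assumed to be a py2neo-style client whose run() takes keyword parameters.
    graph.run("MERGE (p:Person {id: $id}) SET p.name = $name",
              id=email_hash, name="Pseudonym_" + email_hash[:4])
    return email_hash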

5. Summary and Exercises

PII management in graphs is a balance of Utility and Privacy.

Pseudonymization (Hashing) allows for graph reasoning without identity exposure.
Detached Architecture keeps sensitive data out of the "Reasoning Engine."
Deletion Protocols must account for the "Connected" nature of graph data.
Compliance is easier when you treat "Identity" as a separate layer from "Connectivity."

Exercises

Privacy Design: You are building a "Social Media" graph. Which 3 pieces of data should be in the Secure Vault and which 3 should be in the Open Graph?
The "Forgotten" Drill: If you delete a (Person) node, but keep a (Review) node that says "By Sudeep Devkota," are you still GDPR compliant? (Hint: No!)
Visualization: Draw a graph with a user and their purchase history. Replace the user's name with a "Ghost" icon. Is the purchase history still useful for an AI to learn from?

In the next lesson, Governance: Versioning the Truth, we will look at data integrity.