Entity Reconciliation: Cleansing the Graph

Solve the 'Duplicate Entity' problem mathematically. Learn how to use Similarity algorithms to identify when multiple nodes actually represent the same real-world object.

We have seen how to find "Missing" facts (Lesson 4). Now we look at the opposite: Duplicate Facts. In a massive graph, you will inevitably end up with nodes like [Sudeep Dev], [S. Dev], and [User_101] that all refer to the same person. If you don't reconcile these, your AI's "Context" will be fragmented across three different islands.

In this lesson, we will learn how to use similarity algorithms to perform Entity Reconciliation (also known as Record Linkage). We will look at two string measures, Soundex and Levenshtein Distance, and at the GDS Node Similarity algorithm, which compares graph structure rather than text. We will then learn how to "Merge" these nodes logically without losing the source-of-truth data from each.


1. The Reconciliation Workflow

Reconciliation is a 3-step mathematical process:

  1. Blocking: Narrowing down the billions of possible pairs into a few thousand candidates (e.g., "Only compare nodes that start with the letter S").
  2. Scoring: Calculating a similarity score based on name, email, and social connections.
  3. Merging: Creating a [:SAME_AS] relationship or moving all edges to a single "Golden Node."
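
The three steps above can be sketched in plain Python, outside the database. This is a minimal sketch under stated assumptions: the records, the `block_key` and `score` helpers, the equal weighting of name and e-mail, and the 0.7 threshold are all hypothetical, and `difflib.SequenceMatcher` simply stands in for whichever string measure you prefer.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records: (node_id, name, email). Not taken from a real dataset.
records = [
    ("n1", "Sudeep Dev", "sudeep@example.com"),
    ("n2", "S. Dev", "sudeep@example.com"),
    ("n3", "Sara Dean", "sara@example.com"),
    ("n4", "Tom Reed", "tom@example.com"),
]

def block_key(name: str) -> str:
    """Step 1, blocking: only names with the same first letter are compared."""
    return name[0].upper()

def candidate_pairs(rows):
    """Yield every pair of records that falls into the same block."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[block_key(row[1])].append(row)
    for members in blocks.values():
        yield from combinations(members, 2)

def score(a, b) -> float:
    """Step 2, scoring: assumed 50/50 mix of name similarity and e-mail match."""
    name_sim = SequenceMatcher(None, a[1].lower(), b[1].lower()).ratio()
    email_sim = 1.0 if a[2] == b[2] else 0.0
    return 0.5 * name_sim + 0.5 * email_sim

THRESHOLD = 0.7  # assumed cut-off; tune it on labelled examples

# Step 3, merging: here we only record SAME_AS pairs instead of touching a graph.
same_as = [(a[0], b[0]) for a, b in candidate_pairs(records) if score(a, b) >= THRESHOLD]
print(same_as)  # [('n1', 'n2')]
```

Blocking cuts the six possible pairs among four records down to three, and only the pair that shares an e-mail address clears the threshold.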

2. Similarity Algorithms

String Similarity (Text-only)

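  • Soundex: "Do these names sound the same?" A phonetic code that catches spelling variants which are pronounced alike (e.g., "Smith" and "Smyth").
  • Levenshtein: "How many single-character insertions, deletions, or substitutions does it take to turn A into B?" (Great for catching typos).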
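
To make the two string measures concrete, here is a small self-contained Python sketch of both. It follows the common American Soundex rules and the textbook dynamic-programming form of Levenshtein distance; it is meant for illustration, not as a replacement for a tested library, and the name "Sudip" is a made-up variant used only for the example.

```python
# Soundex digit for each coded consonant; vowels, Y, H and W have no code.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(name: str) -> str:
    """Return the 4-character Soundex code (first letter plus three digits)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

print(soundex("Robert"), soundex("Rupert"))   # R163 R163
print(soundex("Sudeep"), soundex("Sudip"))    # S310 S310
print(levenshtein("kitten", "sitting"))       # 3
```

Soundex gives the same code to names that sound alike even when several letters differ, while Levenshtein counts the exact number of edits, so the two measures tend to catch different kinds of variants.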

Topological Similarity (Structure-only)

  • Node Similarity: "Are these two nodes connected to the same things?" The score measures how much the two nodes' neighbor sets overlap.
  • If Sudeep Dev and User_101 are both connected to the same 5 Projects and the same 3 Slack channels, they are strong candidates for being the same person, even if their names aren't similar.
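
The neighbor-overlap idea can be sketched with the Jaccard index, which is the kind of set-overlap score Node Similarity is built on (check your GDS version for the exact metrics it offers). The project and channel names below, and the extra "Tom Reed" node, are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Shared neighbors divided by all distinct neighbors of either node."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

projects = {f"Project {i}" for i in range(1, 6)}   # 5 hypothetical projects
channels = {"#general", "#graph", "#rag"}          # 3 hypothetical Slack channels

neighbors = {
    "Sudeep Dev": projects | channels,
    "User_101": projects | channels,
    "Tom Reed": {"Project 1", "Project 2", "Project 9"},
}

same = jaccard(neighbors["Sudeep Dev"], neighbors["User_101"])
different = jaccard(neighbors["Sudeep Dev"], neighbors["Tom Reed"])
print(same, round(different, 3))  # 1.0 0.222
```

Identical neighbor sets score 1.0 regardless of the node names, while a colleague who shares only two projects scores far lower.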

3. The "Golden Node" Pattern

Instead of deleting the duplicates, we create a "Leader" node.

  • Node A: (Sudeep Dev)
  • Node B: (S. Dev)
  • Golden Node: (PERSON_REF_101)
  • Links: (Node A) -[:REFERRED_TO_BY]-> (Golden Node) and (Node B) -[:REFERRED_TO_BY]-> (Golden Node)

RAG Retrieval: The AI search lands on Node A. It immediately follows the REFERRED_TO_BY edge to the Golden Node, where it finds the combined knowledge of both sources. This is how you achieve "Unified Context."

graph TD
    N1[Sudeep Dev]
    N2[S. Dev]
    N3[S.D.]
    
    G((GOLDEN NODE: SUDEEP))
    
    N1 --> G
    N2 --> G
    N3 --> G
    
    G -->|All Edges| P[Project Alpha]
    G -->|All Edges| D[Dept 10]
    
    style G fill:#34A853,color:#fff
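
The pattern can also be sketched as an in-memory model in Python. This is only an illustration under assumptions: the `Node` class, `build_golden`, and `resolve` are hypothetical helpers, and the `golden` field plays the role of the REFERRED_TO_BY edge. The point it shows is that edges are copied to the Golden Node while the source nodes stay untouched.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    node_id: str
    props: dict
    edges: set = field(default_factory=set)  # ids of neighboring nodes
    golden: Optional[str] = None             # stands in for REFERRED_TO_BY

def build_golden(golden_id: str, duplicates: list) -> Node:
    """Create a Golden Node that unions the duplicates' edges without deleting them."""
    golden = Node(golden_id, {"sources": [d.node_id for d in duplicates]})
    for dup in duplicates:
        golden.edges |= dup.edges   # copy, so each source keeps its own history
        dup.golden = golden_id      # link the source to its Golden Node
    return golden

def resolve(node: Node, index: dict) -> Node:
    """Follow the link to the Golden Node if there is one, else stay on the node."""
    return index[node.golden] if node.golden else node

a = Node("Sudeep Dev", {"name": "Sudeep Dev"}, {"Project Alpha"})
b = Node("S. Dev", {"name": "S. Dev"}, {"Dept 10"})
golden = build_golden("PERSON_REF_101", [a, b])
index = {n.node_id: n for n in (a, b, golden)}

# A search that lands on Node A ends up with the combined context.
print(sorted(resolve(a, index).edges))  # ['Dept 10', 'Project Alpha']
```

Because the sources are kept, an incorrect merge can be undone by removing the Golden Node and clearing the links.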

4. Implementation: Finding Similar Nodes with Cypher

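// Use Node Similarity to find 'Potential Duplicates'.
// Assumes 'myGraph' has already been projected into the GDS graph catalog.
CALL gds.nodeSimilarity.stream('myGraph')
YIELD node1, node2, similarity
// Keep only pairs whose neighbor sets overlap almost completely.
WHERE similarity > 0.95
// Translate internal node ids back into readable names.
RETURN gds.util.asNode(node1).name AS name1,
       gds.util.asNode(node2).name AS name2,
       similarity
ORDER BY similarity DESC;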

// A similarity of 0.99 marks a high-confidence merge candidate, not a confirmed duplicate.

5. Summary and Exercises

Reconciliation is about Sanity and Completeness.

  • String Similarity catches naming variants.
  • Topological Similarity catches structural duplicates.
  • Golden Nodes unify knowledge without destroying the raw data history.
  • Fragmentation is the enemy of a coherent AI answer.

Exercises

  1. Merger Task: You have nodes for "McDonald's" and "Maccas" (Australian slang). Would a String Similarity algorithm (Levenshtein) find them? Would a Topological Similarity algorithm (sharing a menu and a logo) find them?
  2. The "Safety" Threshold: If the similarity score is 0.85, do you automatically merge them? What is the risk of "Merging" two different people who just happen to work on the same projects?
  3. Visualization: Draw 3 nodes that have 0 string similarity but share identical neighbors. How does this suggest they are likely the same entity?

In the next lesson, we will look at the performance side: Using GDS to Pre-Rank Knowledge for RAG.
