Entity Granularity and Normalization: The Goldilocks Zone

Master the art of defining the 'Right Sized' nodes. Learn how to avoid the 'Everything is a Node' trap and how to normalize data across multiple sources to prevent graph duplication.

When building a Knowledge Graph, you face a constant tension: Detail vs. Performance.

If your graph is too granular (every sentence is a node), it becomes a cluttered mess that is hard for an LLM to navigate. If it is too coarse (only "Users" and "Projects"), the agent loses the ability to reason about fine-grained details. Finding the level in between is the search for the Goldilocks Zone.

In this lesson, we will explore Entity Granularity (picking the right level of "Thing") and Entity Normalization (making sure "London" and "LON" are the same node). We will learn how to unify data from fragmented sources and why "Merging" is the most dangerous, yet most essential, part of graph construction.


1. Selecting the Level of Granularity

What should be a node, and what should be a property?

The "Entity Test":

  1. Does it have multiple relationships? -> Node.
  2. Does it have its own internal properties? -> Node.
  3. Is it just a simple value (e.g., Color: Blue)? -> Property.
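The Entity Test above can be sketched as a small predicate. This is only an illustration: the `Candidate` class and `should_be_node` function are hypothetical names, and treating "more than one relationship" as the cutoff is an assumption about what "multiple" means here.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    """Hypothetical description of something we might model in the graph."""
    name: str
    relationships: list = field(default_factory=list)  # edge types it would take part in
    properties: dict = field(default_factory=dict)      # its own internal attributes

def should_be_node(candidate):
    """Return True if the candidate passes rule 1 or rule 2 of the Entity Test."""
    has_multiple_relationships = len(candidate.relationships) > 1
    has_own_properties = len(candidate.properties) > 0
    return has_multiple_relationships or has_own_properties

meeting = Candidate(
    "Meeting",
    relationships=["ATTENDED_BY", "PART_OF"],
    properties={"agenda": "Q3 roadmap"},
)
color = Candidate("Blue")  # a bare value with no edges or attributes

print(should_be_node(meeting))  # True
print(should_be_node(color))    # False
```

Anything that fails both checks is a simple value and is better stored as a property on the node it describes.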

Case Study: The "Meeting" Node

  • Coarse: A meeting is just a property on a Project node. (Loses the list of attendees).
  • Granular: Every attendee, every agenda item, and every minute of the meeting is a separate node. (Graph explodes in size).
  • Goldilocks: The Meeting is a node. Attendees are linked nodes. The Agenda is a text property on the Meeting node.
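As a rough sketch of the Goldilocks option, the meeting can be modeled with plain Python dictionaries and edge tuples. The node IDs, the `ATTENDED` and `ABOUT` edge names, and the `attendees` helper are all made up for illustration; a real graph database would have its own API.

```python
# Nodes keyed by ID; each node carries a label plus its own properties.
nodes = {
    "meeting:q3-kickoff": {"label": "Meeting", "agenda": "1. Budget\n2. Timeline"},
    "person:alice": {"label": "Person", "name": "Alice"},
    "person:bob": {"label": "Person", "name": "Bob"},
    "project:atlas": {"label": "Project", "name": "Atlas"},
}

# Edges as (source, relationship, target) triples.
edges = [
    ("person:alice", "ATTENDED", "meeting:q3-kickoff"),
    ("person:bob", "ATTENDED", "meeting:q3-kickoff"),
    ("meeting:q3-kickoff", "ABOUT", "project:atlas"),
]

def attendees(meeting_id):
    """Return the sorted names of everyone linked to the meeting by an ATTENDED edge."""
    return sorted(
        nodes[src]["name"]
        for src, rel, dst in edges
        if rel == "ATTENDED" and dst == meeting_id
    )

print(attendees("meeting:q3-kickoff"))  # ['Alice', 'Bob']
```

The attendee list stays queryable because attendees are nodes, while the agenda stays a single text property and adds no extra nodes.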

2. The Entity Normalization Problem (Deduplication)

In the real world, entities have aliases.

  • A says: "The Big Apple"
  • B says: "New York City"
  • C says: "NYC"

If your Graph RAG system doesn't know these are all the same entity, it will have three separate nodes. When you ask "Tell me about the climate of NYC," it will only find the facts connected to the "NYC" node, ignoring the "New York City" node.

Techniques for Normalization:

  1. Unique Identifiers: Using SSNs, Email addresses, or UUIDs instead of strings.
  2. Canonicalization: Translating all variants to a standard form (e.g., mapping every spelling of a city to a single GEO_ID).
  3. LLM Resolution: Asking an LLM: "Are 'NYC' and 'The Big Apple' the same city? Answer YES or NO."
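To make canonicalization concrete, here is a minimal sketch that maps every known alias to one canonical ID before any fact is stored. The `geo:` IDs, the alias table, and the `canonicalize` and `add_fact` helpers are assumptions for illustration, not a real geographic ID scheme.

```python
# Hypothetical alias table: every known surface form maps to one canonical ID.
ALIAS_TO_CANONICAL = {
    "nyc": "geo:new-york-city",
    "new york city": "geo:new-york-city",
    "the big apple": "geo:new-york-city",
    "london": "geo:london",
    "lon": "geo:london",
}

def canonicalize(mention):
    """Return the canonical ID for a mention, or None if the alias is unknown."""
    key = " ".join(mention.lower().split())  # lowercase and collapse whitespace
    return ALIAS_TO_CANONICAL.get(key)

facts = {}  # canonical ID -> list of facts attached to that single node

def add_fact(mention, fact):
    """Attach a fact to the canonical node, whatever alias the source used."""
    node_id = canonicalize(mention)
    if node_id is None:
        raise KeyError(f"Unknown entity: {mention!r}")
    facts.setdefault(node_id, []).append(fact)

add_fact("The Big Apple", "has humid summers")
add_fact("New York City", "has cold winters")

print(facts[canonicalize("NYC")])  # ['has humid summers', 'has cold winters']
```

Because all three aliases resolve to the same key, a question about "NYC" sees facts that were ingested under any of the other names.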

3. The "Merger" Risk: Avoiding False Positives

Deduplication can go wrong. If you have two employees named "John Smith," and you merge them into a single John Smith node, your Graph RAG system will now think one person has two wives, four jobs, and two different birthdays.

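The Solution: Key nodes on a unique or composite ID rather than on a display name. Instead of identifying a node only as John Smith, identify it by something that cannot collide, such as john.smith@company.com, or by a combination such as name plus employee number.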
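A minimal sketch of this idea, assuming every employee has a unique email address: the node key is the email, and the name is just a property. The `upsert_person` helper, the in-memory `graph` dict, and the second email address are hypothetical.

```python
graph = {}  # node ID -> property dict

def upsert_person(name, email, **props):
    """Create or update a person node keyed on email, never on display name."""
    node_id = email.strip().lower()
    node = graph.setdefault(node_id, {"name": name})
    node.update(props)
    return node_id

first = upsert_person("John Smith", "john.smith@company.com", job="Engineer")
second = upsert_person("John Smith", "john.smith2@company.com", job="Accountant")

print(len(graph))           # 2
print(graph[first]["job"])  # Engineer
```

Both nodes share the name "John Smith", but because the key is the email they are never merged by accident.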

graph TD
    subgraph "The Cluttered View (No Normalization)"
    N1[Apple Inc]
    N2[Apple]
    N3[iPhone Maker]
    end
    
    subgraph "The Clean View (Normalized)"
    G1(("Apple (AAPL)"))
    G1 --- Alias1[Apple Inc]
    G1 --- Alias2[iPhone Maker]
    end
    
    style G1 fill:#4285F4,color:#fff

4. Implementation: Entity Resolution with String Matching and LLMs

Let's look at a Python pattern for basic entity resolution.

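from fuzzywuzzy import fuzz

# Names of the nodes that already exist in our graph
existing_nodes = ["Microsoft", "Google-Corp", "Apple-Inc"]

# Stand-in for an LLM call: a hand-written alias table (simulated)
SIMULATED_LLM_ALIASES = {
    "Google": "Google-Corp",
    "Alphabet": "Google-Corp",
    "G-Suite": "Google-Corp",
}

def resolve_entity(new_entity):
    # 1. First pass: fuzzy matching catches typos and casing differences
    for node in existing_nodes:
        if fuzz.ratio(new_entity.lower(), node.lower()) > 90:
            return node

    # 2. Second pass: LLM logic (simulated) catches aliases that look different
    if new_entity in SIMULATED_LLM_ALIASES:
        return SIMULATED_LLM_ALIASES[new_entity]

    # 3. If no match, treat it as a new node
    return new_entity

# "Google" vs "Google-Corp" scores well below 90 on fuzz.ratio,
# so it is the second pass that resolves it.
print(resolve_entity("Google"))  # "Google-Corp"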
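The simulated second pass could be swapped for a real YES/NO question to a model. The sketch below is only one way to structure that: `same_entity` is a hypothetical helper, `ask_llm` is any callable you supply that takes a prompt string and returns the model's text, and `fake_llm` is a stub so the example runs without a network call.

```python
def same_entity(a, b, ask_llm):
    """Ask an LLM (passed in as a callable) whether two names refer to the same entity."""
    prompt = f"Are '{a}' and '{b}' the same entity? Answer YES or NO."
    answer = ask_llm(prompt).strip().upper()
    # Anything other than an explicit YES counts as "not the same",
    # which errs on the side of avoiding a false merger.
    return answer.startswith("YES")

def fake_llm(prompt):
    """Stub standing in for a real model call (assumption for this example)."""
    if "'NYC'" in prompt and "'The Big Apple'" in prompt:
        return "YES"
    return "NO"

print(same_entity("NYC", "The Big Apple", fake_llm))  # True
print(same_entity("NYC", "London", fake_llm))         # False
```

Passing the model in as a parameter keeps the resolution logic testable and lets you change providers without touching the merge code.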

5. Summary and Exercises

Right-sizing and normalizing are the "hygiene" of your graph.

  • Granularity: Keep nodes for "Actors" and properties for "Descriptors."
  • Normalization: Consolidate aliases to give the LLM a single point of truth.
  • Risk: Watch out for "False Mergers" of entities with the same name.

Exercises

  1. Granularity Drill: You are modeling a library. Should a "Book Chapter" be a node or a property of the "Book" node? List one reason for each.
  2. Identity Challenge: Find 3 aliases for a famous person. How would you store these in a graph so that a search for any of them finds the same person? (Hint: Use an ALIASES edge or a names list property).
  3. The "John Smith" Problem: If two people have the same name but work in different cities, what is the best "Property" to include in their Unique ID to separate them forever?

In the next lesson, we will move to the design phase: Schema Design for Knowledge Graphs.
