Avoiding Over-Modeling and Under-Modeling: Pragmatic Design

Designing a graph is addictive. Once you see the power of connections, you feel tempted to model everything. But in production Graph RAG, every node has a cost—in storage, in query latency, and in "LLM Confusion." On the flip side, a sparse graph is just a poor-quality vector database.

In this lesson, we will learn the art of Pragmatic Modeling. We will identify the warning signs of Over-Modeling (The Snowflake Schema) and Under-Modeling (The Blob Schema), and we will use the "Query-First" approach to decide whether a piece of data deserves "Node" status or should remain a simple "Property."

1. Over-Modeling: The "Everything is a Node" Trap

Over-modeling occurs when you break data down into its smallest possible semantic parts without a clear retrieval reason.

The Symptom: Your graph for a simple contact list has separate nodes for First Name, Last Name, Country Code, and Area Code.

Query Impact: To reassemble one contact's name and phone number, the query has to follow four separate edges instead of reading a single node.
LLM Impact: The prompt becomes cluttered with thousands of tiny, meaningless relationships.

The Rule: If an entity is just a Descriptor that is never shared by other entities, it should be a Property.
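
The rule above can be sketched with networkx as a clean-up pass. This is a minimal illustration, not a production migration: the "Descriptor" label, the key/value attributes, the function name fold_unshared_descriptors, and the phone values are all assumptions made up for the example.

```python
import networkx as nx

def fold_unshared_descriptors(G, descriptor_label="Descriptor"):
    """Return a copy of G in which descriptor nodes attached to exactly
    one entity are folded into that entity as properties."""
    H = G.copy()
    for node, data in list(G.nodes(data=True)):
        if data.get("label") != descriptor_label:
            continue
        neighbors = list(G.neighbors(node))
        if len(neighbors) == 1:  # never shared, so demote it to a property
            owner = neighbors[0]
            H.nodes[owner][data["key"]] = data["value"]
            H.remove_node(node)
    return H

# Over-modeled contact: one node per name part and per phone part.
# The values are illustrative only.
G = nx.Graph()
G.add_node("contact:1", label="Contact")
for key, value in [("first_name", "Sudeep"), ("last_name", "Dev"),
                   ("country_code", "+44"), ("area_code", "20")]:
    node_id = f"contact:1/{key}"
    G.add_node(node_id, label="Descriptor", key=key, value=value)
    G.add_edge("contact:1", node_id)

H = fold_unshared_descriptors(G)
print(H.number_of_nodes())   # 1
print(H.nodes["contact:1"])  # the contact now carries all four values as properties
```

A descriptor that is linked to two or more entities is left alone, because a shared value is exactly the case where a node may still be justified.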

2. Under-Modeling: The "Opaque Blob" Trap

Under-modeling occurs when you hide critical relationships inside a text property.

The Symptom: You have a Project node with a text property called description: "This project is led by Sudeep and depends on the Tokyo server."

Query Impact: The graph engine cannot "See" the connection to Sudeep or Tokyo. It has to wait for an LLM to read the text.
AI Impact: You lose graph traversal and multi-hop pathfinding (e.g., "Find all projects that use the Tokyo server", or "Who leads a project that depends on it?").

The Rule: If two entities have a Logic Connection that needs to be queried, they must be separate Nodes with an Edge.
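
To make the difference concrete, here is a small networkx sketch in which the relationships from the description text are also stored as edges. The project names, the LEADS and DEPENDS_ON edge types, and both helper functions are hypothetical choices for this example.

```python
import networkx as nx

G = nx.DiGraph()
G.add_node(
    "Project Alpha",
    label="Project",
    # The blob alone is invisible to the graph engine.
    description="This project is led by Sudeep and depends on the Tokyo server.",
)
G.add_node("Project Beta", label="Project")
G.add_node("Sudeep", label="Person")
G.add_node("Tokyo server", label="Server")

# The same facts, modeled as explicit edges.
G.add_edge("Sudeep", "Project Alpha", type="LEADS")
G.add_edge("Project Alpha", "Tokyo server", type="DEPENDS_ON")
G.add_edge("Project Beta", "Tokyo server", type="DEPENDS_ON")

def projects_using(G, server):
    """One hop: every project with a DEPENDS_ON edge into the server."""
    return sorted(
        src for src, _, data in G.in_edges(server, data=True)
        if data.get("type") == "DEPENDS_ON"
    )

def leaders_affected_by(G, server):
    """Two hops: people who lead a project that depends on the server."""
    leaders = set()
    for project in projects_using(G, server):
        for src, _, data in G.in_edges(project, data=True):
            if data.get("type") == "LEADS":
                leaders.add(src)
    return sorted(leaders)

print(projects_using(G, "Tokyo server"))       # ['Project Alpha', 'Project Beta']
print(leaders_affected_by(G, "Tokyo server"))  # ['Sudeep']
```

Neither query touches the description property, so no LLM call is needed to answer them.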

3. The "Entity Resolution" Litmus Test

How do you know if something should be a node? Ask yourself: "Does this thing have a unique identity that carries across multiple documents?"

Color: "The car is blue." -> "Blue" is a property. (It is a plain value with no identity of its own to resolve across documents.)
Vendor: "The car is from Ford." -> "Ford" is a node. (Identity is shared by many cars).
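
One way to apply the litmus test consistently is to write it down as a tiny decision helper. The sketch below is only an assumption about how you might encode the rules from this lesson; the Candidate class, its two flags, and suggest_modeling are hypothetical names, and the flags still have to be filled in by a human who knows the planned queries.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    has_identity_across_documents: bool  # the same real-world thing shows up in many documents
    queried_as_connection: bool          # planned queries need to hop to or through it

def suggest_modeling(candidate):
    """Suggest 'node' when the candidate has a shared identity or is a hop target."""
    if candidate.has_identity_across_documents or candidate.queried_as_connection:
        return "node"
    return "property"

candidates = [
    Candidate("color", False, False),
    Candidate("vendor", True, True),
    Candidate("first_name", False, False),
    Candidate("server", True, True),
]
for c in candidates:
    print(f"{c.name}: {suggest_modeling(c)}")
# color: property
# vendor: node
# first_name: property
# server: node
```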

graph TD
    subgraph "Over-Modeled (Slow)"
    P1[Person] --- F[First Name: Sudeep]
    P1 --- L[Last Name: Dev]
    P1 --- C[City: London]
    end
    
    subgraph "Pragmatic (Fast)"
    P2(("Person {name: 'Sudeep Dev'}")) --- C2((City: London))
    end
    
    style P2 fill:#4285F4,color:#fff
    style C2 fill:#34A853,color:#fff

4. Modeling for the Context Window

Remember: In Graph RAG, the "Winner" is the system that provides the most Information Density per token.

Bad: Retrieving 10 nodes to describe one person's bio. (Token waste).
Good: Retrieving 1 node with 10 properties. (Token efficient).
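
A rough way to see the difference is to serialize the same facts both ways and compare their size. The sketch below is an approximation under stated assumptions: the serialization formats are invented for the example, the "role" value is made up, and rough_token_count is a crude regex proxy, so a real LLM tokenizer will give different absolute numbers.

```python
import re

def as_single_node(node_id, props):
    """Serialize one node with all facts as properties."""
    pairs = ", ".join(f"{k}: {v}" for k, v in props.items())
    return f"({node_id} {{{pairs}}})"

def as_many_nodes(node_id, props):
    """Serialize the same facts as one relationship per fact."""
    return "\n".join(
        f"({node_id})-[:HAS_{k.upper()}]->({k}: {v})" for k, v in props.items()
    )

def rough_token_count(text):
    """Crude proxy: count word chunks and punctuation marks separately."""
    return len(re.findall(r"\w+|[^\w\s]", text))

# Illustrative bio; "role" is a made-up value for the example.
bio = {"name": "Sudeep Dev", "city": "London", "role": "Engineer"}

single = as_single_node("person:1", bio)
many = as_many_nodes("person:1", bio)

print(single)
print(many)
print(rough_token_count(single), "vs", rough_token_count(many))
```

The over-modeled form repeats the node identifier and a relationship label for every fact, and that repetition is where the extra tokens come from.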

5. Implementation: Assessing Model Density with Python

Let's write a script that calculates "Connectedness"—a simple metric to see if your graph is becoming too sparse or too dense.

import networkx as nx

def analyze_graph_health(G, min_avg_degree=1.0, max_avg_degree=50.0):
    """Classify graph density by average degree.

    The default thresholds are illustrative; tune them for your own data.
    """
    num_nodes = G.number_of_nodes()
    num_edges = G.number_of_edges()

    # Average degree: each edge adds one to the degree of two nodes.
    avg_degree = (2 * num_edges) / num_nodes if num_nodes > 0 else 0.0

    if avg_degree < min_avg_degree:
        return "WARNING: Under-Modeled. Too many isolated islands."
    if avg_degree > max_avg_degree:
        return "WARNING: Over-Modeled. Graph is becoming a 'Hairball'."

    return "SUCCESS: Graph density is in the Healthy Zone."

# TEST
healthy_g = nx.complete_graph(5)  # 5 nodes, 10 edges, average degree 4.0
print(analyze_graph_health(healthy_g))
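
An average can hide a single oversized hub, so it may be worth checking per-node degree as well. The sketch below is one possible complement to the average-degree check; find_hubs and its max_share threshold are assumptions for illustration, not an established metric.

```python
import networkx as nx

def find_hubs(G, max_share=0.5):
    """Return nodes connected to more than max_share of all other nodes."""
    n = G.number_of_nodes()
    if n < 2:
        return []
    limit = max_share * (n - 1)
    return [node for node, degree in G.degree() if degree > limit]

# A star graph: node 0 is linked to 100 leaves, and the leaves are not linked to each other.
star = nx.star_graph(100)
avg_degree = 2 * star.number_of_edges() / star.number_of_nodes()

print(round(avg_degree, 2))  # 1.98, which looks healthy on average
print(find_hubs(star))       # [0], the hub that the average hides
```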

6. Summary and Exercises

Pragmatism is the key to production RAG.

Over-modeling wastes tokens and increases latency.
Under-modeling kills the logic engine's ability to "Hop."
Identity is the metric for turning a string into a node.
Density should be monitored to keep the graph "Navigable."

Exercises

Redesign Task: Look at a "Product" page on Amazon. List 3 attributes of that product that should be Nodes and 3 that should be Properties.
The "Address" Debate: Is "Street Address" a Node or a Property? What about "Zip Code"? Why?
Hairball Check: If you have a node Type: Human, and it is connected to 1 billion other nodes, is it a useful node? Or should you delete it and move Human into a property of those nodes?

In the next lesson, we will look at how to manage change: Versioning and Schema Evolution.