
Avoiding Over-Modeling and Under-Modeling: Pragmatic Design
Find the structural sweet spot for your AI. Learn how to avoid the 'Everything is a Node' complexity trap while ensuring you capture enough detail for complex reasoning.
Designing a graph is addictive. Once you see the power of connections, you feel tempted to model everything. But in production Graph RAG, every node has a cost: in storage, in query latency, and in "LLM confusion." On the flip side, a graph that is too sparse is little more than a poor-quality vector database.
In this lesson, we will learn the art of Pragmatic Modeling. We will identify the warning signs of Over-Modeling (The Snowflake Schema) and Under-Modeling (The Blob Schema). We will learn how to use the "Query-First" approach to determine if a piece of data deserves the "Node" status or should remain a simple "Property."
1. Over-Modeling: The "Everything is a Node" Trap
Over-modeling occurs when you break data down into its smallest possible semantic parts without a clear retrieval reason.
The Symptom: Your graph for a simple contact list has separate nodes for First Name, Last Name, Country Code, and Area Code.
- Query Impact: To find a phone number, the AI has to do a 4-hop traversal.
- LLM Impact: The prompt becomes cluttered with thousands of tiny, meaningless relationships.
The Rule: If an entity is just a Descriptor that is never shared by other entities, it should be a Property.
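The cost of the contact-list symptom can be made concrete with a minimal sketch. It assumes networkx; the node IDs (such as "contact:1"), the HAS_* edge labels, and the sample values are hypothetical and only serve the illustration.

```python
import networkx as nx

# Over-modeled: every descriptor is its own node hanging off the contact.
over = nx.Graph()
over.add_edge("contact:1", "first_name:Sudeep", label="HAS_FIRST_NAME")
over.add_edge("contact:1", "last_name:Dev", label="HAS_LAST_NAME")
over.add_edge("contact:1", "country_code:+44", label="HAS_COUNTRY_CODE")
over.add_edge("contact:1", "area_code:20", label="HAS_AREA_CODE")

# Pragmatic: descriptors that nothing else shares become properties.
pragmatic = nx.Graph()
pragmatic.add_node(
    "contact:1",
    first_name="Sudeep",
    last_name="Dev",
    country_code="+44",
    area_code="20",
)

# Count the graph elements needed to describe one contact.
over_size = over.number_of_nodes() + over.number_of_edges()
pragmatic_size = pragmatic.number_of_nodes() + pragmatic.number_of_edges()
print(over_size, pragmatic_size)
```

In the over-modeled version, reading the four descriptors means following four edges; in the pragmatic version it is a single node lookup.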
2. Under-Modeling: The "Opaque Blob" Trap
Under-modeling occurs when you hide critical relationships inside a text property.
The Symptom: You have a Project node with a text property called description: "This project is led by Sudeep and depends on the Tokyo server."
- Query Impact: The graph engine cannot "see" the connection to Sudeep or Tokyo. The relationship stays invisible until an LLM reads the text.
- AI Impact: You lose the ability to perform multi-hop pathfinding (e.g., "Find all projects that use the Tokyo server").
The Rule: If two entities have a logical connection that needs to be queried, they must be separate Nodes with an Edge.
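The "Tokyo server" query can be sketched as follows. This is an illustrative example assuming networkx; the node IDs, the kind attribute, the LED_BY and DEPENDS_ON labels, and the projects_using helper are all hypothetical names chosen for the sketch.

```python
import networkx as nx

g = nx.DiGraph()

# Under-modeled: the relationships are buried in free text.
g.add_node(
    "project:alpha",
    kind="Project",
    description="This project is led by Sudeep and depends on the Tokyo server.",
)

# Pragmatic: the same kind of facts as explicit nodes and edges.
g.add_node("project:beta", kind="Project")
g.add_node("person:sudeep", kind="Person")
g.add_node("server:tokyo", kind="Server")
g.add_edge("project:beta", "person:sudeep", label="LED_BY")
g.add_edge("project:beta", "server:tokyo", label="DEPENDS_ON")


def projects_using(graph, server_id):
    """Return the projects that have a DEPENDS_ON edge pointing at server_id."""
    return sorted(
        src
        for src, _dst, data in graph.in_edges(server_id, data=True)
        if data.get("label") == "DEPENDS_ON"
    )


found = projects_using(g, "server:tokyo")
print(found)
```

The traversal finds project:beta but not project:alpha, even though the description of project:alpha mentions the Tokyo server. That gap is the price of the opaque blob.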
3. The "Entity Resolution" Litmus Test
How do you know if something should be a node? Ask yourself: "Does this thing have a unique identity that carries across multiple documents?"
- Color: "The car is blue." -> "Blue" is a property. (It describes the car but has no identity of its own that you would track across documents).
- Vendor: "The car is from Ford." -> "Ford" is a node. (Identity is shared by many cars).
graph TD
    subgraph "Over-Modeled (Slow)"
        P1[Person] --- F[First Name: Sudeep]
        P1 --- L[Last Name: Dev]
        P1 --- C[City: London]
    end
    subgraph "Pragmatic (Fast)"
        P2(("Person {name: 'Sudeep Dev'}")) --- C2((City: London))
    end
    style P2 fill:#4285F4,color:#fff
    style C2 fill:#34A853,color:#fff
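One way to approximate the litmus test in code is to count how many distinct documents mention the same value. The sketch below is a rough heuristic under stated assumptions: the suggest_nodes helper, the (document, attribute, value) tuple format, and the min_sources threshold are all hypothetical.

```python
from collections import defaultdict


def suggest_nodes(mentions, min_sources=2):
    """Return (attribute, value) pairs seen in at least min_sources documents.

    mentions is an iterable of (document_id, attribute, value) tuples.
    """
    sources = defaultdict(set)
    for doc_id, attribute, value in mentions:
        sources[(attribute, value)].add(doc_id)
    return {key for key, docs in sources.items() if len(docs) >= min_sources}


mentions = [
    ("doc1", "vendor", "Ford"),
    ("doc2", "vendor", "Ford"),
    ("doc1", "color", "blue"),
    ("doc2", "color", "red"),
]
candidates = suggest_nodes(mentions)
print(candidates)
```

A count alone cannot tell shared identity from coincidence (two blue cars would also pass the threshold), so the result is best treated as a list of candidates to review, not a final schema.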
4. Modeling for the Context Window
Remember: In Graph RAG, the "Winner" is the system that provides the most Information Density per token.
- Bad: Retrieving 10 nodes to describe one person's bio. (Token waste).
- Good: Retrieving 1 node with 10 properties. (Token efficient).
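The difference in density can be estimated by serializing the same facts both ways and comparing their size. This is a minimal sketch: the as_triples and as_record helpers and the sample properties are hypothetical, and character count is only a crude stand-in for tokens, since real tokenizers split text differently.

```python
def as_triples(entity_id, props):
    """Serialize one line per relationship, as an over-modeled graph would."""
    return "\n".join(
        f"({entity_id})-[:HAS_{key.upper()}]->({key}: {value})"
        for key, value in props.items()
    )


def as_record(entity_id, props):
    """Serialize the whole entity as one node with inline properties."""
    inline = ", ".join(f"{key}: {value}" for key, value in props.items())
    return f"({entity_id} {{{inline}}})"


bio = {"name": "Sudeep Dev", "role": "Engineer", "team": "Platform"}

triples = as_triples("person:1", bio)
record = as_record("person:1", bio)

# Character count as a rough proxy for prompt size.
print(len(triples), len(record))
```

The triple form repeats the entity ID and relationship syntax on every line, so its overhead grows with each property, while the record form pays that overhead once.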
5. Implementation: Assessing Model Density with Python
Let's write a script that calculates average degree, a simple "connectedness" metric for spotting a graph that is becoming too sparse or too dense.
import networkx as nx


def analyze_graph_health(G):
    """Classify a graph as too sparse, too dense, or healthy by average degree."""
    num_nodes = G.number_of_nodes()
    num_edges = G.number_of_edges()

    # Average degree: each edge touches two nodes, hence the factor of 2.
    avg_degree = (2 * num_edges) / num_nodes if num_nodes > 0 else 0

    # The 1.0 and 50.0 thresholds are rough heuristics; tune them for your data.
    if avg_degree < 1.0:
        return "WARNING: Under-Modeled. Too many isolated islands."
    if avg_degree > 50.0:
        return "WARNING: Over-Modeled. Graph is becoming a 'Hairball'."
    return "SUCCESS: Graph density is in the Healthy Zone."


# TEST: 5 nodes, each connected to the other 4, so the average degree is 4.
healthy_g = nx.complete_graph(5)
print(analyze_graph_health(healthy_g))
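Average degree is a coarse signal. A contact with four descriptor nodes forms a small star whose average degree is 1.6, so it would pass the check above even though it is over-modeled. As a complementary sketch, the share of degree-1 "leaf" nodes and the number of fully isolated nodes can be inspected as well. The leaf_ratio and isolated_count helpers are hypothetical names, and no threshold is suggested here because a sensible one depends on your data.

```python
import networkx as nx


def leaf_ratio(G):
    """Return the fraction of nodes that have exactly one edge."""
    num_nodes = G.number_of_nodes()
    if num_nodes == 0:
        return 0.0
    leaves = sum(1 for _node, degree in G.degree() if degree == 1)
    return leaves / num_nodes


def isolated_count(G):
    """Return the number of nodes with no edges at all."""
    return nx.number_of_isolates(G)


# A hub with four descriptor-style leaves (5 nodes in total).
star = nx.star_graph(4)
# Three nodes with no edges between them.
islands = nx.empty_graph(3)

print(leaf_ratio(star), isolated_count(islands))
```

A high leaf ratio hints that many nodes are descriptors that could be folded into properties, while a high isolate count points to relationships still hidden in text.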
6. Summary and Exercises
Pragmatism is the key to production RAG.
- Over-modeling wastes tokens and increases latency.
- Under-modeling kills the logic engine's ability to "Hop."
- Identity is the metric for turning a string into a node.
- Density should be monitored to keep the graph "Navigable."
Exercises
- Redesign Task: Look at a "Product" page on Amazon. List 3 attributes of that product that should be Nodes and 3 that should be Properties.
- The "Address" Debate: Is "Street Address" a Node or a Property? What about "Zip Code"? Why?
- Hairball Check: If you have a node Type: Human, and it is connected to 1 billion other nodes, is it a useful node? Or should you delete it and move Human into a property of those nodes?
In the next lesson, we will look at how to manage change: Versioning and Schema Evolution.