Link Prediction: Guessing the Missing Facts

A Knowledge Graph is always incomplete. A person might omit a relationship in their bio. A document might mention a project but forget to mention the department. If your graph only contains Explicit Facts, your AI agent is limited by the "Silence" of your data. Link Prediction changes this. It allows the graph engine to say: "There is an 85% probability that Sudeep is related to Project X, even though I haven't seen a single document that says so."

In this lesson, we will explore the science of Relationship Suggestion. We will look at Common Neighbors, the Jaccard Coefficient, and Adamic-Adar. We will learn how to use these metrics to "Fill the Gaps" in our RAG system and how to present these "Predicted Facts" to the LLM as useful (but unverified) context.

1. What is Link Prediction?

Link Prediction is the task of predicting the likelihood of an edge between two nodes that are currently disconnected.

The Logic:

Sudeep works with Jane.
Sudeep works with Bob.
Jane and Bob both work on Project Titan.
Prediction: Sudeep is likely related to Project Titan as well, because both of his known collaborators are.

2. Algorithms of Connection

Common Neighbors

The simplest metric: count how many connections (for example, mutual friends) two nodes share.

If we have 10 mutual friends, we are likely to meet soon.
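
As a minimal sketch of this count, assume the graph is held in memory as a dictionary of adjacency sets. The `graph` data and the `common_neighbors` helper are illustrative names, not part of any graph database API; the nodes reuse the Sudeep example from section 1.

```python
# Toy undirected graph stored as adjacency sets (illustrative data only).
graph = {
    "Sudeep": {"Jane", "Bob"},
    "Jane": {"Sudeep", "Project Titan"},
    "Bob": {"Sudeep", "Project Titan"},
    "Project Titan": {"Jane", "Bob"},
}


def common_neighbors(graph, a, b):
    """Return the set of nodes adjacent to both a and b."""
    return graph.get(a, set()) & graph.get(b, set())


# Sudeep and Project Titan have no direct edge, but they share two neighbors.
shared = common_neighbors(graph, "Sudeep", "Project Titan")
score = len(shared)
print(sorted(shared), score)
```

The score is a raw count, not a probability. Turning it into a percentage like the ones quoted in this lesson would need an extra calibration step that this sketch does not attempt.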

Jaccard Coefficient

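Like Common Neighbors, but normalized: the number of shared neighbors is divided by the total number of distinct neighbors the two nodes have between them.

This prevents "Hub Nodes" (think of big cities) from being predicted as "related" to everyone just because they have so many connections.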
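
The effect of the normalization can be sketched in a few lines, again assuming an in-memory adjacency-set graph. `build_graph`, `jaccard`, and the node names are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency-set graph from (u, v) pairs."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def jaccard(graph, a, b):
    """Shared neighbors divided by all distinct neighbors of a and b."""
    na, nb = graph.get(a, set()), graph.get(b, set())
    union = na | nb
    if not union:
        return 0.0  # two isolated nodes: define the score as 0
    return len(na & nb) / len(union)


# Alice and Carol each know only X and Y; Hub knows X, Y and 8 others.
edges = [("Alice", "X"), ("Alice", "Y"), ("Carol", "X"), ("Carol", "Y")]
edges += [("Hub", "X"), ("Hub", "Y")]
edges += [("Hub", f"N{i}") for i in range(8)]
graph = build_graph(edges)

alice_carol = jaccard(graph, "Alice", "Carol")  # 2 shared out of 2 distinct
alice_hub = jaccard(graph, "Alice", "Hub")      # 2 shared out of 10 distinct
print(alice_carol, alice_hub)
```

Both pairs have the same Common Neighbors count (2), but the hub's large neighborhood pulls its Jaccard score down.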

Adamic-Adar

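Weights each common neighbor by how "Unique" it is: a shared neighbor with few connections of its own counts for more than one that is connected to almost everything. Each common neighbor contributes 1 / log(its degree) to the score.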

If we share a weird hobby (e.g., "Underwater Basket Weaving"), that is a stronger signal of connection than sharing a common interest (e.g., "Breathing Air").
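
A rough sketch of that weighting, using the standard Adamic-Adar sum of 1 / log(degree) over common neighbors. The graph, the people, and the helper names are made up for the example.

```python
import math
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency-set graph from (u, v) pairs."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def adamic_adar(graph, a, b):
    """Sum 1 / log(degree) over the neighbors shared by a and b."""
    score = 0.0
    for z in graph.get(a, set()) & graph.get(b, set()):
        degree = len(graph[z])
        if degree > 1:  # guard against log(1) == 0
            score += 1.0 / math.log(degree)
    return score


# A rare hobby shared by only two people...
edges = [("Ann", "Basket Weaving"), ("Ben", "Basket Weaving")]
# ...versus an "interest" shared by 100 people.
edges += [("Cat", "Breathing Air"), ("Dan", "Breathing Air")]
edges += [(f"Person{i}", "Breathing Air") for i in range(98)]
graph = build_graph(edges)

rare = adamic_adar(graph, "Ann", "Ben")    # 1 / log(2)
common = adamic_adar(graph, "Cat", "Dan")  # 1 / log(100)
print(round(rare, 3), round(common, 3))
```

Both pairs share exactly one neighbor, so Common Neighbors cannot tell them apart; Adamic-Adar ranks the rare-hobby pair much higher.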

3. Using Predicted Links for RAG

When your agent retrieves context, it can look at these "Hidden Links."

The Result: "I found no direct mention of Sudeep on Project Titan, but 90% of his close collaborators are on that project, suggesting he may be an uncredited contributor."

This is Predictive Reasoning. It allows the agent to provide "Leads" rather than just "Dead Ends."
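
One possible way to hand such leads to the LLM is to keep verified and predicted edges in separate, clearly labeled sections of the context. The sketch below assumes facts arrive as (subject, relation, object) tuples and predictions carry an extra score; `format_context`, the `min_score` threshold, and the relation names are hypothetical.

```python
def format_context(facts, predictions, min_score=0.5):
    """Render verified facts and predicted links as separate text sections."""
    lines = ["Verified facts:"]
    lines += [f"- {s} {r} {o}" for s, r, o in facts]
    lines.append("Predicted links (unverified, inferred from graph structure):")
    # Strongest predictions first; drop anything below the threshold.
    for s, r, o, score in sorted(predictions, key=lambda p: p[3], reverse=True):
        if score >= min_score:
            lines.append(f"- {s} {r} {o} (score {score:.2f})")
    return "\n".join(lines)


facts = [
    ("Sudeep", "WORKS_WITH", "Jane"),
    ("Jane", "WORKS_ON", "Project Titan"),
]
predictions = [
    ("Sudeep", "PREDICTED_WORKS_ON", "Project Titan", 0.85),
    ("Sudeep", "PREDICTED_KNOWS", "Carol", 0.20),
]

context = format_context(facts, predictions)
print(context)
```

Keeping the two sections apart lets the system prompt instruct the model to phrase anything from the second section as a lead rather than a fact.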

graph LR
    S[Sudeep] --- C1[Colleague 1]
    S --- C2[Colleague 2]
    
    C1 --- P[Project Titan]
    C2 --- P
    
    S -.->|Predicted Link: 85%| P
    
    style S fill:#4285F4,color:#fff
    style P fill:#34A853,color:#fff
    style C1 fill:#f4b400,color:#fff

4. Implementation: Finding Potential Relationships with Cypher

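The query below uses a raw Common Neighbors count (shared projects) to find people who "Should" be connected. If the Neo4j Graph Data Science library is installed, its gds.similarity.jaccard function can be used to compute a normalized score instead.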

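// Find pairs of people who are NOT connected,
// but share many of the same projects.
MATCH (p1:Person)-[:WORKS_ON]->(proj:Project)<-[:WORKS_ON]-(p2:Person)
// id(p1) < id(p2) keeps each pair once and excludes self-pairs.
// (On Neo4j 5+, elementId() replaces the deprecated id().)
WHERE id(p1) < id(p2) AND NOT (p1)-[:KNOWS]-(p2)

// DISTINCT avoids double-counting a project linked by duplicate edges.
WITH p1, p2, count(DISTINCT proj) AS common_projects
WHERE common_projects > 3
RETURN p1.name, p2.name, common_projects
ORDER BY common_projects DESC;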

// You can then insert a '[:PREDICTED_KNOWS]' edge for the AI to find.
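
For readers who want to trace the logic outside the database, here is a rough in-memory Python analogue of the query above. It is a sketch only: `works_on`, `knows`, and `predicted_knows` are hypothetical names, and the data is invented.

```python
from itertools import combinations

# Person -> set of projects (stands in for the WORKS_ON edges).
works_on = {
    "Ana": {"P1", "P2", "P3", "P4", "P5"},
    "Raj": {"P1", "P2", "P3", "P4"},
    "Lee": {"P1", "P2"},
}
# Existing KNOWS edges, stored as unordered pairs.
knows = {frozenset(("Ana", "Lee"))}


def predicted_knows(works_on, knows, min_common=3):
    """Return (p1, p2, common) for unconnected pairs sharing > min_common projects."""
    results = []
    # combinations() yields each unordered pair once, like id(p1) < id(p2).
    for p1, p2 in combinations(sorted(works_on), 2):
        if frozenset((p1, p2)) in knows:
            continue  # already connected, nothing to predict
        common = len(works_on[p1] & works_on[p2])
        if common > min_common:
            results.append((p1, p2, common))
    return sorted(results, key=lambda r: r[2], reverse=True)


candidates = predicted_knows(works_on, knows)
print(candidates)
```

Each returned tuple corresponds to one row of the Cypher result and is a candidate for a [:PREDICTED_KNOWS] edge.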

5. Summary and Exercises

Link prediction is the "Intuition" of the graph database.

Topology is a signal for hidden relationships.
Common Neighbors is the foundation of the logic.
Predicted links give the agent "Leads" where retrieval would otherwise hit "Dead Ends," which helps with the "Cold Start" problem for sparsely documented entities.
Transparency: Always label predicted links with a PREDICTED marker (for example, [:PREDICTED_KNOWS]) so the AI knows to use caution.

Exercises

Prediction Log: Think of 3 people you know. Do they know each other? If not, do they share enough common neighbors that a computer would think they know each other?
Safety Check: What is the risk of an AI agent stating a "Predicted Fact" as a "Hard Fact"? How would you change your system prompt to handle "Probabilistic Edges"?
Visualization: Draw a graph where two nodes are connected to the exact same 5 other nodes, but have no direct link. Calculate the Common Neighbor score (It's 5!).

In the next lesson, we will look at cleaning the graph: Similarity Algorithms for Entity Reconciliation.