
Top-K Neighborhood Retrieval: The Context Cloud
Master the most common Graph RAG retrieval pattern. Learn how to pull a concentrated 'cloud' of facts around a central entity to provide a 360-degree view of any topic.
In module 4, we learned that a Neighborhood is the set of facts surrounding a node. In module 8, we learned how to write the Cypher for it. Now, we arrive at the Strategy: How do we use this for a production AI assistant?
Top-K Neighborhood Retrieval is the bread and butter of Graph RAG. It is the strategy you use whenever a user asks a question about a specific "Thing" (e.g., "What is Project Titan?"). In this lesson, we will look at how to refine the "Context Cloud," how to handle the "Large Neighborhood" problem, and how to format these disparate facts into a narrative that an LLM can actually use.
1. The Strategy: Breadth over Depth
When a user asks about a specific entity, they aren't looking for a deep logical chain across 10 hops. They are looking for a Portrait.
- Level 1 (Direct): Facts owned by the entity (e.g., Name, Date, Owner).
- Level 2 (Inferred): Facts about the entity's connections (e.g., The owner's department, The project's dependencies).
The Strategy: Pull the "Most Important" $K$ nodes within a fixed number of hops (usually 1 or 2).
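As a minimal sketch of this strategy, assuming a Neo4j-style handle db whose run method accepts a Cypher string plus parameters (all names here are illustrative, not a fixed API):

CLOUD_QUERY = """
MATCH (e {name: $name})-[*1..2]-(neighbor)
RETURN DISTINCT neighbor.name AS fact_node
LIMIT $k
"""

def fetch_context_cloud(db, entity_name, k=25):
    # Breadth over depth: cap the walk at 2 hops and the result at K rows.
    return [record["fact_node"]
            for record in db.run(CLOUD_QUERY, name=entity_name, k=k)]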
2. Ranking the "Importance" Within the Cloud
If a node has 500 neighbors, you cannot include all of them in the prompt. You must Sub-Select.
Selection Criteria (a combined scoring sketch follows this list):
- Semantic Match: Using the user's query vector to find the most relevant neighbors.
- Topological Match: Using Node Degree or PageRank to find the most "Notable" neighbors.
- Recency: Prioritizing facts with the newest timestamp.
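In practice you can blend all three signals into one score. Below is a hedged sketch; the field names ('embedding', 'degree', 'updated_at') and the 0.6/0.2/0.2 weights are assumptions to be tuned per graph, not a standard:

import math
from datetime import datetime, timezone

def cosine(a, b):
    # Plain cosine similarity; swap in numpy for real workloads.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def score_neighbor(neighbor, query_vec, now=None):
    # `neighbor` is a dict with assumed fields: 'embedding' (list of floats),
    # 'degree' (int), and 'updated_at' (timezone-aware datetime).
    now = now or datetime.now(timezone.utc)
    semantic = cosine(query_vec, neighbor["embedding"])   # semantic match
    notability = math.log1p(neighbor["degree"])           # topological match
    age_days = max((now - neighbor["updated_at"]).days, 0)
    recency = 1.0 / (1.0 + age_days)                      # recency
    return 0.6 * semantic + 0.2 * notability + 0.2 * recency

def top_k(neighbors, query_vec, k=10):
    return sorted(neighbors, key=lambda n: score_neighbor(n, query_vec),
                  reverse=True)[:k]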
3. Serialization: From Subgraph to Story
Once you have your neighbor nodes (say, 10 nodes for a Person), you have two ways to present them to the LLM:
A. The "List of Facts" (Simple):
- "Sudeep works in London."
- "Sudeep manages the AI Team."
- "The AI Team uses Python."
B. The "JSON Graph" (Detailed):
{"node": "Sudeep", "relationships": [{"target": "AI Team", "type": "LEADS"}]}
RAG Tip: The "List of Facts" is usually better because LLMs are trained on natural language. They find it easier to weave these sentences into a coherent answer.
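A tiny, illustrative sentencizer (the (subject, REL_TYPE, object) triple format is an assumption about how your facts arrive):

def sentencize(triples):
    # Turn (subject, REL_TYPE, object) triples into plain sentences:
    # SNAKE_CASE relationship types become lowercase verbs.
    return "\n".join(
        f"{subj} {rel.replace('_', ' ').lower()} {obj}."
        for subj, rel, obj in triples
    )

print(sentencize([("Sudeep", "LEADS", "AI Team"),
                  ("Sudeep", "LIVES_IN", "London")]))
# Sudeep leads AI Team.
# Sudeep lives in London.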
graph TD
C((Central Entity)) --> N1[Fact 1]
C --> N2[Fact 2]
C --> N3[Fact 3]
C --> N4[Fact 4]
subgraph "Top-3 Ranking"
N1
N2
N4
end
N3 -.-x LLM[LLM Prompt]
N1 & N2 & N4 --> LLM
style C fill:#4285F4,color:#fff
style LLM fill:#34A853,color:#fff
4. Implementation: A "Portrait" Retrieval Logic in Python
from neo4j import GraphDatabase

# Assumes a local Neo4j instance; adjust the URI and credentials to your setup.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def get_entity_portrait(entity_name, k=10):
    # 1. Fetch the entity's 1-hop neighbors
    # 2. Rank by relationship weight (coalesce sorts missing weights last)
    # 3. Serialize each row as a "subject RELATION object" fact string
    query = """
    MATCH (e {name: $name})-[r]-(neighbor)
    RETURN e.name + ' ' + type(r) + ' ' + neighbor.name AS fact,
           coalesce(r.weight, 0) AS importance
    ORDER BY importance DESC
    LIMIT $k
    """
    with driver.session() as session:
        results = session.run(query, name=entity_name, k=k)
        return "\n".join(record["fact"] for record in results)
# OUTPUT:
# Sudeep LEADS AI Team
# Sudeep LIVES_IN London
# Sudeep USES Python
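Hypothetical usage, stitching the portrait into an LLM prompt:

portrait = get_entity_portrait("Sudeep", k=10)
prompt = (
    "Answer using only the facts below.\n\n"
    f"{portrait}\n\n"
    "Question: Who leads the AI Team?"
)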
5. Summary and Exercises
Neighborhood retrieval is about building a "Digital Dossier" on the fly.
- Breadth (width) is more important than Depth (hops) for summary questions.
- Ranking within the neighborhood is mandatory to fit in the context window.
- Sentencizing the graph facts is the most reliable way to feed the LLM.
Exercises
- Context Design: A user asks: "What is the history of our AWS usage?". Should you prioritize "1-hop direct facts" or "2-hop historical logs"?
- The "Noise" Filter: If a neighborhood includes a link to a "City" node that has 1 million other links, should you include that in the portrait? (Hint: General "Super-nodes" like
Londonor2024are often noise). - Visualization: Draw a 1-hop neighborhood of "Your Favorite Fruit." How many facts did you come up with?
In the next lesson, we will look at the opposite pattern: Path-Based Retrieval Patterns.