Community-Based Global Summarization: High-Level Insight

Traditional RAG is built for Bottom-Up questions: "What does Section 4.1 say?" But it fails at Top-Down questions: "What are the main themes across all 5,000 documents?" A vector search for "main themes" only returns the handful of chunks whose wording is closest to the query. Those tend to be generic sentences, not a representative view of the whole corpus.

To solve this, we use Community-Based Global Summarization. Inspired by Microsoft's "GraphRAG" research, this strategy looks at the graph as a whole, groups related nodes into "Clusters" (Communities), and then generates a summary for each cluster. In this lesson, we will learn how to build this "Hierarchical Memory" and how to answer questions that require the AI to "See the forest, not the trees."

1. The Strategy: The Pyramid of Knowledge

Instead of one massive pile of data, we build a hierarchy.

Level 0 (The Facts): The raw nodes and edges.
Level 1 (Sub-Communities): Groups of closely related nodes (e.g., the "Billing Code" cluster).
Level 2 (Communities): Higher-level groups (e.g., the "Finance Department" cluster).
Level 3 (The Global Graph): A single summary of the entire knowledge base.

When the user asks a broad question, the AI doesn't search the facts. It searches the Level 2 and 3 Summaries.
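
The pyramid can be sketched as a small data structure. This is a minimal illustration, not a prescribed schema: the `Community` class, its field names, and the `summaries_at_levels` helper are hypothetical, and the sample summaries are invented. Only the rule that broad questions read Levels 2 and 3 comes from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Community:
    """One node of the summary pyramid (hypothetical schema)."""
    id: str
    level: int                      # 1 = sub-community, 2 = community, 3 = global
    summary: str
    parent_id: Optional[str] = None  # link to the next level up, if any


def summaries_at_levels(communities: List[Community], levels: Set[int]) -> List[str]:
    """Return the summaries stored at the requested pyramid levels."""
    return [c.summary for c in communities if c.level in levels]


pyramid = [
    Community("billing-code", 1, "Invoice generation and tax rules.", "finance"),
    Community("finance", 2, "Billing, payroll, and budgeting.", "global"),
    Community("global", 3, "Company-wide knowledge base overview."),
]

# A broad question reads only the Level 2 and Level 3 summaries.
broad_context = summaries_at_levels(pyramid, {2, 3})
```

Keeping a `parent_id` on each community is one way to let a later step drill down from a high-level summary to the sub-communities beneath it.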

2. Using the Leiden Algorithm for Clustering

How do we group 1 million nodes without a human doing it? We use a community detection algorithm such as Leiden (a refinement of the older Louvain method). It looks at the "Density" of connections: if Nodes A, B, and C are connected to each other much more than they are to the rest of the graph, they form a "Community."
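
One common way to make "density" precise is modularity, the score that Louvain maximizes and that Leiden is often run with. The sketch below only computes modularity for a given grouping, to show why a dense cluster scores well. It does not implement the Leiden search itself, and the toy graph and the `modularity` helper are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def modularity(edges: List[Tuple[str, str]], community_of: Dict[str, int]) -> float:
    """Modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over communities of (internal edge share - expected share)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles joined by a single bridge edge (C-D).
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F"),
         ("C", "D")]

two_groups = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(edges, two_groups)   # each triangle is its own community
q_merged = modularity(edges, one_group)   # everything lumped together
```

Splitting the graph at the bridge scores higher than lumping every node together, which is the signal a community detection algorithm searches for at scale.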

The RAG Step: Once a community is found, we send all its raw facts to an LLM and say: "Summarize the core activity and the main entities of this group into a 200-word report." We then store this report as a Property on a meta-node representing the community.
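
The summarization step might look like the following sketch. The function names, the dictionary layout of the meta-node, and the `fake_llm` stand-in are all hypothetical. A real system would call an actual model and write the result to the graph database.

```python
from typing import Callable, Dict, List


def build_summary_prompt(facts: List[str], max_words: int = 200) -> str:
    """Turn a community's raw facts into a single summarization prompt."""
    fact_lines = "\n".join(f"- {fact}" for fact in facts)
    return (
        "Summarize the core activity and the main entities of this group "
        f"into a {max_words}-word report.\n\nFacts:\n{fact_lines}"
    )


def summarize_community(community_id: str, level: int, facts: List[str],
                        llm: Callable[[str], str]) -> Dict[str, object]:
    """Return the properties for the meta-node that represents the community."""
    report = llm(build_summary_prompt(facts))
    return {"id": community_id, "level": level, "summary": report}


def fake_llm(prompt: str) -> str:
    # Stand-in so the example runs without a model; replace with a real client.
    return "Stub report."


meta_node = summarize_community(
    "billing-code",
    1,
    ["Invoice INV-1 uses tax rule T-7.", "Tax rule T-7 applies to EU customers."],
    fake_llm,
)
```

Passing the model in as a plain callable keeps the prompt logic testable without network access.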

3. Global Query Handling (Map-Reduce)

When a global question arrives (e.g., "What were our biggest risks this year?"):

Map: We send the question to all Level 2 Community Summaries (e.g., Finance, Engineering, Sales).
Generate: Each community provides its local perspective on "Risk."
Reduce: A final LLM pass combines these into a unified, high-level global answer.

graph TD
    subgraph "Global View"
        G[Global Summary]
    end
    
    subgraph "Communities"
        C1[Finance Summary]
        C2[Eng Summary]
        C3[Legal Summary]
    end
    
    C1 --- G
    C2 --- G
    C3 --- G
    
    subgraph "Raw Facts"
        F1[Fact A] --- C1
        F2[Fact B] --- C1
        F3[Fact C] --- C2
    end
    
    style G fill:#4285F4,color:#fff
    style C1 fill:#34A853,color:#fff

4. Why this Beats Vector RAG

If you ask a Vector RAG system about "Risks," it will find 10 snippets about specific bugs. It will miss the "Structural Risk" that is only apparent when you look at how those 10 bugs are all connected to the same outdated server. Community Summarization sees the "Connection" and names the "Theme."
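
A toy illustration of that structural signal, using invented data: each bug report looks unrelated as text, but the edges show that most of them point at the same component. The edge list, the `AFFECTS` relation name, and the node names are hypothetical.

```python
from collections import Counter

# Hypothetical graph edges: (bug, relation, component)
edges = [
    ("bug-101", "AFFECTS", "legacy-server-01"),
    ("bug-102", "AFFECTS", "legacy-server-01"),
    ("bug-103", "AFFECTS", "checkout-ui"),
    ("bug-104", "AFFECTS", "legacy-server-01"),
]

# Count how many bugs touch each component.
bugs_per_component = Counter(component for _, _, component in edges)

# The most connected component is the candidate "theme" for a summary.
shared_component, bug_count = bugs_per_component.most_common(1)[0]
```

A snippet-level search sees four separate bug descriptions. The graph view sees that three of them share one server.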

5. Implementation: The Summarization Workflow

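def answer_global_query(query):
    # `db` and `llm` are assumed to be client objects configured elsewhere:
    # `db.run` returns rows keyed by the RETURN alias, and `llm.generate`
    # maps a prompt string to a response string.

    # 1. Fetch all Level 2 Community Summaries
    rows = db.run("MATCH (c:Community {level: 2}) RETURN c.summary AS summary")
    summaries = [row["summary"] for row in rows]

    # 2. Map: ask the LLM about each summary. The loop is sequential, but the
    #    calls are independent of each other, so they can be parallelized.
    individual_insights = []
    for s in summaries:
        insight = llm.generate(f"Question: {query}\nContext: {s}")
        individual_insights.append(insight)

    # 3. Reduce: synthesize the partial answers into one response
    joined = "\n\n".join(individual_insights)
    final_answer = llm.generate(
        f"Question: {query}\nSynthesize these insights into one answer:\n{joined}"
    )
    return final_answer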
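
Because each per-community call is independent, the map step can run concurrently. The following is a minimal sketch using a thread pool from the standard library. The `map_reduce_answer` function, its `generate` parameter, and the `fake_generate` stand-in are hypothetical, and a real LLM client may impose its own rate limits or offer its own async API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def map_reduce_answer(query: str, summaries: List[str],
                      generate: Callable[[str], str], max_workers: int = 4) -> str:
    """Answer a broad question from community summaries with a parallel map step."""
    prompts = [f"Question: {query}\nContext: {s}" for s in summaries]

    # Map: one call per summary, run in worker threads.
    # pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        insights = list(pool.map(generate, prompts))

    # Reduce: a single final call combines the partial answers.
    joined = "\n\n".join(insights)
    return generate(
        f"Question: {query}\nSynthesize these insights into one answer:\n{joined}"
    )


calls: List[str] = []


def fake_generate(prompt: str) -> str:
    # Stand-in for a model call; it records each prompt it receives.
    calls.append(prompt)
    return "stub answer"


answer = map_reduce_answer(
    "Biggest risks?", ["finance note", "engineering note"], fake_generate
)
```

Threads suit this case because the work is dominated by waiting on network responses, not by local computation.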

6. Summary and Exercises

Global Summarization provides "Executive Intelligence."

Community Detection (Leiden) identifies natural clusters of data.
Community Summaries pre-calculate high-level insights.
Hierarchical Retrieval allows the AI to choose the right level of detail.
Map-Reduce patterns let a single question draw on many community summaries at once, even for massive, million-node graphs.

Exercises

Community Spotting: If you have a graph of a "Social Network," what would a Community represent? (A family? A hobby group? A city?).
The Token Tradeoff: Is it cheaper to summarize every community once (Pre-calculation) or to search the whole graph for every user query?
Prompt Design: Write a prompt that summarizes a "Community" of 50 facts into a 1-sentence "Headline."

In the next lesson, we will look at a hybrid logic: Semantic and Similarity-Driven Graph Searches.