
Community-Based Global Summarization: High-Level Insight
Solve the 'Summarize 1 Million Sentences' problem. Learn how to group your Knowledge Graph into communities and generate high-level summaries that answer top-down executive questions.
Traditional RAG is built for bottom-up questions such as "What does Section 4.1 say?" It struggles with top-down questions such as "What are the main themes across all 5,000 documents?" A vector search for "main themes" returns the chunks whose embeddings sit closest to that phrase, which tend to be generic sentences rather than a representative view of the corpus.
To solve this, we use Community-Based Global Summarization. Inspired by Microsoft's "GraphRAG" research, this strategy looks at the graph as a whole, groups related nodes into clusters (communities), and generates a summary for each one. In this lesson, we will learn how to build this "Hierarchical Memory" and how to answer questions that require the AI to "see the forest, not the trees."
1. The Strategy: The Pyramid of Knowledge
Instead of one massive pile of data, we build a hierarchy.
- Level 0 (The Facts): The raw nodes and edges.
- Level 1 (Sub-Communities): Groups of closely related nodes (e.g., the "Billing Code" cluster).
- Level 2 (Communities): Higher-level groups (e.g., the "Finance Department" cluster).
- Level 3 (The Global Graph): A single summary of the entire knowledge base.
When the user asks a broad question, the AI doesn't search the raw facts. It searches the Level 2 and Level 3 summaries.
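One way to picture the pyramid in code is a small tree of summary nodes. The sketch below is illustrative only: the `Community` class, its field names, and the `summaries_at_level` helper are hypothetical, not part of any GraphRAG library, and the summaries are invented sample text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Community:
    """One node in the summary hierarchy (hypothetical structure)."""
    name: str
    level: int  # 1 = sub-community, 2 = community, 3 = global
    summary: str
    children: list[Community] = field(default_factory=list)


def summaries_at_level(root: Community, level: int) -> list[str]:
    """Walk the tree from the top and collect every summary stored at `level`."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.level == level:
            found.append(node.summary)
        stack.extend(node.children)
    return found


billing = Community("Billing Code", 1, "Invoice generation and tax rules.")
finance = Community("Finance Department", 2, "Billing, payroll, and audits.", [billing])
root = Community("Global", 3, "Company-wide knowledge base overview.", [finance])

# A broad question reads the Level 2 and Level 3 summaries only;
# the Level 0 facts are never loaded.
broad_context = summaries_at_level(root, 2) + summaries_at_level(root, 3)
print(broad_context)
```

A narrower question could call `summaries_at_level(root, 1)` instead, which is the sense in which the hierarchy lets the system pick a level of detail.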
2. Using the Leiden Algorithm for Clustering
How do we group 1 million nodes without a human doing it? We use a community detection algorithm such as Leiden (or its predecessor, Louvain). These algorithms look at the density of connections: if nodes A, B, and C are connected to each other much more than they are to the rest of the graph, they form a community.
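To make "density of connections" concrete: Louvain and Leiden typically search for a partition that maximizes a score called modularity, which rewards edges that fall inside a community beyond what the nodes' degrees alone would predict. The sketch below does not implement Leiden. It only computes modularity for two hand-written partitions of a toy graph (two triangles joined by a single bridge edge) to show why the natural grouping wins. The graph, node names, and `modularity` helper are illustrative assumptions.

```python
from collections import defaultdict


def modularity(edges, partition):
    """Modularity Q for an undirected, unweighted graph.

    edges: list of (u, v) pairs; partition: list of sets of nodes.
    Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2
    """
    m = len(edges)
    community_of = {node: i for i, group in enumerate(partition) for node in group}
    internal = defaultdict(int)    # edges with both ends in the same community
    degree_sum = defaultdict(int)  # total degree of each community's nodes
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in range(len(partition))
    )


# Two triangles (A-B-C and D-E-F) joined by one bridge edge C-D.
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F"),
         ("C", "D")]

natural = [{"A", "B", "C"}, {"D", "E", "F"}]
scrambled = [{"A", "D"}, {"B", "E"}, {"C", "F"}]

q_natural = modularity(edges, natural)
q_scrambled = modularity(edges, scrambled)
print(q_natural, q_scrambled)
```

On this toy graph the natural grouping scores about 0.36 while the scrambled one is negative, so an optimizer would keep the triangles together. In practice you would call an existing implementation (for example the `leidenalg` package or NetworkX's `louvain_communities`) rather than optimize this score by hand.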
The RAG Step: Once a community is found, we send its raw facts to an LLM with an instruction such as: "Summarize the core activity and the main entities of this group into a 200-word report." We then store this report as a property on a meta-node representing the community.
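The summarization step might look like the sketch below. Everything in it is a stand-in: `build_summary_prompt`, `summarize_community`, the dictionary used as a meta-node, and the `fake_llm` callable are hypothetical names, and a real pipeline would call an actual model and write the result to the graph database.

```python
def build_summary_prompt(facts, max_words=200):
    """Format a community's raw facts into a single summarization prompt."""
    bullet_list = "\n".join(f"- {fact}" for fact in facts)
    return (
        "Summarize the core activity and the main entities of this group "
        f"into a report of at most {max_words} words.\n\nFacts:\n{bullet_list}"
    )


def summarize_community(community_id, level, facts, generate):
    """Return a meta-node (a plain dict here) carrying the generated report."""
    report = generate(build_summary_prompt(facts))
    return {
        "id": community_id,
        "level": level,
        "summary": report,
        "fact_count": len(facts),
    }


def fake_llm(prompt):
    """Stand-in for a model call so the sketch runs offline."""
    return f"[report generated from a {len(prompt)}-character prompt]"


facts = [
    "Invoice service applies regional tax rules.",
    "Tax rules are stored in the billing database.",
]
meta_node = summarize_community("billing-code", 1, facts, fake_llm)
print(meta_node["summary"])
```

Passing the model call in as `generate` keeps the prompt-building logic testable without network access; swapping in a real client only changes that one argument.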
3. Global Query Handling (Map-Reduce)
When a global question arrives (e.g., "What were our biggest risks this year?"):
- Map: We send the question to all Level 2 Community Summaries (e.g., Finance, Engineering, Sales).
- Generate: Each community provides its local perspective on "Risk."
- Reduce: A final LLM pass combines these into a unified, high-level global answer.
graph TD
    subgraph "Global View"
        G[Global Summary]
    end
    subgraph "Communities"
        C1[Finance Summary]
        C2[Eng Summary]
        C3[Legal Summary]
    end
    C1 --- G
    C2 --- G
    C3 --- G
    subgraph "Raw Facts"
        F1[Fact A] --- C1
        F2[Fact B] --- C1
        F3[Fact C] --- C2
    end
    style G fill:#4285F4,color:#fff
    style C1 fill:#34A853,color:#fff
4. Why this Beats Vector RAG
If you ask a Vector RAG system about "Risks," it will find 10 snippets about specific bugs. It will miss the "Structural Risk" that is only apparent when you look at how those 10 bugs are all connected to the same outdated server. Community Summarization sees the "Connection" and names the "Theme."
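A toy version of that scenario, assuming each bug is linked to the server it runs on (the bug IDs, server names, and `RUNS_ON` relation are invented for illustration): a snippet-level search sees unrelated bug reports, while a simple count over the graph's edges exposes the node they share.

```python
from collections import Counter

# Hypothetical edges of the form (bug, relation, server).
edges = [(f"BUG-{i}", "RUNS_ON", "legacy-server-01") for i in range(1, 11)]
edges += [("BUG-11", "RUNS_ON", "api-server-02"),
          ("BUG-12", "RUNS_ON", "api-server-03")]

# Count how many bugs point at each server and pick the most common target.
bugs_per_server = Counter(server for _bug, _relation, server in edges)
hotspot, count = bugs_per_server.most_common(1)[0]
print(f"{count} of {len(edges)} bugs trace back to {hotspot}")
```

A community summary generated over this cluster would have the shared server in its input, which is what lets it name the theme instead of listing individual bugs.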
5. Implementation: The Summarization Workflow
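def answer_global_query(query):
    # Assumes `db` (graph database session) and `llm` (model client) are
    # initialized elsewhere, and that each record can be indexed by column name.
    # 1. Fetch all Level 2 community summaries
    records = db.run("MATCH (c:Community {level: 2}) RETURN c.summary AS summary")
    summaries = [record["summary"] for record in records]

    # 2. Map: ask the LLM about each summary (sequential here; the calls are
    #    independent, so they could also be issued in parallel)
    individual_insights = []
    for s in summaries:
        insight = llm.generate(f"Question: {query}\nContext: {s}")
        individual_insights.append(insight)

    # 3. Reduce: synthesize the per-community insights into one answer
    joined = "\n".join(individual_insights)
    final_answer = llm.generate(
        f"Question: {query}\nSynthesize these insights into one answer:\n{joined}"
    )
    return final_answer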
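The loop above calls the LLM for one summary at a time. Because each map call depends only on the question and a single summary, the calls could be issued concurrently. The sketch below shows one way to do that with a thread pool, assuming the LLM client is safe to call from multiple threads; `map_reduce_answer`, its parameters, and the `fake_generate` stand-in are hypothetical names rather than part of any library.

```python
from concurrent.futures import ThreadPoolExecutor


def map_reduce_answer(query, summaries, generate, max_workers=4):
    """Run the map step concurrently, then reduce with one final call."""
    def map_one(summary):
        return generate(f"Question: {query}\nContext: {summary}")

    # executor.map preserves input order, so insights line up with summaries.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        insights = list(executor.map(map_one, summaries))

    joined = "\n".join(f"- {insight}" for insight in insights)
    return generate(f"Question: {query}\nSynthesize these insights:\n{joined}")


def fake_generate(prompt):
    """Stand-in for an LLM call: echoes the last line of the prompt in caps."""
    return prompt.splitlines()[-1].upper()


summaries = ["Finance: currency exposure.", "Engineering: outdated server."]
answer = map_reduce_answer("What were our biggest risks?", summaries, fake_generate)
print(answer)
```

Threads suit this step because the work is dominated by waiting on network responses; `max_workers` should stay within whatever rate limit the model provider enforces.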
6. Summary and Exercises
Global Summarization provides "Executive Intelligence": answers to corpus-wide questions that no single chunk contains.
- Community Detection (Leiden) identifies natural clusters of data.
- Community Summaries pre-calculate high-level insights.
- Hierarchical Retrieval allows the AI to choose the right level of detail.
- Map-Reduce patterns enable summaries of massive, million-node graphs.
Exercises
- Community Spotting: If you have a graph of a "Social Network," what would a Community represent? (A family? A hobby group? A city?).
- The Token Tradeoff: Is it cheaper to summarize every community once (Pre-calculation) or to search the whole graph for every user query?
- Prompt Design: Write a prompt that summarizes a "Community" of 50 facts into a 1-sentence "Headline."
In the next lesson, we will look at a hybrid logic: Semantic and Similarity-Driven Graph Searches.