
Documents, Chunks, Entities, and Facts: The Atoms of Knowledge
Deconstruct information into its most granular components. Learn how to transform raw documents into searchable chunks, identifiable entities, and verifiable facts for your Graph RAG pipeline.
In the world of AI engineering, we often use words like "Data" or "Context" as blanket terms. But to build a high-performance Graph RAG system, we must be much more precise. We need to understand the hierarchy of information. We start with a Document, we split it into Chunks, we extract Entities, and we verify Facts.
In this lesson, we will explore each of these four levels. We will learn why a "Chunk" is the limit of Vector RAG, but an "Entity" is the beginning of Graph RAG. We will also look at the "Fact" as the atomic unit of truth that helps our agents reduce hallucinations by grounding answers in explicit, checkable statements.
1. The Document: The Container of Context
Definition: A complete unit of information (e.g., a PDF manual, a single email, a markdown file).
The Document is the "Source of Truth." It provides the Provenance (where the data came from). In production systems, we never want the agent to just say a fact; we want it to say: "This fact comes from Document [ID-101]."
- Challenge: Documents are often too long for an LLM to attend to evenly. Models tend to recall material at the start and end of a long context better than material buried in the middle, which is known as the "Lost in the Middle" phenomenon.
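The provenance idea above can be sketched in a few lines. This is a minimal illustration, not a prescribed schema: the class names `Document` and `SourcedStatement`, the `source` path, and the sample text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str  # stable identifier used in citations, e.g. "ID-101"
    source: str  # where the document came from (hypothetical path here)
    text: str    # the full raw content


@dataclass(frozen=True)
class SourcedStatement:
    statement: str
    doc_id: str  # every statement carries the ID of its source document

    def cite(self) -> str:
        # Render the statement together with its provenance.
        return f"{self.statement} [source: Document {self.doc_id}]"


manual = Document(
    doc_id="ID-101",
    source="manuals/pump.pdf",  # hypothetical location
    text="The pump must be primed before use.",
)
claim = SourcedStatement(statement=manual.text, doc_id=manual.doc_id)
print(claim.cite())
```

Keeping the document ID on every derived object means the citation survives chunking and extraction later in the pipeline.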
2. The Chunk: The Vector Specialist
Definition: A segment of a document (e.g., 500 characters, a paragraph, or a page).
As we learned in Module 1, Chunks are the primary unit of Vector RAG.
- The Purpose: To provide a manageable "Snippet" of text that fits into an embedding model.
- The Limitation: A chunk is a "dumb" container. It doesn't know that the word "He" in Chunk 2 refers to "Sudeep" in Chunk 1. This is Coreference Fragmentation.
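To make the limitation concrete, here is a minimal sketch of fixed-size character chunking. The `Chunk` class, the `split_into_chunks` helper, and the sample sentence are assumptions made for illustration; production pipelines usually split on sentence or paragraph boundaries instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str  # provenance: which document this chunk came from
    index: int   # position of the chunk within the document
    text: str


def split_into_chunks(doc_id, text, size=500, overlap=0):
    """Split text into windows of `size` characters, optionally overlapping."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    step = size - overlap
    return [
        Chunk(doc_id, i, text[start:start + size])
        for i, start in enumerate(range(0, len(text), step))
    ]


text = "Sudeep flew to London on Monday. He met the regional team there."
chunks = split_into_chunks("ID-101", text, size=33)
for chunk in chunks:
    print(chunk.index, repr(chunk.text))
```

The second chunk begins with "He" and contains no mention of "Sudeep", so on its own it cannot say who met the team. A non-zero `overlap` softens the problem but does not resolve the reference.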
3. The Entity: The Identity of Knowledge
Definition: A unique "Thing" or "Object" mentioned in the text (e.g., a Person, a Company, a Date, a Location, a Project).
Entities are the Nodes of our Knowledge Graph. This is the first major leap toward Graph RAG.
Entity Resolution (The Hard Part):
If one document mentions "Alphabet" and another mentions "Alphabet Inc" or "Google's parent company," a human knows they all refer to the same entity. A Vector RAG system embeds each mention separately and never links them explicitly. A Graph RAG system performs Entity Resolution to merge them into a single node.
Why Entities Matter: Instead of searching for "Text that looks like the query," we search for "The node named 'Alphabet Inc' and its connections." This greatly reduces the ambiguity of natural language.
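One simple way to approach Entity Resolution is a normalization step followed by an alias lookup. The sketch below assumes a small hand-written alias table (`ALIASES` and `resolve_entity` are hypothetical names); real systems typically combine such tables with fuzzy matching or a learned entity linker.

```python
# Hypothetical alias table mapping normalized mentions to a canonical node name.
ALIASES = {
    "alphabet": "Alphabet Inc",
    "alphabet inc": "Alphabet Inc",
    "alphabet inc.": "Alphabet Inc",
}


def resolve_entity(mention):
    """Map a raw mention to its canonical node name, if one is known."""
    # Normalize case and collapse repeated whitespace before the lookup.
    key = " ".join(mention.lower().split())
    # Unknown mentions fall through unchanged and become their own node.
    return ALIASES.get(key, mention.strip())


mentions = ["Alphabet", "  ALPHABET   Inc. ", "Alphabet Inc", "Meta"]
nodes = {resolve_entity(m) for m in mentions}
print(sorted(nodes))
```

Four raw mentions collapse into two graph nodes, so every fact about the company attaches to the same place.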
4. The Fact (Triplet): The Unit of Truth
Definition: A statement of a relationship between two entities, often expressed as a (Subject) -> [Predicate] -> (Object) triplet.
Example: (Llama-3) -> [DEVELOPED_BY] -> (Meta)
A "Fact" is the atomic unit of the Knowledge Graph. Each fact becomes an Edge that connects two nodes.
- Vector RAG: Stores "Meta developed the Llama-3 model in 2024" as unstructured text inside a chunk, represented by an embedding vector.
- Graph RAG: Stores it as two nodes (Llama-3, Meta) and a labeled relationship (DEVELOPED_BY).
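As a rough sketch, a triplet can be modeled as a small immutable record and indexed by subject. The `Fact` and `FactStore` names and the `DEVELOPED_IN` predicate are assumptions made for this example; a real deployment would normally use a graph database rather than an in-memory dictionary.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    predicate: str
    obj: str


class FactStore:
    """A tiny in-memory index of facts keyed by subject."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def add(self, fact):
        # A set makes repeated insertions of the same fact harmless.
        self._by_subject[fact.subject].add(fact)

    def about(self, subject, predicate=None):
        """Return facts for a subject, optionally filtered by predicate."""
        facts = self._by_subject.get(subject, set())
        return sorted(f for f in facts if predicate is None or f.predicate == predicate)


store = FactStore()
store.add(Fact("Llama-3", "DEVELOPED_BY", "Meta"))
store.add(Fact("Llama-3", "DEVELOPED_IN", "2024"))  # hypothetical predicate name
store.add(Fact("Llama-3", "DEVELOPED_BY", "Meta"))  # duplicate, ignored

print(store.about("Llama-3", "DEVELOPED_BY"))
```

Because the relationship is labeled, the question "Who developed Llama-3?" becomes an exact lookup instead of a similarity search over text.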
graph TD
DOC[Document: AI Report] -->|Split| C1[Chunk 1]
DOC -->|Split| C2[Chunk 2]
C1 -->|Extract| E1((Entity: Sudeep))
C1 -->|Extract| E2((Entity: London))
E1 ---|Visited| E2
style E1 fill:#4285F4,color:#fff
style E2 fill:#4285F4,color:#fff
style DOC fill:#f4f4f4
5. The "Context Window" Hierarchy
When an agent answers a question, it builds its "Context" from these atoms:
- Retrieve Entities: "I see the user is asking about 'Sudeep'."
- Retrieve Facts: "Sudeep is the CEO. Sudeep visited London."
- Retrieve Chunks (Hybrid): "Here are the 3 paragraphs that describe the London visit in detail."
- Synthesize: "CEO Sudeep visited the London office to discuss Q1 results."
This hierarchical retrieval is more robust than relying only on the top 5 chunks by vector similarity, because the entities and facts anchor the answer before any free text is added.
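The first three steps can be sketched with toy in-memory data. Everything here is a simplifying assumption: the `KNOWN_ENTITIES`, `FACTS`, and `CHUNKS` structures are hypothetical, entity detection is a naive substring match, and the synthesis step (normally an LLM call) is left out.

```python
# Toy knowledge base (hypothetical data for illustration only).
KNOWN_ENTITIES = {"Sudeep", "London"}
FACTS = [
    ("Sudeep", "HAS_ROLE", "CEO"),
    ("Sudeep", "VISITED", "London"),
]
CHUNKS = [
    {"text": "Sudeep visited the London office to discuss Q1 results.",
     "entities": {"Sudeep", "London"}},
    {"text": "The cafeteria menu changes every Monday.",
     "entities": set()},
]


def build_context(question):
    """Assemble context in order: entities, then facts, then supporting chunks."""
    lowered = question.lower()
    # Step 1: find known entities mentioned in the question.
    entities = sorted(e for e in KNOWN_ENTITIES if e.lower() in lowered)
    # Step 2: collect facts whose subject is one of those entities.
    facts = [f for f in FACTS if f[0] in entities]
    # Step 3: pull chunks tagged with any of those entities for extra detail.
    chunks = [c["text"] for c in CHUNKS if c["entities"] & set(entities)]
    return {"entities": entities, "facts": facts, "chunks": chunks}


context = build_context("Where did Sudeep travel recently?")
print(context)
```

The resulting dictionary is what would be formatted into the prompt for the final synthesis step. Note that the unrelated cafeteria chunk never enters the context.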
6. Implementation: A Simple Entity Extractor with Python
Let's use spaCy's pretrained named-entity recognizer to extract Entities from a raw chunk.
# Setup (run once): pip install spacy && python -m spacy download en_core_web_sm
import spacy

# Load a lightweight English NLP pipeline
nlp = spacy.load("en_core_web_sm")

text = "Apple decided to open a new data center in Ireland last Friday."

def extract_atoms(text):
    """Return (entity text, entity label) pairs found in the text."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

# In Graph RAG, these ORGs, GPEs, and DATEs become our Graph Nodes.
print(extract_atoms(text))
# Typical output (exact entities can vary with the model version):
# [('Apple', 'ORG'), ('Ireland', 'GPE'), ('last Friday', 'DATE')]
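To show how extracted pairs might become Graph Nodes, the following sketch deduplicates `(text, label)` tuples into node records. The `entities_to_nodes` helper, the node fields, and the hard-coded sample list are assumptions; the sample stands in for extractor output so the snippet runs without spaCy installed.

```python
def entities_to_nodes(entities):
    """Collapse (text, label) pairs into unique node records with mention counts."""
    nodes = {}
    for text, label in entities:
        # Treat mentions as the same node when label and lowercased text match.
        key = (label, text.strip().lower())
        node = nodes.setdefault(key, {"name": text.strip(), "label": label, "mentions": 0})
        node["mentions"] += 1
    return list(nodes.values())


# Hand-written stand-in for the extractor's output, with one repeated mention.
sample = [("Apple", "ORG"), ("Ireland", "GPE"), ("last Friday", "DATE"), ("apple", "ORG")]
nodes = entities_to_nodes(sample)
for node in nodes:
    print(node)
```

Matching on lowercased text is only a first pass. Aliases such as abbreviations or renamed companies still need a proper Entity Resolution step.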
7. Summary and Exercises
Knowledge is a hierarchy of increasing structure: each level adds more explicit meaning to the one below it.
- Documents are the source.
- Chunks are the statistical units for search.
- Entities are the unique identities (Nodes).
- Facts are the relationships (Edges).
Exercises
- Atomization Task: Take the sentence: "The CEO of Tesla, Elon Musk, announced a new factory in Austin, Texas on Tuesday." List the 4 Entities and the 3 Facts (Triplets) you can find.
- Entity Resolution: If a user says "The Apple Phone" and your document says "iPhone 15," how would you "Resolve" these to the same entity in a graph?
- Chunking Strategy: Why would a "Small Chunk" (100 characters) be better for finding a specific Entity, but a "Large Chunk" (1000 characters) be better for finding a complex Fact?
In the next lesson, we will explore the "Glue" that holds these atoms together: Implicit vs Explicit Relationships.