
Preserving Document Hierarchy
Learn how to maintain the parent-child relationships and heading structures during document parsing for RAG.
When we parse complex documents like reports, legal contracts, or technical manuals, the hierarchical structure (Headings, Subheadings, Sections) is just as important as the text itself. Losing this hierarchy means losing the context of where a specific piece of information sits.
Why Hierarchy Matters
Imagine a document with a section titled "Safety Procedures" and a subsection "Electrical Systems". If you only extract the text from the subsection without knowing it belongs to "Safety Procedures", the retrieval model might struggle to distinguish it from "Electrical Systems" in a "Maintenance" section.
Implementing Hierarchical Parsing
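Document parsing libraries such as Unstructured, which frameworks like LlamaIndex can build on, classify each extracted element into a category such as Title, Header, or NarrativeText. The function below uses those categories to pair every paragraph with the most recent heading above it: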
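from unstructured.partition.pdf import partition_pdf

def parse_with_hierarchy(file_path):
    """Pair each narrative paragraph with the most recent heading above it."""
    elements = partition_pdf(filename=file_path)
    hierarchy = []
    current_header = None  # stays None for text that precedes the first heading
    for element in elements:
        # In Unstructured, "Title" marks section headings, while "Header"
        # usually marks running page headers; drop "Header" if it adds noise.
        if element.category in ("Title", "Header"):
            current_header = element.text
        elif element.category == "NarrativeText":
            hierarchy.append({
                "header": current_header,
                "text": element.text,
                "metadata": element.metadata.to_dict(),
            })
    return hierarchy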
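The function above remembers only the most recent heading, so a paragraph under a subsection loses its parent section. The following is a minimal sketch of one way to keep the full path instead. It assumes each element has already been reduced to a (level, text) pair, where level 0 is body text and levels 1, 2, 3 are heading depths; how you derive that level (font size, numbering, parser metadata) is left open, and `build_heading_paths` is a hypothetical helper name, not part of any library.

```python
from typing import Iterable


def build_heading_paths(items: Iterable[tuple[int, str]]) -> list[dict]:
    """Attach the full heading path to every body-text item.

    Each item is (level, text): level 0 is body text, level 1+ is a heading
    at that depth. This input shape is an assumption made for the sketch.
    """
    stack: list[tuple[int, str]] = []  # open headings, outermost first
    records = []
    for level, text in items:
        if level > 0:
            # A new heading closes every open heading at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
        else:
            records.append({
                "path": [title for _, title in stack],
                "text": text,
            })
    return records


sample = [
    (1, "Safety Procedures"),
    (2, "Electrical Systems"),
    (0, "Disconnect power before opening the panel."),
    (1, "Maintenance"),
    (2, "Electrical Systems"),
    (0, "Inspect wiring every six months."),
]
records = build_heading_paths(sample)
for record in records:
    print(" > ".join(record["path"]), "|", record["text"])
```

With the full path stored, the two "Electrical Systems" subsections stay distinguishable because each record carries its parent section as well.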
Recursive Character Splitting with Hierarchy
When chunking, we can use the hierarchy to ensure that headers are prepended to the content:
def chunk_with_context(section_title, text):
    # Prepend the section title to every chunk to maintain context
    chunk_prefix = f"Section: {section_title}\n\n"
    # ... logic to create chunks ...
    return f"{chunk_prefix}{text}"
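The chunking step itself is left open above. As one possible way to fill it in, here is a sketch that packs paragraphs into chunks and prepends the section title to each one. It is a simple paragraph-packing splitter, not a full recursive character splitter, and both the function name `chunk_with_prefix` and the `max_chars` parameter are assumptions made for illustration.

```python
def chunk_with_prefix(section_title: str, text: str, max_chars: int = 200) -> list[str]:
    """Split text on blank lines, then pack paragraphs into chunks.

    Every chunk starts with the section title. max_chars limits the chunk
    body only (the prefix is not counted); a single paragraph longer than
    max_chars is kept whole rather than cut mid-sentence.
    """
    prefix = f"Section: {section_title}\n\n"
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(prefix + current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(prefix + current)
    return chunks


body = (
    "Disconnect power first.\n\n"
    "Wear insulated gloves.\n\n"
    "Test the circuit before touching it."
)
chunks = chunk_with_prefix("Safety Procedures > Electrical Systems", body, max_chars=50)
for chunk in chunks:
    print(repr(chunk))
```

Passing the full heading path as the title, rather than only the nearest heading, keeps the parent section visible in every chunk at the cost of a slightly longer prefix.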
Visualizing Document Trees
It often helps to think of your document as a tree structure:
graph TD
    A[Document: quarterly_report.pdf] --> B[Module 1: Executive Summary]
    A --> C[Module 2: Financial Performance]
    C --> D[Revenue Trends]
    C --> E[Expense Analysis]
    D --> F[North America]
    D --> G[EMEA]
By preserving this tree, you can implement Recursive Retrieval, where you first find the relevant high-level section and then zoom into the specific chunk.
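To make the idea concrete, here is a toy sketch of that top-down walk over the tree shown in the diagram. The `Node` class, the word-overlap score, and the function names are all assumptions for illustration; a real system would typically score sections with embeddings or section summaries rather than counting shared words.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def overlap_score(query: str, node: Node) -> int:
    """Toy relevance score: number of query words found in the title or text."""
    words = set(f"{node.title} {node.text}".lower().split())
    return sum(1 for word in query.lower().split() if word in words)


def subtree_score(query: str, node: Node) -> int:
    """Best score found anywhere in this node's subtree."""
    child_scores = [subtree_score(query, child) for child in node.children]
    return max([overlap_score(query, node)] + child_scores)


def recursive_retrieve(query: str, root: Node) -> list[str]:
    """Walk from the root to a leaf, following the best-scoring child each time."""
    node = root
    path = [node.title]
    while node.children:
        node = max(node.children, key=lambda child: subtree_score(query, child))
        path.append(node.title)
    return path


report = Node("quarterly_report.pdf", children=[
    Node("Module 1: Executive Summary"),
    Node("Module 2: Financial Performance", children=[
        Node("Revenue Trends", children=[Node("North America"), Node("EMEA")]),
        Node("Expense Analysis"),
    ]),
])

path = recursive_retrieve("EMEA revenue trends", report)
print(" > ".join(path))
```

The walk picks a high-level section first and only then compares its children, so chunks from unrelated sections are never scored against the query at the leaf level.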
Exercises
- Take a multi-page PDF with clear headings.
- Use a library like pdfplumber to extract lines that appear in bold or have a larger font size.
- Map these lines as headers for the paragraphs that follow them.
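As a possible starting point for the exercise, the sketch below groups words into lines and treats a line as a header when it is entirely bold or noticeably larger than the most common font size. It relies on pdfplumber's `extract_words(extra_attrs=["fontname", "size"])`; the helper names, the 0.5 pt size margin, and the rule that a bold font has "bold" in its name are assumptions that will not hold for every PDF.

```python
from collections import Counter, defaultdict


def group_words_into_lines(words):
    """Group word dicts (text, x0, top, size, fontname) into lines by vertical position."""
    rows = defaultdict(list)
    for word in words:
        rows[round(word["top"])].append(word)
    lines = []
    for top in sorted(rows):
        row = sorted(rows[top], key=lambda w: w["x0"])  # left-to-right order
        lines.append({
            "text": " ".join(w["text"] for w in row),
            "size": max(w["size"] for w in row),
            # Assumption: bold fonts carry "bold" in their font name.
            "bold": all("bold" in w["fontname"].lower() for w in row),
        })
    return lines


def attach_headers(lines, size_margin=0.5):
    """Label bold or oversized lines as headers and attach them to the lines that follow."""
    if not lines:
        return []
    # The most common font size is taken to be the body-text size.
    body_size = Counter(round(line["size"], 1) for line in lines).most_common(1)[0][0]
    current_header = None
    sections = []
    for line in lines:
        if line["bold"] or line["size"] > body_size + size_margin:
            current_header = line["text"]
        else:
            sections.append({"header": current_header, "text": line["text"]})
    return sections


def extract_sections(pdf_path):
    """Run both helpers over every page of a PDF."""
    import pdfplumber  # third-party; imported here so the helpers above work without it

    all_lines = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            words = page.extract_words(extra_attrs=["fontname", "size"])
            all_lines.extend(group_words_into_lines(words))
    return attach_headers(all_lines)
```

Calling `extract_sections` with the path to your own PDF returns a list of dictionaries with `header` and `text` keys, in the same spirit as `parse_with_hierarchy` above.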