Preserving Document Hierarchy

Learn how to maintain the parent-child relationships and heading structures during document parsing for RAG.

When we parse complex documents like reports, legal contracts, or technical manuals, the hierarchical structure (Headings, Subheadings, Sections) is just as important as the text itself. Losing this hierarchy means losing the context of where a specific piece of information sits.

Why Hierarchy Matters

Imagine a document with a section titled "Safety Procedures" and a subsection "Electrical Systems". If you extract only the subsection's text without recording that it belongs to "Safety Procedures", the retriever has no way to tell it apart from an "Electrical Systems" subsection that sits under "Maintenance".

Implementing Hierarchical Parsing

Parsing libraries such as Unstructured (which frameworks like LlamaIndex can also call through their document loaders) tag each extracted element with a category such as Title, Header, or NarrativeText.

from unstructured.partition.pdf import partition_pdf

def parse_with_hierarchy(file_path):
    """Pair each body element with the most recent heading above it."""
    elements = partition_pdf(filename=file_path)

    hierarchy = []
    current_header = None  # stays None for text before the first heading

    for element in elements:
        # Unstructured tags section headings as "Title". "Header" marks
        # running page headers, so it is not treated as a heading here.
        if element.category == "Title":
            current_header = element.text
        elif element.category in ("NarrativeText", "ListItem"):
            hierarchy.append({
                "header": current_header,
                "text": element.text,
                "metadata": element.metadata.to_dict()
            })

    return hierarchy
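
The function above remembers only the most recent heading, so a paragraph under "Electrical Systems" still would not know that it sits under "Safety Procedures". One way to keep the full path is a stack of open headings. The sketch below is a minimal illustration, not part of any library: it assumes you already have (level, text) pairs, where level is 1, 2, 3 for headings and None for body text. How you obtain the level depends on your parser, and the name build_heading_paths is hypothetical.

```python
from typing import Iterable, Optional

def build_heading_paths(
    items: Iterable[tuple[Optional[int], str]],
) -> list[dict]:
    """Attach the full heading path to each body paragraph.

    Each item is (level, text): level is 1, 2, 3... for headings and
    None for body text. This input shape is an assumption of the sketch.
    """
    stack: list[tuple[int, str]] = []  # open headings, outermost first
    records = []
    for level, text in items:
        if level is None:
            records.append({"path": [title for _, title in stack], "text": text})
            continue
        # A new heading closes every open heading at the same or deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, text))
    return records

items = [
    (1, "Safety Procedures"),
    (2, "Electrical Systems"),
    (None, "Disconnect power before opening the panel."),
    (1, "Maintenance"),
    (2, "Electrical Systems"),
    (None, "Inspect wiring every six months."),
]
records = build_heading_paths(items)
for record in records:
    print(" > ".join(record["path"]), "|", record["text"])
```

The two "Electrical Systems" paragraphs now carry different paths, which is exactly the distinction the retriever needs.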

Recursive Character Splitting with Hierarchy

When chunking, we can use the hierarchy to ensure that headers are prepended to the content:

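def chunk_with_context(section_title, text, chunk_size=500):
    # Prepend the section title to every chunk to maintain context
    chunk_prefix = f"Section: {section_title}\n\n"
    # Placeholder split into fixed-width pieces; chunk_size is an
    # illustrative default, and any splitter can be substituted here
    pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [f"{chunk_prefix}{piece}" for piece in pieces]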
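
To make the "recursive" part concrete, here is a small sketch of the idea: try the coarsest separator first (blank lines), and fall back to finer ones (single newlines, then spaces) only when a piece is still too long. This is a simplified stand-in written for this lesson, not the implementation from LangChain or any other library. The names recursive_split and chunk_section, the separator list, and max_chars are all assumptions.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most max_chars, preferring coarse separators."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current piece
            continue
        if current.strip():
            pieces.append(current)
        if len(part) <= max_chars:
            current = part
        else:
            # The part alone is too long: recurse with finer separators.
            pieces.extend(recursive_split(part, max_chars, finer))
            current = ""
    if current.strip():
        pieces.append(current)
    return pieces

def chunk_section(section_title, text, max_chars=200):
    """Split one section and prefix every chunk with its heading."""
    prefix = f"Section: {section_title}\n\n"
    return [prefix + piece for piece in recursive_split(text, max_chars)]

text = (
    "Disconnect power before opening the panel. "
    "Verify zero voltage with a tester.\n\n"
    "Wear insulated gloves rated for the circuit voltage."
)
chunks = chunk_section("Safety Procedures > Electrical Systems", text, max_chars=60)
for chunk in chunks:
    print(repr(chunk))
```

Note that max_chars limits the body of each chunk only; the heading prefix is added on top, so budget for it when you size chunks against an embedding model's input limit.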

Visualizing Document Trees

It often helps to think of your document as a tree structure:

graph TD
    A[Document: quarterly_report.pdf] --> B[Section 1: Executive Summary]
    A --> C[Section 2: Financial Performance]
    C --> D[Revenue Trends]
    C --> E[Expense Analysis]
    D --> F[North America]
    D --> G[EMEA]

By preserving this tree, you can implement Recursive Retrieval, where you first find the relevant high-level section and then zoom into the specific chunk.
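
Here is a toy sketch of that top-down descent over a tree shaped like the one in the diagram. Everything in it is illustrative: the Node class, the sample text inside each node, and especially overlap_score, which counts shared words as a stand-in for the embedding similarity a real system would use.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)

def overlap_score(query: str, passage: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def subtree_text(node: Node) -> str:
    """All titles and text beneath a node, used to score a whole branch."""
    parts = [node.title, node.text]
    parts.extend(subtree_text(child) for child in node.children)
    return " ".join(p for p in parts if p)

def recursive_retrieve(node: Node, query: str) -> list[str]:
    """Descend into the best-scoring child at each level; return the titles visited."""
    path = [node.title]
    while node.children:
        node = max(node.children, key=lambda c: overlap_score(query, subtree_text(c)))
        path.append(node.title)
    return path

# Made-up content, arranged like the diagram.
report = Node("quarterly_report.pdf", children=[
    Node("Executive Summary", "Overall results for the quarter."),
    Node("Financial Performance", children=[
        Node("Revenue Trends", children=[
            Node("North America", "Revenue grew on strong enterprise sales."),
            Node("EMEA", "Revenue was flat because of currency effects."),
        ]),
        Node("Expense Analysis", "Operating costs rose due to hiring."),
    ]),
])

path = recursive_retrieve(report, "revenue in EMEA currency")
print(" > ".join(path))
```

This greedy version follows a single branch per level. In practice you would likely keep the top few branches at each step so that one bad score near the root does not hide the right leaf.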

Exercises

  1. Take a multi-page PDF with clear headings.
  2. Use a library like pdfplumber to extract lines that are set in bold or in a larger font size than the body text.
  3. Map these lines as headers for the following paragraphs.
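
If you want a starting point, the sketch below shows one possible approach. It assumes that the most common font size is the body size and that bold fonts have "bold" somewhere in their font name. Both are heuristics that will fail on some PDFs. The helpers detect_headings and lines_from_pdf are hypothetical names. The pdfplumber part groups words into lines by rounded vertical position, which is crude but enough for simple single-column layouts.

```python
from collections import Counter, defaultdict

def detect_headings(lines):
    """Group lines into sections, using font size and a bold font name as heading cues.

    Each line is a dict with "text", "size", and "fontname" keys.
    """
    if not lines:
        return []
    # Treat the most common font size as the body text size.
    body_size = Counter(round(line["size"], 1) for line in lines).most_common(1)[0][0]
    sections, current = [], {"header": None, "paragraphs": []}
    for line in lines:
        is_heading = (
            round(line["size"], 1) > body_size
            or "bold" in line["fontname"].lower()
        )
        if is_heading:
            if current["header"] is not None or current["paragraphs"]:
                sections.append(current)
            current = {"header": line["text"], "paragraphs": []}
        else:
            current["paragraphs"].append(line["text"])
    if current["header"] is not None or current["paragraphs"]:
        sections.append(current)
    return sections

def lines_from_pdf(path):
    """Build line dicts from a PDF using pdfplumber's word extraction."""
    import pdfplumber  # imported here so detect_headings works without it

    lines = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            rows = defaultdict(list)
            for word in page.extract_words(extra_attrs=["fontname", "size"]):
                rows[round(word["top"])].append(word)
            for top in sorted(rows):
                words = sorted(rows[top], key=lambda w: w["x0"])
                lines.append({
                    "text": " ".join(w["text"] for w in words),
                    # The first word's font stands in for the whole line.
                    "size": words[0]["size"],
                    "fontname": words[0]["fontname"],
                })
    return lines

# Synthetic lines, so the heading logic can be tried without a PDF.
sample = [
    {"text": "Safety Procedures", "size": 16.0, "fontname": "Helvetica-Bold"},
    {"text": "Always wear protective equipment.", "size": 10.0, "fontname": "Helvetica"},
    {"text": "Electrical Systems", "size": 10.0, "fontname": "Helvetica-Bold"},
    {"text": "Disconnect power first.", "size": 10.0, "fontname": "Helvetica"},
    {"text": "Verify zero voltage.", "size": 10.0, "fontname": "Helvetica"},
]
sections = detect_headings(sample)
for section in sections:
    print(section["header"], "->", len(section["paragraphs"]), "line(s)")
```

With a real file you would call detect_headings(lines_from_pdf("your_file.pdf")) and then check the detected headers by eye before trusting them.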
