
Parsing Structured vs Unstructured Documents
Learn to extract content from structured documents (forms, invoices) and unstructured documents (reports, articles) using parsing strategies suited to each.
Document parsing strategies differ based on whether the document has a predictable structure. Understanding this distinction is crucial for effective RAG systems.
Understanding Document Types
Structured Documents:
- Forms with fixed fields
- Invoices with consistent layouts
- Tax documents
- Medical records
- Legal contracts with templates
Unstructured Documents:
- Research papers
- News articles
- Email threads
- Meeting notes
- General reports
Parsing Structured Documents
Structured documents have predictable layouts, allowing for template-based extraction.
import re

from pydantic import BaseModel, Field

# Define expected structure
class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total_amount: float
    vendor_name: str
    line_items: list = Field(default_factory=list)

def parse_structured_invoice(pdf_path):
    """
    Extract data from an invoice using template matching.
    Relies on known field positions and labels.
    """
    text = extract_text_from_pdf(pdf_path)

    # Define patterns for each field; keys match the InvoiceData schema
    patterns = {
        'invoice_number': r'Invoice #:\s*(\w+)',
        'date': r'Date:\s*(\d{2}/\d{2}/\d{4})',
        'total_amount': r'Total:\s*\$?([\d,]+\.?\d*)',
        'vendor_name': r'From:\s*(.+?)(?:\n|$)'
    }

    # Extract using regex patterns
    extracted = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            extracted[field] = match.group(1)

    # Strip thousands separators so "1,234.56" validates as a float
    if 'total_amount' in extracted:
        extracted['total_amount'] = extracted['total_amount'].replace(',', '')

    # Validate against schema
    try:
        invoice = InvoiceData(**extracted)
        return invoice.dict()
    except Exception as e:
        return {'error': f'Failed to parse: {e}', 'raw_text': text}
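parse_structured_invoice leans on an extract_text_from_pdf helper that this lesson doesn't define. A minimal sketch, assuming the pypdf library as the extraction backend (any PDF library would do), might look like this:

from pypdf import PdfReader

def extract_text_from_pdf(pdf_path):
    """Concatenate the text of every page in the PDF."""
    reader = PdfReader(pdf_path)
    return '\n'.join(page.extract_text() or '' for page in reader.pages)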
Why This Works:
- Invoices follow templates
- Field labels are consistent ("Invoice #:", "Total:")
- Layout is predictable
- We can use regex patterns reliably (a quick check follows below)
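As a quick check, the field patterns can be exercised against a synthetic invoice excerpt (the sample text is invented for illustration):

import re

# Synthetic invoice excerpt, invented for illustration
sample = "Invoice #: INV1042\nDate: 03/15/2024\nFrom: Acme Corp\nTotal: $1,234.56"

print(re.search(r'Invoice #:\s*(\w+)', sample).group(1))          # INV1042
print(re.search(r'Date:\s*(\d{2}/\d{2}/\d{4})', sample).group(1)) # 03/15/2024
print(re.search(r'From:\s*(.+?)(?:\n|$)', sample).group(1))       # Acme Corp
print(re.search(r'Total:\s*\$?([\d,]+\.?\d*)', sample).group(1))  # 1,234.56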
Parsing Unstructured Documents
Unstructured documents have variable layouts and require flexible parsing.
import re

def parse_unstructured_article(pdf_path):
    """
    Extract content from a research paper or article.
    Makes no assumptions about field positions.
    """
    # Extract raw text, one entry per page
    pages = extract_pdf_pages(pdf_path)

    # Identify document sections heuristically
    sections = []
    current_section = {'title': 'Introduction', 'content': ''}

    for page in pages:
        text = page['text']

        # Heuristic: lines in ALL CAPS or Title Case might be headers
        lines = text.split('\n')
        for line in lines:
            if is_section_header(line):
                # Save previous section
                if current_section['content']:
                    sections.append(current_section)
                # Start new section
                current_section = {
                    'title': line.strip(),
                    'content': ''
                }
            else:
                current_section['content'] += line + '\n'

    # Add final section
    sections.append(current_section)

    return {
        'type': 'unstructured_article',
        'sections': sections,
        'full_text': '\n'.join(p['text'] for p in pages)
    }

def is_section_header(line):
    """
    Heuristics to detect section headers.
    """
    line = line.strip()

    # Headers are typically short
    if len(line) > 100:
        return False

    # Lines in ALL CAPS or Title Case are likely headers
    if line.isupper() or line.istitle():
        return True

    # Numbered sections (1., 2., I., II., etc.)
    if re.match(r'^[IVX]+\.|^\d+\.', line):
        return True

    return False
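Likewise, extract_pdf_pages is assumed rather than defined here. A minimal per-page sketch, again assuming pypdf, could be:

from pypdf import PdfReader

def extract_pdf_pages(pdf_path):
    """Return a list of {'page': n, 'text': ...} dicts, one per page."""
    reader = PdfReader(pdf_path)
    return [
        {'page': i + 1, 'text': page.extract_text() or ''}
        for i, page in enumerate(reader.pages)
    ]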
Why This Approach:
- Articles have variable structures
- Section titles vary in format
- No fixed field positions
- Need heuristics, not templates (see the example below)
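To see the heuristics fire, here is how is_section_header (from the block above) classifies a few representative lines:

# Assumes is_section_header from the block above is in scope
print(is_section_header("RELATED WORK"))    # True: all caps
print(is_section_header("2. Methodology"))  # True: numbered, title case
print(is_section_header("IV. Results"))     # True: Roman-numeral section
print(is_section_header("We evaluate the model on three benchmark datasets."))  # False: ordinary sentence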
LLM-Based Parsing for Complex Documents
For documents that resist rule-based parsing, use LLMs.
import json

import anthropic

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def llm_parse_document(text):
    """
    Use Claude to extract structure from complex documents.
    Useful when rules and heuristics fail.
    """
    excerpt = text[:3000]  # first 3,000 characters keep the prompt small

    prompt = f"""
Analyze this document and extract its structure.

Identify:
1. Document type (report, article, form, etc.)
2. Main sections and their titles
3. Key entities (names, dates, amounts)
4. Metadata (author, date, title)

Document:
{excerpt}

Return as JSON with keys: document_type, sections, entities, metadata
"""

    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        temperature=0,  # deterministic output
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse the LLM response
    try:
        parsed = json.loads(response.content[0].text)
        return parsed
    except json.JSONDecodeError:
        return {'error': 'Could not parse LLM response', 'raw': response.content[0].text}
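One practical wrinkle: models sometimes wrap their JSON in Markdown code fences even when asked for raw JSON. A small guard before json.loads (a sketch, not part of the original code) makes the parse step more forgiving:

import json
import re

def extract_json(raw_reply):
    """Strip optional ```json ... ``` fences, then parse."""
    cleaned = re.sub(r'^```(?:json)?\s*|\s*```$', '', raw_reply.strip())
    return json.loads(cleaned)

# extract_json('```json\n{"document_type": "report"}\n```') -> {'document_type': 'report'}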
When to Use LLM Parsing:
- Complex, variable layouts
- Mixed structured/unstructured content
- Poor quality scans
- Multiple languages
- Non-standard formats
Hybrid Approach: Best of Both Worlds
def intelligent_parse(pdf_path):
    """
    Automatically choose a parsing strategy based on document analysis.
    """
    # Step 1: Quick analysis of the first page
    sample_text = extract_first_page(pdf_path)

    # Step 2: Classify document type
    doc_type = classify_document(sample_text)

    # Step 3: Route to the appropriate parser
    if doc_type in ['invoice', 'form', 'receipt']:
        # Use template-based parsing
        return parse_structured_invoice(pdf_path)
    elif doc_type in ['article', 'report', 'email']:
        # Use heuristic parsing
        return parse_unstructured_article(pdf_path)
    else:
        # Fall back to LLM parsing
        full_text = extract_all_text(pdf_path)
        return llm_parse_document(full_text)

def classify_document(text):
    """
    Simple classification based on keywords.
    """
    text_lower = text.lower()

    if 'invoice' in text_lower and 'total' in text_lower:
        return 'invoice'
    elif 'abstract' in text_lower and 'references' in text_lower:
        return 'article'
    elif 'from:' in text_lower and 'subject:' in text_lower:
        return 'email'
    else:
        return 'unknown'
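A quick spot check of the classifier on invented first-page snippets shows the routing each one triggers:

# Synthetic snippets, invented for illustration
print(classify_document("INVOICE\nInvoice #: 123\nTotal: $50.00"))        # invoice
print(classify_document("Abstract\nWe propose...\nReferences\n[1] ..."))  # article
print(classify_document("From: alice@example.com\nSubject: Q3 planning")) # email
print(classify_document("Handwritten meeting notes, no clear markers"))   # unknown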
Key Takeaways
- Structured documents: Use templates and regex for reliable extraction
- Unstructured documents: Use heuristics and flexible parsing
- Complex cases: Leverage LLMs for intelligent extraction
- Hybrid approach: Automatically route documents to the right parser
Next lesson: Page-level vs section-level parsing strategies.