Parsing Structured vs Unstructured Documents

Learn to extract content from structured documents (forms, invoices) and unstructured documents (reports, articles) with different parsing strategies.

Document parsing strategies differ based on whether the document has a predictable structure. Understanding this distinction is crucial for effective RAG systems.

Understanding Document Types

Structured Documents:

  • Forms with fixed fields
  • Invoices with consistent layouts
  • Tax documents
  • Medical records
  • Legal contracts with templates

Unstructured Documents:

  • Research papers
  • News articles
  • Email threads
  • Meeting notes
  • General reports

Parsing Structured Documents

Structured documents have predictable layouts, allowing for template-based extraction.

import re

from pydantic import BaseModel

# Define expected structure
class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total_amount: float
    vendor_name: str
    line_items: list = []  # not extracted by the regex pass, so default to empty

def parse_structured_invoice(pdf_path):
    """
    Extract data from invoice using template matching.
    Relies on known field positions and labels.
    """
    text = extract_text_from_pdf(pdf_path)
    
    # Define patterns for each field (keys match the schema fields)
    patterns = {
        'invoice_number': r'Invoice #:\s*(\w+)',
        'date': r'Date:\s*(\d{2}/\d{2}/\d{4})',
        'total_amount': r'Total:\s*\$?([\d,]+\.?\d*)',
        'vendor_name': r'From:\s*(.+?)(?:\n|$)'
    }
    
    # Extract using regex patterns
    extracted = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            extracted[field] = match.group(1)
    
    # Amounts like "1,249.50" need the thousands separator removed before float coercion
    if 'total_amount' in extracted:
        extracted['total_amount'] = extracted['total_amount'].replace(',', '')
    
    # Validate against schema
    try:
        invoice = InvoiceData(**extracted)
        return invoice.model_dump()  # use .dict() on pydantic v1
    except Exception as e:
        return {'error': f'Failed to parse: {e}', 'raw_text': text}

Why This Works:

  • Invoices follow templates
  • Field labels are consistent ("Invoice #:", "Total:")
  • Layout is predictable
  • We can use regex patterns reliably
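
To see the template approach in action, here is a minimal sketch that applies the same patterns to a hypothetical invoice snippet (the text and values are made up for illustration; a real pipeline would get the text from extract_text_from_pdf):

import re

sample = "Invoice #: 10421\nDate: 03/15/2024\nFrom: Acme Office Supplies\nTotal: $1,249.50"

print(re.search(r'Invoice #:\s*(\w+)', sample).group(1))           # 10421
print(re.search(r'Date:\s*(\d{2}/\d{2}/\d{4})', sample).group(1))  # 03/15/2024
print(re.search(r'Total:\s*\$?([\d,]+\.?\d*)', sample).group(1))   # 1,249.50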

Parsing Unstructured Documents

Unstructured documents have variable layouts and require flexible parsing.

import re

def parse_unstructured_article(pdf_path):
    """
    Extract content from research paper or article.
    No assumptions about field positions.
    """
    # Extract raw text
    pages = extract_pdf_pages(pdf_path)
    
    # Identify document sections heuristically
    sections = []
    current_section = {'title': 'Introduction', 'content': ''}
    
    for page in pages:
        text = page['text']
        
        # Heuristic: lines in ALL CAPS or Title Case might be headers
        lines = text.split('\n')
        for line in lines:
            if is_section_header(line):
                # Save previous section
                if current_section['content']:
                    sections.append(current_section)
                
                # Start new section
                current_section = {
                    'title': line.strip(),
                    'content': ''
                }
            else:
                current_section['content'] += line + '\n'
    
    # Add final section (only if it has content)
    if current_section['content']:
        sections.append(current_section)
    
    return {
        'type': 'unstructured_article',
        'sections': sections,
        'full_text': '\n'.join(p['text'] for p in pages)
    }

def is_section_header(line):
    """
    Heuristics to detect section headers.
    """
    line = line.strip()
    
    # Check if line is short (headers are typically short)
    if len(line) > 100:
        return False
    
    # Check if line is in title case or all caps
    if line.isupper() or line.istitle():
        return True
    
    # Check for numbered sections (1., 2., I., II., etc.)
    if re.match(r'^[IVX]+\.|^\d+\.', line):
        return True
    
    return False

Why This Approach:

  • Articles have variable structures
  • Section titles vary in format
  • No fixed field positions
  • Need heuristics, not templates
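
A quick sanity check of the header heuristic on a few hypothetical lines, using the is_section_header function defined above (expected results shown in comments):

print(is_section_header("2. RELATED WORK"))         # True  - short, numbered, all caps
print(is_section_header("IV. Experimental Setup"))  # True  - Roman-numeral section prefix
print(is_section_header("We evaluate the model on three benchmark datasets."))  # False - ordinary sentence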

LLM-Based Parsing for Complex Documents

For documents that resist rule-based parsing, use LLMs.

import json

import anthropic

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def llm_parse_document(text):
    """
    Use Claude to extract structure from complex documents.
    Useful when rules and heuristics fail.
    """
    excerpt = text[:3000]  # only the first 3,000 characters, to keep the prompt small

    prompt = f"""
Analyze this document and extract its structure.

Identify:
1. Document type (report, article, form, etc.)
2. Main sections and their titles
3. Key entities (names, dates, amounts)
4. Metadata (author, date, title)

Document:
{excerpt}

Return as JSON with keys: document_type, sections, entities, metadata
    """
    
    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        temperature=0,  # Minimize randomness for consistent output
        messages=[{"role": "user", "content": prompt}]
    )
    
    # Parse LLM response
    raw = response.content[0].text
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {'error': 'Could not parse LLM response', 'raw': raw}

When to Use LLM Parsing:

  • Complex, variable layouts
  • Mixed structured/unstructured content
  • Poor quality scans
  • Multiple languages
  • Non-standard formats
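
Because the model is not guaranteed to return JSON in exactly the requested shape, it can help to validate the parsed result against a schema, mirroring the pydantic pattern used for invoices above. A minimal sketch, where ParsedDocument and its field types are illustrative assumptions rather than part of the original pipeline:

from pydantic import BaseModel

class ParsedDocument(BaseModel):
    # Field names mirror the keys requested in the prompt; types are assumptions
    document_type: str
    sections: list
    entities: dict
    metadata: dict

def llm_parse_with_validation(text):
    parsed = llm_parse_document(text)
    try:
        return ParsedDocument(**parsed).model_dump()  # use .dict() on pydantic v1
    except Exception as e:
        # Keep the raw result so nothing is silently dropped
        return {'error': f'Schema validation failed: {e}', 'raw': parsed}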

Hybrid Approach: Best of Both Worlds

In practice, a pipeline can classify each document first and route it to the cheapest parser that handles it, falling back to LLM parsing only when rules and heuristics are unlikely to work.

def intelligent_parse(pdf_path):
    """
    Automatically choose parsing strategy based on document analysis.
    """
    # Step 1: Quick analysis
    sample_text = extract_first_page(pdf_path)
    
    # Step 2: Classify document type
    doc_type = classify_document(sample_text)
    
    # Step 3: Route to appropriate parser
    if doc_type in ['invoice', 'form', 'receipt']:
        # Use template-based parsing
        return parse_structured_invoice(pdf_path)
    
    elif doc_type in ['article', 'report', 'email']:
        # Use heuristic parsing
        return parse_unstructured_article(pdf_path)
    
    else:
        # Fall back to LLM parsing
        full_text = extract_all_text(pdf_path)
        return llm_parse_document(full_text)

def classify_document(text):
    """
    Simple classification based on keywords.
    """
    text_lower = text.lower()
    
    if 'invoice' in text_lower and 'total' in text_lower:
        return 'invoice'
    elif 'abstract' in text_lower and 'references' in text_lower:
        return 'article'
    elif 'from:' in text_lower and 'subject:' in text_lower:
        return 'email'
    else:
        return 'unknown'
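
A quick check of the keyword classifier on a few hypothetical first-page snippets (expected labels in comments):

print(classify_document("INVOICE\nInvoice #: 10421\nTotal: $1,249.50"))    # invoice
print(classify_document("Abstract\nWe study...\nReferences\n[1] ..."))     # article
print(classify_document("From: alice@example.com\nSubject: Q3 planning"))  # email
print(classify_document("Meeting notes - March 15"))                       # unknown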

Key Takeaways

  1. Structured documents: Use templates and regex for reliable extraction
  2. Unstructured documents: Use heuristics and flexible parsing
  3. Complex cases: Leverage LLMs for intelligent extraction
  4. Hybrid approach: Automatically route documents to the right parser

Next lesson: Page-level vs section-level parsing strategies.
