
Parsing Structured vs Unstructured Documents
Learn to extract content from structured documents (forms, invoices) and unstructured documents (reports, articles) using parsing strategies suited to each.
Document parsing strategies differ based on whether the document has a predictable structure. Understanding this distinction is crucial for effective RAG systems.
Understanding Document Types
Structured Documents:
- Forms with fixed fields
- Invoices with consistent layouts
- Tax documents
- Medical records
- Legal contracts with templates
Unstructured Documents:
- Research papers
- News articles
- Email threads
- Meeting notes
- General reports
Parsing Structured Documents
Structured documents have predictable layouts, allowing for template-based extraction.
import re

from pydantic import BaseModel, Field

# Define expected structure
class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total_amount: float
    vendor_name: str
    line_items: list = Field(default_factory=list)

def parse_structured_invoice(pdf_path):
    """
    Extract data from an invoice using template matching.
    Relies on known field positions and labels.
    """
    text = extract_text_from_pdf(pdf_path)

    # Define patterns for each field; keys match the InvoiceData schema
    patterns = {
        'invoice_number': r'Invoice #:\s*(\w+)',
        'date': r'Date:\s*(\d{2}/\d{2}/\d{4})',
        'total_amount': r'Total:\s*\$?([\d,]+\.?\d*)',
        'vendor_name': r'From:\s*(.+?)(?:\n|$)'
    }

    # Extract using regex patterns
    extracted = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            extracted[field] = match.group(1)

    # Strip thousands separators so "1,234.56" validates as a float
    if 'total_amount' in extracted:
        extracted['total_amount'] = extracted['total_amount'].replace(',', '')

    # Validate against schema
    try:
        invoice = InvoiceData(**extracted)
        return invoice.dict()
    except Exception as e:
        return {'error': f'Failed to parse: {e}', 'raw_text': text}
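parse_structured_invoice leans on an extract_text_from_pdf helper that this lesson doesn't define. A minimal sketch, assuming the pypdf library as the extraction backend (any PDF library would do), might look like this:

from pypdf import PdfReader

def extract_text_from_pdf(pdf_path):
    """Concatenate the text of every page in the PDF."""
    reader = PdfReader(pdf_path)
    return '\n'.join(page.extract_text() or '' for page in reader.pages)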
Why This Works:
- Invoices follow templates
- Field labels are consistent ("Invoice #:", "Total:")
- Layout is predictable
- We can use regex patterns reliably (a quick check follows below)
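As a quick check, the field patterns can be exercised against a synthetic invoice excerpt (the sample text is invented for illustration):

import re

# Synthetic invoice excerpt, invented for illustration
sample = "Invoice #: INV1042\nDate: 03/15/2024\nFrom: Acme Corp\nTotal: $1,234.56"

print(re.search(r'Invoice #:\s*(\w+)', sample).group(1))          # INV1042
print(re.search(r'Date:\s*(\d{2}/\d{2}/\d{4})', sample).group(1)) # 03/15/2024
print(re.search(r'From:\s*(.+?)(?:\n|$)', sample).group(1))       # Acme Corp
print(re.search(r'Total:\s*\$?([\d,]+\.?\d*)', sample).group(1))  # 1,234.56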
Parsing Unstructured Documents
Unstructured documents have variable layouts and require flexible parsing.
import re

def parse_unstructured_article(pdf_path):
    """
    Extract content from a research paper or article.
    Makes no assumptions about field positions.
    """
    # Extract raw text, one entry per page
    pages = extract_pdf_pages(pdf_path)

    # Identify document sections heuristically
    sections = []
    current_section = {'title': 'Introduction', 'content': ''}

    for page in pages:
        text = page['text']

        # Heuristic: lines in ALL CAPS or Title Case might be headers
        lines = text.split('\n')
        for line in lines:
            if is_section_header(line):
                # Save previous section
                if current_section['content']:
                    sections.append(current_section)
                # Start new section
                current_section = {
                    'title': line.strip(),
                    'content': ''
                }
            else:
                current_section['content'] += line + '\n'

    # Add final section
    sections.append(current_section)

    return {
        'type': 'unstructured_article',
        'sections': sections,
        'full_text': '\n'.join(p['text'] for p in pages)
    }

def is_section_header(line):
    """
    Heuristics to detect section headers.
    """
    line = line.strip()

    # Headers are typically short
    if len(line) > 100:
        return False

    # Lines in ALL CAPS or Title Case are likely headers
    if line.isupper() or line.istitle():
        return True

    # Numbered sections (1., 2., I., II., etc.)
    if re.match(r'^[IVX]+\.|^\d+\.', line):
        return True

    return False
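Likewise, extract_pdf_pages is assumed rather than defined here. A minimal per-page sketch, again assuming pypdf, could be:

from pypdf import PdfReader

def extract_pdf_pages(pdf_path):
    """Return a list of {'page': n, 'text': ...} dicts, one per page."""
    reader = PdfReader(pdf_path)
    return [
        {'page': i + 1, 'text': page.extract_text() or ''}
        for i, page in enumerate(reader.pages)
    ]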
Why This Approach:
- Articles have variable structures
- Section titles vary in format
- No fixed field positions
- Need heuristics, not templates (see the example below)
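To see the heuristics fire, here is how is_section_header (from the block above) classifies a few representative lines:

# Assumes is_section_header from the block above is in scope
print(is_section_header("RELATED WORK"))    # True: all caps
print(is_section_header("2. Methodology"))  # True: numbered, title case
print(is_section_header("IV. Results"))     # True: Roman-numeral section
print(is_section_header("We evaluate the model on three benchmark datasets."))  # False: ordinary sentence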
LLM-Based Parsing for Complex Documents
For documents that resist rule-based parsing, use LLMs.
import json

import anthropic

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def llm_parse_document(text):
    """
    Use Claude to extract structure from complex documents.
    Useful when rules and heuristics fail.
    """
    excerpt = text[:3000]  # first 3,000 characters keep the prompt small

    prompt = f"""
Analyze this document and extract its structure.

Identify:
1. Document type (report, article, form, etc.)
2. Main sections and their titles
3. Key entities (names, dates, amounts)
4. Metadata (author, date, title)

Document:
{excerpt}

Return as JSON with keys: document_type, sections, entities, metadata
"""

    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        temperature=0,  # deterministic output
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse the LLM response
    try:
        parsed = json.loads(response.content[0].text)
        return parsed
    except json.JSONDecodeError:
        return {'error': 'Could not parse LLM response', 'raw': response.content[0].text}
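One practical wrinkle: models sometimes wrap their JSON in Markdown code fences even when asked for raw JSON. A small guard before json.loads (a sketch, not part of the original code) makes the parse step more forgiving:

import json
import re

def extract_json(raw_reply):
    """Strip optional ```json ... ``` fences, then parse."""
    cleaned = re.sub(r'^```(?:json)?\s*|\s*```$', '', raw_reply.strip())
    return json.loads(cleaned)

# extract_json('```json\n{"document_type": "report"}\n```') -> {'document_type': 'report'}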
When to Use LLM Parsing:
- Complex, variable layouts
- Mixed structured/unstructured content
- Poor quality scans
- Multiple languages
- Non-standard formats
Hybrid Approach: Best of Both Worlds
def intelligent_parse(pdf_path):
    """
    Automatically choose a parsing strategy based on document analysis.
    """
    # Step 1: Quick analysis of the first page
    sample_text = extract_first_page(pdf_path)

    # Step 2: Classify document type
    doc_type = classify_document(sample_text)

    # Step 3: Route to the appropriate parser
    if doc_type in ['invoice', 'form', 'receipt']:
        # Use template-based parsing
        return parse_structured_invoice(pdf_path)
    elif doc_type in ['article', 'report', 'email']:
        # Use heuristic parsing
        return parse_unstructured_article(pdf_path)
    else:
        # Fall back to LLM parsing
        full_text = extract_all_text(pdf_path)
        return llm_parse_document(full_text)

def classify_document(text):
    """
    Simple classification based on keywords.
    """
    text_lower = text.lower()

    if 'invoice' in text_lower and 'total' in text_lower:
        return 'invoice'
    elif 'abstract' in text_lower and 'references' in text_lower:
        return 'article'
    elif 'from:' in text_lower and 'subject:' in text_lower:
        return 'email'
    else:
        return 'unknown'
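A quick spot check of the classifier on invented first-page snippets shows the routing each one triggers:

# Synthetic snippets, invented for illustration
print(classify_document("INVOICE\nInvoice #: 123\nTotal: $50.00"))        # invoice
print(classify_document("Abstract\nWe propose...\nReferences\n[1] ..."))  # article
print(classify_document("From: alice@example.com\nSubject: Q3 planning")) # email
print(classify_document("Handwritten meeting notes, no clear markers"))   # unknown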
Key Takeaways
- Structured documents: Use templates and regex for reliable extraction
- Unstructured documents: Use heuristics and flexible parsing
- Complex cases: Leverage LLMs for intelligent extraction
- Hybrid approach: Automatically route documents to the right parser
Next lesson: Page-level vs section-level parsing strategies.