PDFs (Native vs Scanned)

Master PDF processing for RAG, handling both native digital PDFs and scanned documents.

PDFs are ubiquitous but awkward to parse reliably. The first question to answer is whether a document is native (born digital, with an embedded text layer) or scanned (an image of text), because that distinction determines the entire extraction pipeline.

PDF Types

graph TD
    A[PDF Document] --> B{Type?}
    B -->|Native| C[Digital Text]
    B -->|Scanned| D[Image of Text]
    
    C --> E[Direct Text Extraction]
    D --> F[OCR Required]
    
    style E fill:#d4edda
    style F fill:#fff3cd
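
Before choosing a pipeline, it helps to classify the document. Below is a minimal detection sketch using PyPDF2; the 3-page sample and 50-character threshold are arbitrary assumptions you should tune for your corpus.

from PyPDF2 import PdfReader

def looks_scanned(file_path, sample_pages=3, min_chars=50):
    # Sample the first few pages; if none of them yields meaningful
    # embedded text, the PDF is probably scanned
    reader = PdfReader(file_path)
    for i, page in enumerate(reader.pages):
        if i >= sample_pages:
            break
        text = page.extract_text() or ''
        if len(text.strip()) >= min_chars:
            return False  # real text layer found -> treat as native
    return True  # no usable text in the sample -> likely scanned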

Native PDFs

Native PDFs carry an embedded, selectable text layer, so extraction is a direct page-by-page read:

import PyPDF2

def extract_native_pdf(file_path):
    with open(file_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        
        pages = []
        for page_num, page in enumerate(reader.pages):
            text = page.extract_text()
            
            pages.append({
                'page': page_num + 1,
                'content': text,
                'type': 'native'
            })
        
        return pages
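
Note that PyPDF2 is no longer actively maintained; its development moved back into the pypdf package, which keeps the same reader API. A sketch of the equivalent code, assuming pypdf 3.x is installed:

from pypdf import PdfReader

def extract_native_pdf_pypdf(file_path):
    # Same page-by-page extraction with the maintained successor library
    reader = PdfReader(file_path)
    return [
        {'page': i + 1, 'content': page.extract_text() or '', 'type': 'native'}
        for i, page in enumerate(reader.pages)
    ]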

Scanned PDFs

Scanned PDFs are just page images with no text layer, so every page has to go through OCR:

from pdf2image import convert_from_path
import pytesseract

def extract_scanned_pdf(file_path):
    # Convert PDF pages to images
    images = convert_from_path(file_path)
    
    pages = []
    for i, image in enumerate(images):
        # OCR each page
        text = pytesseract.image_to_string(image)
        
        pages.append({
            'page': i + 1,
            'content': text,
            'type': 'scanned-ocr'
        })
    
    return pages
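
Keep in mind that pdf2image needs the poppler binaries and pytesseract needs the tesseract binary installed. OCR quality also improves noticeably with a little image cleanup first. A minimal preprocessing sketch, assuming Pillow is available and using an arbitrary binarization threshold of 180 that you should adjust for your scans:

from pdf2image import convert_from_path
from PIL import ImageOps
import pytesseract

def preprocess_for_ocr(image, threshold=180):
    # Grayscale, then binarize with a fixed threshold (an arbitrary
    # starting point -- tune for your scan quality)
    gray = ImageOps.grayscale(image)
    return gray.point(lambda px: 255 if px > threshold else 0)

def extract_scanned_pdf_cleaned(file_path, dpi=300):
    # Higher DPI gives the OCR engine more pixels to work with
    images = convert_from_path(file_path, dpi=dpi)
    return [
        {
            'page': i + 1,
            'content': pytesseract.image_to_string(preprocess_for_ocr(img)),
            'type': 'scanned-ocr',
        }
        for i, img in enumerate(images)
    ]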

Hybrid Approach

In practice you often don't know which type you have up front, so try native extraction first and fall back to OCR when almost nothing comes back:

def process_pdf(file_path):
    # Try native extraction first -- it is fast and lossless when it works
    pages = extract_native_pdf(file_path)
    
    # If almost nothing came back, the PDF is probably scanned: fall back to OCR
    if is_low_quality(pages):
        pages = extract_scanned_pdf(file_path)
    
    return pages

def is_low_quality(pages):
    # Heuristic: very little text per page usually means a scanned PDF
    if not pages:
        return True
    avg_chars_per_page = sum(len(p['content']) for p in pages) / len(pages)
    return avg_chars_per_page < 100
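
The document-level fallback re-runs OCR on every page even when only a few are scanned, and mixed PDFs are common. A per-page variant, sketched below reusing extract_native_pdf and the same 100-character heuristic, only OCRs the pages that need it:

from pdf2image import convert_from_path
import pytesseract

def process_pdf_per_page(file_path, min_chars=100):
    pages = []
    scanned_images = None  # rendered lazily, only if some page needs OCR

    for page in extract_native_pdf(file_path):
        if len(page['content'].strip()) >= min_chars:
            pages.append(page)
            continue

        if scanned_images is None:
            scanned_images = convert_from_path(file_path)

        image = scanned_images[page['page'] - 1]
        pages.append({
            'page': page['page'],
            'content': pytesseract.image_to_string(image),
            'type': 'scanned-ocr',
        })

    return pages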

Advanced: Layout-Aware Extraction

Plain text extraction flattens multi-column layouts and tables. pdfplumber exposes text, tables, and image metadata per page, so that structure can be preserved:

# Use pdfplumber for better layout preservation
import pdfplumber

def extract_with_layout(file_path):
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # Extract text with positions
            text = page.extract_text()
            
            # Extract tables separately
            tables = page.extract_tables()
            
            # Image metadata (bounding boxes); decoding the pixels is a separate step
            images = page.images
            
            yield {
                'text': text,
                'tables': tables,
                'images': images
            }
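
pdfplumber returns each table as a list of rows (lists of cell strings, with None for empty cells). For RAG it usually pays to serialize those tables into text the embedder can handle. A sketch that renders them as Markdown, assuming the first row is the header, and combines them with the page text into per-page chunks via extract_with_layout above:

def table_to_markdown(table):
    # pdfplumber tables are lists of rows; cells can be None
    rows = [[(cell or '').strip() for cell in row] for row in table]
    if not rows:
        return ''
    header, *body = rows
    lines = [
        '| ' + ' | '.join(header) + ' |',
        '| ' + ' | '.join('---' for _ in header) + ' |',
    ]
    lines += ['| ' + ' | '.join(row) + ' |' for row in body]
    return '\n'.join(lines)

def layout_aware_chunks(file_path):
    # One chunk per page: running text plus its tables serialized as Markdown
    for page in extract_with_layout(file_path):
        text = page['text'] or ''
        tables_md = '\n\n'.join(table_to_markdown(t) for t in page['tables'])
        yield (text + '\n\n' + tables_md).strip()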

Best Practices

  • Detection: Always check whether a PDF is native or scanned before picking a pipeline
  • Fallback: Keep an OCR path ready for scanned documents and for native pages with a broken text layer
  • Tables: Extract tables separately with layout-aware tools
  • Images: Process embedded images in their own pipeline

Next: Image processing.
