PDFs (Native vs Scanned)

Master PDF processing for RAG, handling both native digital PDFs and scanned documents.

PDFs are ubiquitous but awkward to parse reliably. The first question to answer is whether a document is native (born digital, with an embedded text layer) or scanned (an image of text), because that distinction determines the entire extraction pipeline.

PDF Types

graph TD
    A[PDF Document] --> B{Type?}
    B -->|Native| C[Digital Text]
    B -->|Scanned| D[Image of Text]
    
    C --> E[Direct Text Extraction]
    D --> F[OCR Required]
    
    style E fill:#d4edda
    style F fill:#fff3cd
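
Before choosing a pipeline, it helps to classify the document. Below is a minimal detection sketch using PyPDF2; the 3-page sample and 50-character threshold are arbitrary assumptions you should tune for your corpus.

from PyPDF2 import PdfReader

def looks_scanned(file_path, sample_pages=3, min_chars=50):
    # Sample the first few pages; if none of them yields meaningful
    # embedded text, the PDF is probably scanned
    reader = PdfReader(file_path)
    for i, page in enumerate(reader.pages):
        if i >= sample_pages:
            break
        text = page.extract_text() or ''
        if len(text.strip()) >= min_chars:
            return False  # real text layer found -> treat as native
    return True  # no usable text in the sample -> likely scanned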

Native PDFs

Native PDFs carry an embedded, selectable text layer, so extraction is a direct page-by-page read:

import PyPDF2

def extract_native_pdf(file_path):
    with open(file_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        
        pages = []
        for page_num, page in enumerate(reader.pages):
            text = page.extract_text()
            
            pages.append({
                'page': page_num + 1,
                'content': text,
                'type': 'native'
            })
        
        return pages
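
Note that PyPDF2 is no longer actively maintained; its development moved back into the pypdf package, which keeps the same reader API. A sketch of the equivalent code, assuming pypdf 3.x is installed:

from pypdf import PdfReader

def extract_native_pdf_pypdf(file_path):
    # Same page-by-page extraction with the maintained successor library
    reader = PdfReader(file_path)
    return [
        {'page': i + 1, 'content': page.extract_text() or '', 'type': 'native'}
        for i, page in enumerate(reader.pages)
    ]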

Scanned PDFs

Scanned PDFs are just page images with no text layer, so every page has to go through OCR:

from pdf2image import convert_from_path
import pytesseract

def extract_scanned_pdf(file_path):
    # Convert PDF pages to images
    images = convert_from_path(file_path)
    
    pages = []
    for i, image in enumerate(images):
        # OCR each page
        text = pytesseract.image_to_string(image)
        
        pages.append({
            'page': i + 1,
            'content': text,
            'type': 'scanned-ocr'
        })
    
    return pages
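
Keep in mind that pdf2image needs the poppler binaries and pytesseract needs the tesseract binary installed. OCR quality also improves noticeably with a little image cleanup first. A minimal preprocessing sketch, assuming Pillow is available and using an arbitrary binarization threshold of 180 that you should adjust for your scans:

from pdf2image import convert_from_path
from PIL import ImageOps
import pytesseract

def preprocess_for_ocr(image, threshold=180):
    # Grayscale, then binarize with a fixed threshold (an arbitrary
    # starting point -- tune for your scan quality)
    gray = ImageOps.grayscale(image)
    return gray.point(lambda px: 255 if px > threshold else 0)

def extract_scanned_pdf_cleaned(file_path, dpi=300):
    # Higher DPI gives the OCR engine more pixels to work with
    images = convert_from_path(file_path, dpi=dpi)
    return [
        {
            'page': i + 1,
            'content': pytesseract.image_to_string(preprocess_for_ocr(img)),
            'type': 'scanned-ocr',
        }
        for i, img in enumerate(images)
    ]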

Hybrid Approach

In practice you often don't know which type you have up front, so try native extraction first and fall back to OCR when almost nothing comes back:

def process_pdf(file_path):
    # Try native extraction first -- it is fast and lossless when it works
    pages = extract_native_pdf(file_path)
    
    # If almost nothing came back, the PDF is probably scanned: fall back to OCR
    if is_low_quality(pages):
        pages = extract_scanned_pdf(file_path)
    
    return pages

def is_low_quality(pages):
    # Heuristic: very little text per page usually means a scanned PDF
    if not pages:
        return True
    avg_chars_per_page = sum(len(p['content']) for p in pages) / len(pages)
    return avg_chars_per_page < 100
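
The document-level fallback re-runs OCR on every page even when only a few are scanned, and mixed PDFs are common. A per-page variant, sketched below reusing extract_native_pdf and the same 100-character heuristic, only OCRs the pages that need it:

from pdf2image import convert_from_path
import pytesseract

def process_pdf_per_page(file_path, min_chars=100):
    pages = []
    scanned_images = None  # rendered lazily, only if some page needs OCR

    for page in extract_native_pdf(file_path):
        if len(page['content'].strip()) >= min_chars:
            pages.append(page)
            continue

        if scanned_images is None:
            scanned_images = convert_from_path(file_path)

        image = scanned_images[page['page'] - 1]
        pages.append({
            'page': page['page'],
            'content': pytesseract.image_to_string(image),
            'type': 'scanned-ocr',
        })

    return pages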

Advanced: Layout-Aware Extraction

Plain text extraction flattens multi-column layouts and tables. pdfplumber exposes text, tables, and image metadata per page, so that structure can be preserved:

# Use pdfplumber for better layout preservation
import pdfplumber

def extract_with_layout(file_path):
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # Extract text with positions
            text = page.extract_text()
            
            # Extract tables separately
            tables = page.extract_tables()
            
            # Image metadata (bounding boxes); decoding the pixels is a separate step
            images = page.images
            
            yield {
                'text': text,
                'tables': tables,
                'images': images
            }
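
pdfplumber returns each table as a list of rows (lists of cell strings, with None for empty cells). For RAG it usually pays to serialize those tables into text the embedder can handle. A sketch that renders them as Markdown, assuming the first row is the header, and combines them with the page text into per-page chunks via extract_with_layout above:

def table_to_markdown(table):
    # pdfplumber tables are lists of rows; cells can be None
    rows = [[(cell or '').strip() for cell in row] for row in table]
    if not rows:
        return ''
    header, *body = rows
    lines = [
        '| ' + ' | '.join(header) + ' |',
        '| ' + ' | '.join('---' for _ in header) + ' |',
    ]
    lines += ['| ' + ' | '.join(row) + ' |' for row in body]
    return '\n'.join(lines)

def layout_aware_chunks(file_path):
    # One chunk per page: running text plus its tables serialized as Markdown
    for page in extract_with_layout(file_path):
        text = page['text'] or ''
        tables_md = '\n\n'.join(table_to_markdown(t) for t in page['tables'])
        yield (text + '\n\n' + tables_md).strip()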

Best Practices

  • Detection: Always check whether a PDF is native or scanned before picking a pipeline
  • Fallback: Keep an OCR path ready for scanned documents and for native pages with a broken text layer
  • Tables: Extract tables separately with layout-aware tools
  • Images: Process embedded images in their own pipeline

Next: Image processing.
