
PDFs (Native vs Scanned)
Master PDF processing for RAG, handling both native digital PDFs and scanned documents.
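PDFs are ubiquitous but hard to parse reliably. A native PDF stores its text as characters that can be read directly, while a scanned PDF stores only images of pages, so the right extraction method depends on which kind you have.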
PDF Types
graph TD
A[PDF Document] --> B{Type?}
B -->|Native| C[Digital Text]
B -->|Scanned| D[Image of Text]
C --> E[Direct Text Extraction]
D --> F[OCR Required]
style E fill:#d4edda
style F fill:#fff3cd
Native PDFs
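import PyPDF2  # PyPDF2 is deprecated; its successor, pypdf, provides the same PdfReader class

def extract_native_pdf(file_path):
    """Extract the embedded text layer from each page of a native PDF."""
    pages = []
    with open(file_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        for page_num, page in enumerate(reader.pages):
            # Normalize a missing result to an empty string
            text = page.extract_text() or ''
            pages.append({
                'page': page_num + 1,  # 1-based page number
                'content': text,
                'type': 'native'
            })
    return pages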
Scanned PDFs
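from pdf2image import convert_from_path  # needs Poppler installed on the system
import pytesseract  # needs the Tesseract binary installed on the system

def extract_scanned_pdf(file_path):
    """Rasterize each page and run OCR on the resulting image."""
    # Convert PDF pages to images (one PIL image per page)
    images = convert_from_path(file_path)
    pages = []
    for i, image in enumerate(images):
        # OCR each page
        text = pytesseract.image_to_string(image)
        pages.append({
            'page': i + 1,  # 1-based page number
            'content': text,
            'type': 'scanned-ocr'
        })
    return pages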
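OCR output varies in quality, so it can help to record how confident Tesseract was on each page. The sketch below assumes the word-level dictionary returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, which carries parallel `conf` and `text` lists. The helper name `mean_ocr_confidence` is hypothetical and is not part of pytesseract.

```python
def mean_ocr_confidence(data):
    """Average word confidence (0-100) from an image_to_data-style dict.

    `data` must hold parallel 'conf' and 'text' lists. Rows with a
    confidence of -1 or blank text are layout boxes, not words, so they
    are skipped.
    """
    scores = []
    for conf, word in zip(data['conf'], data['text']):
        conf = float(conf)  # some pytesseract versions return strings
        if conf >= 0 and word.strip():
            scores.append(conf)
    # No recognized words at all: report zero confidence
    return sum(scores) / len(scores) if scores else 0.0
```

A page whose mean confidence falls below some cutoff (the right value depends on the corpus) could be flagged for review rather than indexed as is.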
Hybrid Approach
def process_pdf(file_path):
    # Try native extraction first
    pages = extract_native_pdf(file_path)
    # Fall back to OCR if the text layer is missing or too sparse
    if is_low_quality(pages):
        pages = extract_scanned_pdf(file_path)
    return pages

def is_low_quality(pages):
    # Heuristic: if very little text was extracted, the PDF is likely scanned
    if not pages:
        return True  # nothing extracted; also avoids dividing by zero
    avg_chars_per_page = sum(len(p['content']) for p in pages) / len(pages)
    return avg_chars_per_page < 100
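A document-level average can hide mixed PDFs, where a few scanned pages sit among native ones. The following is a minimal sketch of a per-page check, assuming the page dictionaries produced by `extract_native_pdf`. The function name `pages_needing_ocr` and the default threshold of 100 characters are illustrative assumptions, not tuned values.

```python
def pages_needing_ocr(pages, min_chars=100):
    """Return the 1-based numbers of pages whose extracted text looks too short.

    `pages` is a list of dicts with 'page' and 'content' keys.
    `min_chars` is an assumed threshold; adjust it for your documents.
    """
    flagged = []
    for p in pages:
        content = p.get('content') or ''
        # Count only non-whitespace characters so that blank lines
        # do not make an empty page look populated.
        visible = sum(1 for ch in content if not ch.isspace())
        if visible < min_chars:
            flagged.append(p['page'])
    return flagged
```

Only the flagged pages would then need to be rasterized and sent through OCR, which keeps the faster native path for the rest of the document.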
Advanced: Layout-Aware Extraction
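# Use pdfplumber for better layout preservation
import pdfplumber

def extract_with_layout(file_path):
    """Yield text, tables, and image descriptors for each page."""
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # Extract page text as a plain string
            # (use page.extract_words() when word positions are needed)
            text = page.extract_text() or ''
            # Extract tables separately, as lists of rows of cell values
            tables = page.extract_tables()
            # Embedded image descriptors (position and size), not decoded images
            images = page.images
            yield {
                'text': text,
                'tables': tables,
                'images': images
            }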
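Tables extracted this way arrive as nested lists, which usually need to be turned into text before chunking and embedding. One possible sketch renders a single table as Markdown, assuming the list-of-rows shape returned by `extract_tables()` with `None` for empty cells. The helper name `table_to_markdown` is hypothetical, and treating the first row as the header is an assumption that will not hold for every table.

```python
def table_to_markdown(table):
    """Render one table (a list of rows, each a list of cell values) as Markdown."""
    if not table:
        return ''
    width = max(len(row) for row in table)
    if width == 0:
        return ''

    def clean(cell):
        # Empty cells are None; newlines and pipes would break the row layout.
        text = '' if cell is None else str(cell)
        return text.replace('\n', ' ').replace('|', '\\|').strip()

    # Pad short rows so every row has the same number of columns.
    rows = [[clean(c) for c in row] + [''] * (width - len(row)) for row in table]
    header, body = rows[0], rows[1:]  # assumption: first row is the header
    lines = [
        '| ' + ' | '.join(header) + ' |',
        '| ' + ' | '.join(['---'] * width) + ' |',
    ]
    lines += ['| ' + ' | '.join(r) + ' |' for r in body]
    return '\n'.join(lines)
```

Keeping each table as its own chunk, separate from the page text, tends to preserve row and column relationships that plain text extraction flattens.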
Best Practices
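- Detection: Check whether each PDF is native or scanned before choosing an extractor
- Fallback: Keep an OCR path ready for scanned documents and for native extraction that returns little text
- Tables: Extract tables separately with specialized tools
- Images: Process embedded images separately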
Next: Image processing.