When OCR is Required

Not all documents are easy to read. While "native" PDFs contain actual text data, many documents in the real world are just "digital pictures" of text. This is where Optical Character Recognition (OCR) becomes a necessity for RAG.

The OCR Trigger

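You need OCR whenever text is stored as pixels rather than as encoded characters. Text that has been converted to vector outlines has the same problem: the page shows letters, but there is no text layer for an extractor to read.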

Common OCR Scenarios

Scanned Documents: Paper records that were scanned into PDF format.
Screenshots: Captures of dashboards, code snippets, or error messages.
Embedded Text in Images: Text within diagrams, flowcharts, or infographics.
Faxes: Legacy transmissions that arrive as low-resolution, black-and-white bitmaps.

How to Detect if a PDF Needs OCR

A quick way to check programmatically is to try extracting text. If a page yields no text, or only a handful of characters while most of its area is covered by an image, it likely needs OCR.

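import fitz  # PyMuPDF

def needs_ocr(pdf_path, min_chars=50):
    """Return True if most pages have little or no extractable text."""
    with fitz.open(pdf_path) as doc:
        # Count pages whose text layer is too short to be real content
        sparse = sum(1 for page in doc if len(page.get_text().strip()) <= min_chars)
        return sparse > len(doc) / 2  # Mostly sparse pages: likely scanned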
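
A document-level answer can be too coarse for files that mix native and scanned pages. The following is a minimal per-page sketch, not a definitive implementation. It assumes the caller has already extracted each page's text and computed the fraction of the page area covered by images; the function names, parameter names, and default thresholds are all hypothetical and should be tuned on your own documents.

```python
def page_needs_ocr(text, image_coverage, min_chars=50, min_image_coverage=0.5):
    """Decide whether a single page should be sent to OCR.

    text: text extracted natively from the page (may be empty).
    image_coverage: fraction of the page area covered by images, 0.0 to 1.0.
    Both thresholds are illustrative assumptions, not recommended values.
    """
    has_text_layer = len(text.strip()) > min_chars
    mostly_image = image_coverage >= min_image_coverage
    # Little text on a page that is mostly image: probably a scan.
    # Little text and no large image: probably a blank page, so skip OCR.
    return (not has_text_layer) and mostly_image


def pages_to_ocr(pages):
    """Return the indices of pages that need OCR.

    pages: list of (text, image_coverage) tuples, one per page.
    """
    return [i for i, (text, coverage) in enumerate(pages)
            if page_needs_ocr(text, coverage)]
```

Checking image coverage as well as text length keeps blank pages out of the OCR queue, which matters when OCR is the slowest step of ingestion.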

The Cost of OCR

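OCR is computationally expensive and adds latency to ingestion. It also introduces noise: misrecognized characters, dropped words, and scrambled reading order all end up in your index. For these reasons, treat OCR as a fallback rather than the default.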

Naive RAG: Extract text natively and index whatever comes back.
Robust RAG: Extract text natively first; if that fails or returns gibberish, fall back to OCR.
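
The fallback strategy can be sketched as follows. This is one possible heuristic, not a standard algorithm: "gibberish" is approximated as text that is too short or contains too few readable characters. The function names, the two extractor callables, and the thresholds are hypothetical; plug in whatever native extractor and OCR engine you use.

```python
def looks_like_gibberish(text, min_chars=50, min_readable_ratio=0.7):
    """Rough check for failed extraction (thresholds are assumptions)."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True  # Empty or nearly empty: extraction effectively failed
    punctuation = ".,;:!?'\"()-"
    readable = sum(
        ch.isalnum() or ch.isspace() or ch in punctuation for ch in stripped
    )
    # Broken font encodings tend to produce symbols and replacement characters
    return readable / len(stripped) < min_readable_ratio


def extract_text(source, native_extract, ocr_extract):
    """Try native extraction first and fall back to OCR only when needed.

    native_extract and ocr_extract are caller-supplied functions that take
    the source and return a string. Returns (text, method_used).
    """
    text = native_extract(source)
    if looks_like_gibberish(text):
        return ocr_extract(source), "ocr"
    return text, "native"
```

Returning the method alongside the text lets you log how often the fallback fires, which is a useful signal for estimating OCR cost across a corpus.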

OCR vs. Vision Models

Modern multimodal LLMs (like GPT-4o or Claude 3.5 Sonnet) can "read" text directly from images. However, for large-scale ingestion (thousands of pages), traditional OCR engines (like Tesseract or Amazon Textract) are usually more cost-effective for initial indexing.

Feature	Traditional OCR	Vision LLM
Cost	Low	High
Speed	Fast	Slow
Contextual Understanding	Low	Very High
Accuracy on Handwriting	Medium	High
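
One way to act on this trade-off is a hybrid: run traditional OCR on every page and escalate only the pages it handles poorly. The sketch below is an illustration under stated assumptions, not a prescribed design. It assumes your OCR engine reports a confidence score that you have scaled to the range 0.0 to 1.0 and that a handwriting flag is available from some earlier step; the function name, labels, and the 0.8 threshold are hypothetical.

```python
def choose_reader(ocr_confidence, has_handwriting=False, min_confidence=0.8):
    """Pick 'ocr' or 'vision_llm' for a single page.

    ocr_confidence: mean OCR confidence for the page, scaled to 0.0-1.0.
    min_confidence: illustrative cutoff, to be tuned on real documents.
    """
    if has_handwriting:
        return "vision_llm"  # The table rates vision models higher here
    if ocr_confidence < min_confidence:
        return "vision_llm"  # OCR struggled, so pay for the stronger reader
    return "ocr"  # Cheap and fast path for clean pages
```

With this routing, the expensive model only sees the minority of pages where cheap OCR is likely to fail.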

Exercises

Find a PDF where you cannot select or highlight the text with your mouse.
Use a simple Python script with pdf2image and pytesseract to see how much text it can find.
Compare the output to the text you get from extracting a native PDF directly.
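
A possible starting point for the second exercise is sketched below. It assumes pdf2image and pytesseract are installed, along with the Poppler and Tesseract system binaries they depend on. The function names and the 300 DPI default are illustrative choices, not requirements.

```python
def ocr_pdf(pdf_path, dpi=300):
    """Rasterize each page and run Tesseract on it; returns one string per page."""
    # Imported here so summarize() below works without the OCR stack installed
    from pdf2image import convert_from_path  # requires Poppler
    import pytesseract  # requires the Tesseract binary

    images = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(image) for image in images]


def summarize(page_texts):
    """Return (page_number, character_count, word_count) for each page."""
    stats = []
    for number, text in enumerate(page_texts, start=1):
        stats.append((number, len(text.strip()), len(text.split())))
    return stats
```

Calling summarize(ocr_pdf("scan.pdf")), where scan.pdf is a placeholder for your own file, gives per-page counts. For the third exercise, pass summarize the per-page text extracted from a native PDF and compare the numbers.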