
When OCR is Required
Identify the triggers for Optical Character Recognition (OCR) and learn how to detect non-searchable document components.
Not all documents are easy to read. While "native" PDFs contain actual text data, many documents in the real world are just "digital pictures" of text. This is where Optical Character Recognition (OCR) becomes a necessity for RAG.
The OCR Trigger
You need OCR whenever text is stored as pixels rather than as character data that a parser can extract directly.
Common OCR Scenarios
- Scanned Documents: Paper records that were scanned into PDF format.
- Screenshots: Captures of dashboards, code snippets, or error messages.
- Embedded Text in Images: Text within diagrams, flowcharts, or infographics.
- Faxes: Legacy digital formats that are essentially bitmaps.
How to Detect if a PDF Needs OCR
A quick programmatic check is to attempt native text extraction. If a page yields little or no text, especially when the page is dominated by an image, it likely needs OCR.
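import fitz  # PyMuPDF

def needs_ocr(pdf_path, min_chars=50):
    """Return True if no page yields a meaningful amount of native text."""
    with fitz.open(pdf_path) as doc:  # closes the file when done
        for page in doc:
            text = page.get_text()
            # More than min_chars characters suggests a real text layer.
            if len(text.strip()) > min_chars:
                return False  # Likely a native PDF
    return True  # Likely scanned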
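The check above stops at the first page that contains text, so a document mixing native and scanned pages is reported as native. A minimal per-page sketch follows. The helper names `pages_needing_ocr` and `extract_page_texts` are illustrative, and the 50-character threshold is carried over from the function above as an assumption, not a tuned value.

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 50) -> List[int]:
    """Return zero-based indexes of pages whose native text is too short."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) <= min_chars
    ]


def extract_page_texts(pdf_path: str) -> List[str]:
    """Collect the native text of every page using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the check above works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


# Made-up page texts: the middle page stands in for a scan with no text layer.
sample_pages = ["A" * 200, "", "Native text. " * 10]
print(pages_needing_ocr(sample_pages))  # prints [1]
```

With a real file, `pages_needing_ocr(extract_page_texts(path))` would give the pages to send to OCR while the rest keep their native text.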
The Cost of OCR
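OCR is computationally expensive and adds latency. It also introduces noise in the form of misrecognized characters, which then end up in your index, so run it only where it is needed.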
- Naive RAG: Try to extract text natively first.
- Robust RAG: If native extraction fails or returns gibberish, fall back to OCR.
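The fallback strategy can be sketched as below. This is one possible heuristic, not a standard: the function names, the 50-character minimum, and the 0.8 "clean character" ratio used to flag gibberish are all assumptions chosen for illustration. The OCR step is passed in as a callable so that it only runs when native extraction looks unusable.

```python
from typing import Callable

# Punctuation treated as normal in prose; everything else non-alphanumeric
# (for example the U+FFFD replacement character) counts against the text.
COMMON_PUNCTUATION = ".,;:!?'\"()-%/"


def looks_usable(text: str, min_chars: int = 50, min_clean_ratio: float = 0.8) -> bool:
    """Heuristic: enough text, and most characters look like ordinary prose."""
    stripped = text.strip()
    if len(stripped) <= min_chars:
        return False
    clean = sum(
        ch.isalnum() or ch.isspace() or ch in COMMON_PUNCTUATION
        for ch in stripped
    )
    return clean / len(stripped) >= min_clean_ratio


def extract_with_fallback(native_text: str, run_ocr: Callable[[], str]) -> str:
    """Prefer native text; call the (expensive) OCR step only when needed."""
    if looks_usable(native_text):
        return native_text
    return run_ocr()
```

Passing a callable rather than pre-computed OCR output keeps the expensive path lazy, which is the point of trying native extraction first.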
OCR vs. Vision Models
Modern multimodal LLMs (like GPT-4o or Claude 3.5 Sonnet) can "read" text directly from images. However, for large-scale ingestion (thousands of pages), traditional OCR engines (like Tesseract or Amazon Textract) are usually more cost-effective for initial indexing.
| Feature | Traditional OCR | Vision LLM |
|---|---|---|
| Cost | Low | High |
| Speed | Fast | Slow |
| Contextual Understanding | Low | Very High |
| Accuracy on Handwriting | Medium | High |
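One way to act on the trade-offs in the table is to send every page through traditional OCR and escalate only the difficult ones to a vision model. The sketch below is hypothetical: `PageOcrResult`, `choose_reader`, and the confidence cutoff of 60 are illustrative assumptions. The mean confidence could come, for example, from the `conf` values returned by `pytesseract.image_to_data`, and the handwriting flag from an upstream classifier or a manual tag.

```python
from dataclasses import dataclass


@dataclass
class PageOcrResult:
    page_index: int
    mean_confidence: float  # 0-100, averaged over recognized words (assumed scale)
    is_handwritten: bool    # assumed to be supplied by an earlier step


def choose_reader(result: PageOcrResult, min_confidence: float = 60.0) -> str:
    """Keep cheap OCR output unless the page is handwritten or low-confidence."""
    if result.is_handwritten or result.mean_confidence < min_confidence:
        return "vision_llm"
    return "traditional_ocr"


# Illustrative values only, not measurements.
pages = [
    PageOcrResult(page_index=0, mean_confidence=92.0, is_handwritten=False),
    PageOcrResult(page_index=1, mean_confidence=41.5, is_handwritten=False),
    PageOcrResult(page_index=2, mean_confidence=88.0, is_handwritten=True),
]
for page in pages:
    print(page.page_index, choose_reader(page))
```

The cutoff would need tuning against your own documents; the structure simply keeps the expensive reader off the pages that traditional OCR already handles well.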
Exercises
- Find a PDF where you cannot select or highlight the text with your mouse.
- Use a simple Python script with `pdf2image` and `pytesseract` to see how much text it can find.
- Compare the output to a "Native" PDF.
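A possible starting point for the exercise is sketched below. It assumes `pdf2image` (which needs Poppler installed) and `pytesseract` (which needs the Tesseract binary) are available. The function names and the 300 DPI setting are illustrative choices, not requirements.

```python
from typing import List


def ocr_pdf_pages(pdf_path: str, dpi: int = 300) -> List[str]:
    """Rasterize each PDF page and run Tesseract on the resulting image."""
    # Imported here so the counting helper below works without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return [pytesseract.image_to_string(image) for image in images]


def count_characters(page_texts: List[str]) -> List[int]:
    """Number of non-whitespace characters recognized on each page."""
    return [len("".join(text.split())) for text in page_texts]
```

Calling `count_characters(ocr_pdf_pages("scanned.pdf"))`, where `scanned.pdf` is a placeholder file name, gives a per-page count that you can compare with the length of the natively extracted text from a native PDF.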