
OCR Accuracy and Error Handling
Techniques for measuring OCR performance, cleaning noisy outputs, and building resilient pipelines.
OCR is never 100% accurate. Misread characters (e.g., 'O' vs '0', 'l' vs '1') change the tokens your system sees, which can degrade both embeddings and search quality. A production-grade RAG system needs strategies to detect and handle these errors.
Measuring Accuracy (WER and CER)
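- Word Error Rate (WER): The number of word-level edits (insertions, deletions, substitutions) needed to turn the OCR output into the reference text, divided by the number of words in the reference.
- Character Error Rate (CER): The same ratio computed over characters instead of words.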
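```python
import Levenshtein  # third-party edit-distance library

def calculate_cer(reference, hypothesis):
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    distance = Levenshtein.distance(reference, hypothesis)
    return distance / len(reference)
```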
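CER has a word-level counterpart. The sketch below is one way to compute WER without a third-party dependency. The names `edit_distance` and `calculate_wer` are illustrative, and splitting on whitespace is an assumption that will not suit every language or tokenization scheme.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def calculate_wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Note that a merged pair such as "thenew" counts as two word errors (one substitution plus one deletion), so WER tends to penalize segmentation mistakes more heavily than CER does.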
Common OCR Errors
- Garbage Text: Random characters like `~` or `|` appearing near lines or edges.
- Merging Words: "TheNew" instead of "The New".
- Broken Formatting: Multi-column text being read line-by-line across columns.
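The first two error types can often be reduced with a few regular expressions. The following is a minimal sketch, not a complete cleaner: `clean_ocr_line` is a hypothetical helper, the set of stray marks is limited to the two characters mentioned above, and the lowercase-to-uppercase split is a heuristic that will also split legitimate words such as "iPhone" or "McDonald". Multi-column reading order is a layout problem and is not addressed here.

```python
import re

# Stray marks that tend to appear where the scanner saw a ruling line or edge.
_STRAY_MARKS = re.compile(r"[~|]+")
# Zero-width boundary between a lowercase and an uppercase letter ("TheNew").
_MERGED_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")
# Runs of spaces or tabs left behind by the substitutions above.
_MULTI_SPACE = re.compile(r"[ \t]{2,}")


def clean_ocr_line(line):
    """Remove stray marks and split likely merged words in one OCR line."""
    line = _STRAY_MARKS.sub(" ", line)
    line = _MERGED_BOUNDARY.sub(" ", line)
    line = _MULTI_SPACE.sub(" ", line)
    return line.strip()
```

Keeping each rule as a separate named pattern makes it easier to disable one when it causes more harm than good on a particular document set.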
Post-Processing OCR Text
Use NLP models or spell-checkers to clean the output.
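```python
from textblob import TextBlob

def spell_check_ocr(text):
    """Run TextBlob's word-frequency spelling correction over OCR output.

    Note: this can also "correct" proper nouns and domain terms it does not know.
    """
    b = TextBlob(text)
    return str(b.correct())
```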
LLM-Based Correction
Since we are building a RAG system, we can use an LLM (like Claude) to "fix" the OCR noise:
```python
prompt = f"Fix the common OCR errors in the following text while preserving the original meaning: {noisy_text}"
```
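To show how such a prompt might sit inside a pipeline, here is a hedged sketch that keeps the model call behind a plain callable. `build_correction_prompt`, `correct_with_llm`, and the `complete` parameter are all hypothetical names; `complete` stands in for whatever client you use and is assumed to take a prompt string and return the model's reply as a string. The `<ocr_text>` delimiters and the "return only the corrected text" instruction are additions for illustration.

```python
def build_correction_prompt(noisy_text):
    """Wrap the OCR text in a correction instruction."""
    return (
        "Fix the common OCR errors in the following text while preserving "
        "the original meaning. Return only the corrected text.\n\n"
        f"<ocr_text>\n{noisy_text}\n</ocr_text>"
    )


def correct_with_llm(noisy_text, complete):
    """Correct OCR noise using `complete`, a hypothetical prompt -> reply callable."""
    if not noisy_text.strip():
        return noisy_text  # nothing to correct, so skip the model call
    corrected = complete(build_correction_prompt(noisy_text)).strip()
    # Fall back to the raw text if the model returned nothing usable.
    return corrected or noisy_text
```

Passing the client in as a function keeps the correction step testable with a stub and avoids tying the sketch to one provider's SDK.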
Building a Resilient Pipeline
- Thresholding: Reject chunks with an `average_ocr_confidence` below a certain level (e.g., 0.70).
- Double-Indexing: Index the "raw" OCR text and the "cleaned" version.
- Fuzzy Search: Use vector embeddings that are robust to small spelling variations.
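The thresholding and double-indexing steps above can be sketched as a single filtering pass. This is an illustration under stated assumptions: `OcrChunk`, `build_index_records`, and the record layout are hypothetical, `clean` is any text-cleaning function you supply, and confidence is assumed to be on a 0.0 to 1.0 scale (some engines report 0 to 100, in which case rescale first).

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.70  # example cutoff from the text; tune on your own data


@dataclass
class OcrChunk:
    chunk_id: str
    text: str
    average_ocr_confidence: float  # assumed to be on a 0.0-1.0 scale


def build_index_records(chunks, clean, min_confidence=MIN_CONFIDENCE):
    """Return (records_to_index, rejected_chunk_ids)."""
    records, rejected = [], []
    for chunk in chunks:
        # Thresholding: drop chunks the OCR engine itself was unsure about.
        if chunk.average_ocr_confidence < min_confidence:
            rejected.append(chunk.chunk_id)
            continue
        # Double-indexing: always keep the raw text...
        records.append(
            {"chunk_id": chunk.chunk_id, "variant": "raw", "text": chunk.text}
        )
        # ...and add the cleaned text only when cleaning changed something.
        cleaned = clean(chunk.text)
        if cleaned != chunk.text:
            records.append(
                {"chunk_id": chunk.chunk_id, "variant": "cleaned", "text": cleaned}
            )
    return records, rejected
```

Returning the rejected IDs rather than silently dropping them makes it possible to route those chunks to a rescan or a stronger model later. Skipping identical cleaned copies is a design choice to avoid duplicate hits at query time.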
Error Handling Scenarios
| Scenario | Strategy |
|---|---|
| Hand-written notes | Use a high-quality Vision model (e.g. Claude) |
| Low-resolution scan | Upscale and use OCR_DPI optimization |
| Mixed languages | Use OCR engines with multi-language support (e.g. Tesseract `-l eng+deu`) |
Exercises
- Intentionally smudge a piece of paper, scan it, and run OCR.
- Identify 3 consistent errors it makes.
- Write a small script to regex-replace or "clean" those specific errors.