
Image Preprocessing for Retrieval
Optimize images for vector search and visual content extraction in RAG systems.
In Multimodal RAG, we don't just "store" images; we must prepare them so that search models (like CLIP) or vision models can understand them effectively.
Resizing and Normalization
Most visual embedding models (such as CLIP or ViT) expect a fixed input size (e.g., 224x224 or 336x336 pixels), so images are resized and their pixel values normalized before embedding.
from PIL import Image
from torchvision import transforms

def preprocess_image(image_path):
    # Load the image and force three channels (drops alpha, converts grayscale).
    img = Image.open(image_path).convert('RGB')
    transform = transforms.Compose([
        transforms.Resize((224, 224)),      # match the model's expected input size
        transforms.ToTensor(),              # PIL image -> float tensor in [0, 1]
        transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                             std=[0.229, 0.224, 0.225])
    ])
    return transform(img)
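The mean and std values above are the standard ImageNet statistics; when you target a specific checkpoint, prefer the preprocessing that ships with it (for example, the processor bundled with a Hugging Face CLIP model). As a minimal sketch, batching several preprocessed images for an encoder might look like the following; the file names and the encoder call are placeholders, not a specific API:

import torch

# Stack individual (3, 224, 224) tensors into a single (N, 3, 224, 224) batch.
paths = ['chart.png', 'photo.jpg', 'scan.png']   # hypothetical file names
batch = torch.stack([preprocess_image(p) for p in paths])

# embeddings = vision_encoder(batch)   # placeholder for your embedding model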
Enhancement Techniques
If images are dark, blurry, or low-contrast, your RAG system may fail to retrieve them. Three common fixes, sketched in code after this list:
- Contrast Stretching: Useful for medical or technical imagery.
- Denoising: Essential for low-light photos.
- Binarization: Good for text-heavy images.
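A minimal sketch of all three, assuming OpenCV (cv2) is installed; the function name and parameter values are illustrative rather than prescriptive:

import cv2

def enhance_for_indexing(image_path):
    # Illustrative cleanup pipeline; tune (or skip) steps per corpus.
    img = cv2.imread(image_path)

    # Contrast stretching: spread pixel intensities across the full 0-255 range.
    stretched = cv2.normalize(img, None, alpha=0, beta=255, norm_type=cv2.NORM_MINMAX)

    # Denoising: non-local means handles typical low-light photographic noise.
    denoised = cv2.fastNlMeansDenoisingColored(stretched, None, 10, 10, 7, 21)

    # Binarization: Otsu thresholding suits text-heavy images (scans, slides).
    gray = cv2.cvtColor(denoised, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    return denoised, binary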
Capturing Metadata during Ingestion
When you ingest an image, you should also calculate (see the sketch after this list):
- Aspect Ratio: To help with UI display later.
- Histogram: To detect image quality.
- Dominant Colors: Can be useful for some retrieval scenarios.
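A sketch using Pillow; the function name and n_colors parameter are illustrative, not from any particular library:

from PIL import Image

def extract_image_metadata(image_path, n_colors=5):
    img = Image.open(image_path).convert('RGB')
    width, height = img.size

    # Aspect ratio: handy for thumbnails and layout at query time.
    aspect_ratio = width / height

    # Grayscale histogram (256 bins): a crude quality signal, e.g. mostly-dark images.
    histogram = img.convert('L').histogram()

    # Dominant colors: quantize to a small palette and read the palette back.
    quantized = img.quantize(colors=n_colors)
    palette = quantized.getpalette()[:n_colors * 3]
    dominant_colors = [tuple(palette[i:i + 3]) for i in range(0, len(palette), 3)]

    return {
        'width': width,
        'height': height,
        'aspect_ratio': round(aspect_ratio, 3),
        'histogram': histogram,
        'dominant_colors': dominant_colors,
    }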
Visual Feature Extraction
Before indexing, we often run several "feature detectors" and store their outputs as metadata (a sketch follows the list):
- Object Detection: "There is a car at [x,y]."
- Face Detection: "There is a person present."
- Color Palettes: Useful for design/fashion RAG.
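As one concrete example, face detection can be done with the Haar cascade that ships with OpenCV; general object detection would typically swap in a dedicated detector model instead. The function name and thresholds below are illustrative:

import cv2

def detect_faces(image_path):
    # Haar cascade bundled with opencv-python; a heavier pipeline might use a
    # dedicated object detector for arbitrary classes.
    cascade_path = cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
    face_cascade = cv2.CascadeClassifier(cascade_path)

    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    # Store the result alongside the image's embedding as filterable metadata.
    return {
        'face_count': len(faces),
        'face_boxes': [tuple(int(v) for v in box) for box in faces],  # (x, y, w, h)
    }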
Case Study: Diagrams
For diagrams, preprocessing can mean more than pixel-level cleanup: it might involve separating the text labels from the arrows and shapes using computer vision techniques.
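One way to sketch this, assuming Tesseract and the pytesseract bindings are installed (the function name and confidence threshold are illustrative): OCR the diagram, record each text label with its bounding box, and white the labels out so only the line art remains.

import cv2
import pytesseract

def split_labels_from_linework(diagram_path):
    img = cv2.imread(diagram_path)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

    labels = []
    linework = img.copy()
    for i, text in enumerate(data['text']):
        if text.strip() and float(data['conf'][i]) > 60:   # keep confident OCR hits
            x, y, w, h = (data['left'][i], data['top'][i],
                          data['width'][i], data['height'][i])
            labels.append({'text': text, 'box': (x, y, w, h)})
            # White out the label so only arrows and shapes remain.
            cv2.rectangle(linework, (x, y), (x + w, y + h), (255, 255, 255), -1)

    return labels, linework

The extracted labels can then be indexed as text, while the remaining line art is embedded visually.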
Exercises
- Download a "noisy" or blurry image.
- Use Python's cv2 or PIL to sharpen it (one possible starting point is sketched after the exercises).
- Compare the visual quality before and after. How might this affect a search engine's ability to find it?
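One possible starting point for the sharpening step, assuming Pillow and a hypothetical local file name:

from PIL import Image, ImageFilter

# Unsharp masking: a standard, controllable sharpening filter.
blurry = Image.open('blurry_sample.jpg')        # hypothetical file name
sharpened = blurry.filter(ImageFilter.UnsharpMask(radius=2, percent=150, threshold=3))
sharpened.save('sharpened_sample.jpg')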