AI Powered Learning Portal

Multimodal Prompting: Asking the Right Questions

May 26, 2026

Prompts for images are different from prompts for text. Learn techniques like Spatial Referencing and Visual Reasoning.

Multimodal Prompting

When prompting with vision, be precise about spatial and visual details.

Spatial Referencing

Gemini understands 2D space reasonably well, so a prompt can refer to objects by their position in the image.

Prompt: "What is the object in the bottom-left corner?"
Prompt: "Read the text that is physically above the red button."
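Spatial phrases like these are handy when a user clicks on an image and you want to ask about that spot. The sketch below is a hypothetical helper and not part of any Gemini SDK: the names `region_name` and `spatial_prompt` are illustrative, and it assumes pixel coordinates with the origin at the top-left corner and a simple 3x3 grid.

```python
def region_name(x: float, y: float, width: int, height: int) -> str:
    """Map a pixel position to a 3x3 grid label such as 'bottom-left'.

    Assumes the origin is the top-left corner and y grows downward.
    """
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("point lies outside the image")
    col = min(int(3 * x / width), 2)   # 0 = left, 1 = center, 2 = right
    row = min(int(3 * y / height), 2)  # 0 = top, 1 = middle, 2 = bottom
    vertical = ("top", "middle", "bottom")[row]
    horizontal = ("left", "center", "right")[col]
    if vertical == "middle" and horizontal == "center":
        return "center"
    return f"{vertical}-{horizontal}"


def spatial_prompt(x: float, y: float, width: int, height: int) -> str:
    """Build a spatial-reference prompt for the clicked position."""
    region = region_name(x, y, width, height)
    return f"What is the object in the {region} region of the image?"


if __name__ == "__main__":
    # A click near the lower-left edge of a 1280x720 image.
    print(spatial_prompt(50, 700, 1280, 720))
```

A coarse grid is a deliberate choice here: it keeps the prompt in plain language, which is how the example prompts above refer to positions.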

Bounding Boxes (Output)

You can ask Gemini to return bounding-box coordinates. The values are normalized to a 0-1000 range, independent of the image's actual pixel size.

Prompt: "Return the bounding box of the cat in [ymin, xmin, ymax, xmax] format."
Result: [200, 300, 500, 600]
Use: Scale these numbers to the image's pixel dimensions, then draw a box on the image in your UI.
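Because the coordinates are normalized to 0-1000, they need to be converted to pixels before drawing. Here is a minimal sketch of that step, assuming the model's reply contains a single bracketed list in [ymin, xmin, ymax, xmax] order. The function names `parse_box` and `to_pixels` are hypothetical, and the 1280x720 image size is only an example.

```python
import json
from typing import List, Tuple


def parse_box(reply: str) -> List[float]:
    """Extract the first [ymin, xmin, ymax, xmax] list from a model reply."""
    start = reply.find("[")
    end = reply.find("]", start)
    if start == -1 or end == -1:
        raise ValueError("no bracketed list found in reply")
    box = json.loads(reply[start:end + 1])
    if len(box) != 4:
        raise ValueError("expected exactly four coordinates")
    if not all(isinstance(v, (int, float)) and 0 <= v <= 1000 for v in box):
        raise ValueError("coordinates must be numbers between 0 and 1000")
    ymin, xmin, ymax, xmax = box
    if ymin > ymax or xmin > xmax:
        raise ValueError("min coordinate is larger than max coordinate")
    return box


def to_pixels(box: List[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized box to pixels, returned as (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = box
    return (
        round(xmin * width / 1000),
        round(ymin * height / 1000),
        round(xmax * width / 1000),
        round(ymax * height / 1000),
    )


if __name__ == "__main__":
    box = parse_box("Result: [200, 300, 500, 600]")
    print(to_pixels(box, width=1280, height=720))  # (384, 144, 768, 360)
```

Note that the model's order puts y first, while many drawing APIs (for example Pillow's `ImageDraw.rectangle`) expect x first, so the sketch reorders the values as well as scaling them.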

Optical Character Recognition (OCR)

Gemini can also serve as a capable OCR engine.

Prompt: "Transcribe this handwritten note exactly. Preserve line breaks."
Capability: It can handle messy handwriting, tables, and multi-column PDFs, inputs that are often difficult for traditional OCR tools.
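How well this works on your own documents is worth measuring. The sketch below compares a transcription against a hand-typed reference using Python's standard `difflib` module. The helper names `similarity` and `line_breaks_preserved` are hypothetical, and the sample strings are made up for illustration rather than taken from real model output.

```python
from difflib import SequenceMatcher


def similarity(reference: str, transcription: str) -> float:
    """Return a 0.0-1.0 character-level similarity score (1.0 = identical)."""
    matcher = SequenceMatcher(None, reference, transcription, autojunk=False)
    return matcher.ratio()


def line_breaks_preserved(reference: str, transcription: str) -> bool:
    """Check that the transcription has the same number of lines as the reference."""
    return len(reference.splitlines()) == len(transcription.splitlines())


if __name__ == "__main__":
    reference = "Buy milk\nCall Sam at 5pm"
    transcription = "Buy milk\nCall Sam at 5 pm"  # illustrative, not real output
    print(f"similarity: {similarity(reference, transcription):.2f}")
    print(f"line breaks preserved: {line_breaks_preserved(reference, transcription)}")
```

Running the same check on output from another OCR tool gives a rough side-by-side comparison on the documents you actually care about.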

Summary

Ask for "Where" (Spatial).
Ask for "What does it say" (OCR).
Ask for coordinates if you need to highlight objects.

In the final lesson of this module, we look at Use Cases.
