
Multimodal Prompting: Asking the Right Questions
Prompts for images are different from prompts for text. Learn techniques like Spatial Referencing and Visual Reasoning.
Multimodal Prompting
When prompting with vision, be precise about spatial and visual details: say where in the image to look and what to look for.
Spatial Referencing
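Gemini understands 2D space reasonably well, so you can refer to parts of an image by position (a corner, an edge, above or below another element).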
- Prompt: "What is the object in the bottom-left corner?"
- Prompt: "Read the text that is physically above the red button."
Bounding Boxes (Output)
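You can ask Gemini to return bounding-box coordinates. The values are normalized to a 0-1000 scale relative to the image's height and width, so they are not raw pixel positions.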
- Prompt: "Return the bounding box of the cat in [ymin, xmin, ymax, xmax] format."
- Result: [200, 300, 500, 600]
- Use: Scale these numbers to the image's pixel dimensions, then draw a box on the image in your UI.
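Turning the normalized result into pixel coordinates can be sketched as follows. This is a minimal illustration, not an official helper: the function names `parse_box` and `to_pixels` are hypothetical, the 1280x720 image size is an assumed example, and the sketch assumes the reply contains one list of four integers in the [ymin, xmin, ymax, xmax] order requested by the prompt.

```python
import json
import re


def parse_box(reply: str) -> list[int]:
    """Extract the first [ymin, xmin, ymax, xmax] list from a model reply.

    Hypothetical helper: assumes the reply contains four plain integers
    in square brackets, possibly surrounded by other text.
    """
    match = re.search(r"\[\s*\d+(?:\s*,\s*\d+){3}\s*\]", reply)
    if match is None:
        raise ValueError("no bounding box found in reply")
    return json.loads(match.group(0))


def to_pixels(box: list[int], width: int, height: int) -> tuple[int, int, int, int]:
    """Scale a 0-1000 normalized box to pixel (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = box
    # y values scale with the image height, x values with the width.
    left = round(xmin * width / 1000)
    top = round(ymin * height / 1000)
    right = round(xmax * width / 1000)
    bottom = round(ymax * height / 1000)
    return left, top, right, bottom


# Example reply from the lesson, applied to an assumed 1280x720 image.
box = parse_box("[200, 300, 500, 600]")
print(to_pixels(box, width=1280, height=720))  # (384, 144, 768, 360)
```

Note that the model returns y before x, while most drawing APIs expect x first. Returning (left, top, right, bottom) keeps the result in the order that rectangle-drawing functions commonly take.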
Optical Character Recognition (OCR)
Gemini is also a strong OCR engine: it can read printed and handwritten text directly from an image.
- Prompt: "Transcribe this handwritten note exactly. Preserve line breaks."
- Capability: It can handle messy handwriting, tables, and multi-column PDFs, which are inputs that traditional OCR tools often struggle with.
Summary
- Ask for "Where" (Spatial).
- Ask for "What does it say" (OCR).
- Ask for coordinates if you need to highlight objects.
In the final lesson of this module, we look at Use Cases.