
Combining Modalities: Text + Image + Video
The power of mixing inputs. Learn to build advanced prompts that reference multiple images, text instructions, and video snippets simultaneously.
Combining Modalities
You aren't limited to one image. A prompt can be a list that mixes several images, text, and video.
Comparing Images
"Which product looks newer?"
import PIL.Image

img1 = PIL.Image.open('phone_v1.jpg')
img2 = PIL.Image.open('phone_v2.jpg')
# Gemini keeps the list order: img1 is the first image, img2 is the second.
prompt = ["Compare these two phones. Which one has a notch?", img1, img2]
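With more than two images, one option is to put a short text label in front of each image so the question can refer to them by number. The following is a minimal sketch, not part of any SDK: `build_comparison_prompt` is a hypothetical helper, and placeholder strings stand in for the PIL images so the example runs without any image files.

```python
def build_comparison_prompt(question, images):
    """Return a prompt list: the question, then a text label before each image.

    Hypothetical helper for illustration; it is not part of the Gemini SDK.
    `images` can hold any objects the model accepts, e.g. PIL images.
    """
    if not images:
        raise ValueError("at least one image is required")
    prompt = [question]
    for index, image in enumerate(images, start=1):
        prompt.append(f"Image {index}:")  # text label that precedes the image
        prompt.append(image)
    return prompt


# Placeholder strings stand in for PIL.Image objects so the sketch runs anywhere.
prompt = build_comparison_prompt(
    "Compare these phones. Which one has a notch?",
    ["<phone_v1.jpg>", "<phone_v2.jpg>"],
)
print(prompt)
```

In real use, the placeholders would be replaced by img1 and img2 from the example above.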
Video + Text Context
"Find the scene in the video that matches this script."
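# Assumes `model` and `video_file` (an already uploaded video) were created earlier.
script_text = "The hero enters the dark cave."
# Order matters: the video comes first, then the instruction, then the script line.
response = model.generate_content([video_file, "Timestamp where this happens:", script_text])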
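The call above relies on a `video_file` and a `model` that are not defined in this lesson. A rough sketch of how they might be created with the File API of the google-generativeai Python SDK follows. The helper names, the model name, and the five-second polling interval are all assumptions, and the SDK import is deferred so the block loads even where the package is not installed.

```python
import time


def build_scene_prompt(video_file, script_text):
    """Order the parts as in the lesson: video, instruction, then the script line."""
    return [video_file, "Timestamp where this happens:", script_text]


def find_scene(video_path, script_text, model_name="gemini-1.5-flash"):
    """Upload a video, wait for processing, and ask where the script line occurs.

    Assumes the google-generativeai package is installed and an API key
    is already configured; model_name is an assumed default.
    """
    import google.generativeai as genai  # deferred: only needed when called

    video_file = genai.upload_file(path=video_path)
    # Uploaded videos are processed asynchronously; poll until ready.
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = genai.get_file(video_file.name)
    if video_file.state.name == "FAILED":
        raise RuntimeError("video processing failed")

    model = genai.GenerativeModel(model_name)
    response = model.generate_content(build_scene_prompt(video_file, script_text))
    return response.text
```

Keeping the prompt construction in its own small function makes the ordering easy to check without calling the API.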
Interleaved Reasoning
Gemini processes the list in order.
[Text A, Image A, Text B, Image B]
This allows you to tell a story. "Here is the user's screen (Image A). They clicked this button (Text B). Now the screen looks like this (Image B). What happened?"
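The screen-by-screen story can be assembled programmatically. This is a minimal sketch under stated assumptions: `interleave` is a hypothetical helper, the button name is made up for illustration, and placeholder strings stand in for real screenshots.

```python
def interleave(steps):
    """Flatten (text, image) pairs into [text, image, text, image, ...].

    Hypothetical helper; `image` may be None for a text-only step.
    """
    parts = []
    for text, image in steps:
        parts.append(text)
        if image is not None:
            parts.append(image)
    return parts


# Placeholder strings stand in for real screenshots (e.g. PIL images).
steps = [
    ("Here is the user's screen.", "<screen_before.png>"),
    ("They clicked the Submit button. Now the screen looks like this.", "<screen_after.png>"),
    ("What happened?", None),
]
parts = interleave(steps)
print(parts)
```

The resulting list could be passed to generate_content once the placeholders are swapped for actual image objects.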
Summary
Think of the input list as a timeline. You can place any media object on that timeline, and the model reads it left-to-right.
In the next lesson, we discuss Multimodal Prompting.