Combining Modalities: Text + Image + Video

Mixing inputs is where multimodal prompting becomes powerful. This lesson shows how to build prompts that reference multiple images, text instructions, and video snippets in a single request.

You aren't limited to a single image. The prompt can be a list, so you can pass a sequence of text and media parts in one call.

Comparing Images

"Which product looks newer?"

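import PIL.Image

img1 = PIL.Image.open('phone_v1.jpg')
img2 = PIL.Image.open('phone_v2.jpg')

# Gemini understands the order: img1 is the first image, img2 is the second.
prompt = ["Compare these two phones. Which one has a notch?", img1, img2]
# model is assumed to be a Gemini GenerativeModel configured earlier.
response = model.generate_content(prompt)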
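
Because the model has to map "first" and "second" onto the images itself, it can help to label each image with a short text part. The helper below is a minimal sketch of that idea; label_images is a hypothetical name, not part of any SDK, and the placeholder strings stand in for real PIL images.

```python
from typing import Any, List, Sequence


def label_images(instruction: str, images: Sequence[Any]) -> List[Any]:
    """Build a prompt list: the instruction, then "Image N:" before each image."""
    parts: List[Any] = [instruction]
    for index, image in enumerate(images, start=1):
        parts.append(f"Image {index}:")  # text label that precedes the image
        parts.append(image)              # the media object itself
    return parts


# Placeholder strings stand in for PIL.Image objects in this sketch.
prompt = label_images(
    "Compare these two phones. Which one has a notch?",
    ["<phone_v1 image>", "<phone_v2 image>"],
)
```

The resulting list can be passed to generate_content in the same way as the hand-written prompt, and the answer can then refer to "Image 1" or "Image 2" by name.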

Video + Text Context

"Find the scene in the video that matches this script."

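# video_file is assumed to be a video uploaded earlier (for example, through the File API).
script_text = "The hero enters the dark cave."
response = model.generate_content([video_file, "Timestamp where this happens:", script_text])
print(response.text)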
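
The reply to a timestamp question comes back as free text, so you may want to turn it into numbers before seeking in the video. The sketch below assumes the reply contains timestamps written as MM:SS or HH:MM:SS; that format is an assumption (asking for it explicitly in the prompt makes it more likely), and timestamps_to_seconds is a hypothetical helper name.

```python
import re
from typing import List

# Matches MM:SS or HH:MM:SS; the hours group is optional.
_TIMESTAMP = re.compile(r"\b(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\b")


def timestamps_to_seconds(reply: str) -> List[int]:
    """Return every MM:SS or HH:MM:SS timestamp found in reply, in seconds."""
    results: List[int] = []
    for hours, minutes, seconds in _TIMESTAMP.findall(reply):
        # An unmatched hours group comes back as an empty string.
        total = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
        results.append(total)
    return results


# Hypothetical reply text, used only to illustrate the parsing step.
example_reply = "The hero enters the dark cave at 01:15."
offsets = timestamps_to_seconds(example_reply)
```

In real use, response.text would take the place of example_reply. An empty list means no timestamp was found in the expected format.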

Interleaved Reasoning

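Gemini processes the list in order, so text and images can alternate:

[Text A, Image A, Text B, Image B]

This allows you to tell a story across the parts. For example: "Here is the user's screen" (Text A, introducing Image A). "They clicked this button. Now the screen looks like this" (Text B, introducing Image B). Close with the question: "What happened?"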
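
An interleaved prompt like this can be assembled from a list of (text, media) steps. The following is a minimal sketch: build_interleaved_prompt is a hypothetical helper, and the placeholder strings stand in for screenshots loaded with PIL.Image.open.

```python
from typing import Any, Iterable, List, Optional, Tuple


def build_interleaved_prompt(
    steps: Iterable[Tuple[str, Optional[Any]]], question: str
) -> List[Any]:
    """Flatten (text, media) steps into one ordered list and append the question."""
    parts: List[Any] = []
    for text, media in steps:
        parts.append(text)
        if media is not None:  # a step may be text only
            parts.append(media)
    parts.append(question)
    return parts


# Placeholder strings stand in for real image objects in this sketch.
prompt = build_interleaved_prompt(
    [
        ("Here is the user's screen.", "<screenshot before>"),
        ("They clicked this button. Now the screen looks like this.", "<screenshot after>"),
    ],
    "What happened?",
)
```

Keeping each text part next to the image it describes preserves the order the model reads, and the final question arrives only after all the context.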

Summary

Think of the input list as a timeline. You can place any media object on that timeline, and the model reads it left-to-right.

In the next lesson, we discuss Multimodal Prompting.
