Text, Image, and Audio Inputs: Multimodality Basics

Gemini is more than a text-only LLM. This lesson shows how to pass images (via PIL), audio, and video files to the model using the Python SDK.

Text, Image, and Audio Inputs

Gemini accepts a list of content parts. This list can mix text and binary data.
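As a minimal sketch of what such a mixed list can look like, the helpers below read local files and pair their bytes with a guessed MIME type. The names `inline_part` and `build_contents` are hypothetical, and the `{"mime_type": ..., "data": ...}` dictionary shape is an assumption about how the SDK accepts inline binary data; check it against the SDK version you use.

```python
import mimetypes
from pathlib import Path


def inline_part(path):
    """Build an inline-data part from a local file (hypothetical helper).

    Assumes the SDK accepts dicts with 'mime_type' and 'data' keys.
    """
    mime_type, _ = mimetypes.guess_type(str(path))
    if mime_type is None:
        raise ValueError(f"Cannot infer MIME type for {path!r}")
    return {"mime_type": mime_type, "data": Path(path).read_bytes()}


def build_contents(prompt, *paths):
    """Return a content list: the text prompt followed by one part per file."""
    return [prompt, *(inline_part(p) for p in paths)]
```

The list returned by `build_contents("Describe this image:", "cat.jpg")` could then be passed to `model.generate_content(...)` in place of a hand-written list.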

Image Input (PIL)

A small local image can be loaded with PIL and passed directly in the content list, next to the text prompt.

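import google.generativeai as genai
import PIL.Image

# Assumes genai.configure(api_key=...) has already been called.
# The model name is an assumption; use any multimodal Gemini model.
model = genai.GenerativeModel("gemini-1.5-flash")

# Load a local image
img = PIL.Image.open('cat.jpg')

# Pass the image directly in the content list
response = model.generate_content(["Describe this image:", img])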

Audio/Video Input (File API)

Larger files such as video and audio are impractical to send as raw bytes inside the content list. Upload them with the File API first, then pass the returned file handle to the model.

import time

# Upload the file through the File API
video_file = genai.upload_file(path="webinar.mp4")

# Wait for processing (video takes time to ingest)
while video_file.state.name == "PROCESSING":
    time.sleep(2)
    video_file = genai.get_file(video_file.name)

# Stop if ingestion did not succeed
if video_file.state.name == "FAILED":
    raise ValueError("Video processing failed")

# Generate
response = model.generate_content([video_file, "Summarize the speakers' points."])
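The polling loop above runs forever if a file never leaves the `PROCESSING` state. One possible way to bound it is sketched below. The helper `wait_for_file` and its parameters are hypothetical; `refresh` stands in for `genai.get_file`, and the helper treats a `FAILED` state as an error on the assumption that the File API reports unsuccessful ingestion that way.

```python
import time


def wait_for_file(file, refresh, timeout=300.0, interval=2.0, sleep=time.sleep):
    """Poll until `file` leaves the PROCESSING state or `timeout` seconds pass.

    `refresh` takes the file's name and returns a fresh file object,
    for example genai.get_file. `sleep` is injectable to ease testing.
    """
    deadline = time.monotonic() + timeout
    while file.state.name == "PROCESSING":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{file.name} still processing after {timeout}s")
        sleep(interval)
        file = refresh(file.name)
    # Assumption: a failed ingestion is reported with the state name "FAILED".
    if file.state.name == "FAILED":
        raise RuntimeError(f"Processing failed for {file.name}")
    return file
```

With the names from the snippet above, the call would look like `video_file = wait_for_file(video_file, genai.get_file)`.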

Summary

  • Small images -> PIL objects.
  • Large audio/video -> upload_file() API.
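One way to make "small" and "large" concrete is a file-size check. This is only a sketch: `choose_input_route` is a hypothetical helper, and the 20 MB cutoff is an assumed value, so replace it with the request-size limit documented for the API version you use.

```python
import os

# Assumed cutoff for sending data inline; adjust to the documented limit.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def choose_input_route(path, limit=INLINE_LIMIT_BYTES):
    """Return 'inline' for files of at most `limit` bytes, else 'upload'."""
    return "inline" if os.path.getsize(path) <= limit else "upload"
```

A caller could pass the file directly in the content list when the result is `"inline"` and go through `genai.upload_file()` when it is `"upload"`.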

In the next lesson, we discuss Combining Modalities.