AI Powered Learning Portal

Text, Image, and Audio Inputs: Multimodality Basics

May 26, 2026

Gemini is more than a text-only LLM. Learn how to pass images (as PIL objects), audio, and video files into the model using the Python SDK.

Text, Image, and Audio Inputs

Gemini's generate_content() accepts a list of content parts. This list can mix text with binary data such as images, audio, and video.
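As a rough illustration of what such a mixed list can look like, the sketch below puts a text prompt first and then adds one part per file, read as raw bytes. The helper names `make_inline_part` and `build_parts` are hypothetical and not part of the SDK. The dictionary with `mime_type` and `data` keys is assumed to be the inline-data shape the `google.generativeai` package accepts, and the actual `generate_content` call is left as a comment because it needs a configured model.

```python
import mimetypes
from pathlib import Path


def make_inline_part(path):
    """Return a dict with a file's MIME type and raw bytes (hypothetical helper)."""
    file_path = Path(path)
    mime_type, _ = mimetypes.guess_type(file_path.name)
    if mime_type is None:
        raise ValueError(f"Cannot guess MIME type for {file_path.name!r}")
    return {"mime_type": mime_type, "data": file_path.read_bytes()}


def build_parts(prompt, *paths):
    """Build a content list: the text prompt first, then one inline part per file."""
    return [prompt] + [make_inline_part(p) for p in paths]


# Usage (requires a configured model, so it is not executed here):
# parts = build_parts("Describe this image:", "cat.jpg")
# response = model.generate_content(parts)
```

Inline bytes like this suit small files only; the sections below cover the more convenient PIL route for images and the upload route for large media.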

Image Input (PIL)

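For small images, the simplest route is also an efficient one: open the file with PIL and pass the resulting image object directly as a content part.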

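import google.generativeai as genai
import PIL.Image

# Assumes genai.configure(api_key=...) has already been called.
# The model name is an example; use whichever model earlier lessons set up.
model = genai.GenerativeModel("gemini-1.5-flash")

# Load a local image
img = PIL.Image.open('cat.jpg')

# Pass the PIL image directly in the content list
response = model.generate_content(["Describe this image:", img])
print(response.text)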

Audio/Video Input (File API)

For larger files such as video and audio, passing raw bytes inline in the content list is inefficient. Upload them through the File API first, then reference the uploaded file in the prompt.

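import time
import google.generativeai as genai

# Upload through the File API
video_file = genai.upload_file(path="webinar.mp4")

# Wait for processing (video takes time to ingest)
while video_file.state.name == "PROCESSING":
    time.sleep(2)
    video_file = genai.get_file(video_file.name)

# Stop early if ingestion did not succeed
if video_file.state.name == "FAILED":
    raise ValueError("Video processing failed")

# Generate (model is the same GenerativeModel used for the image example)
response = model.generate_content([video_file, "Summarize the speakers' points."])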
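The polling loop above has no upper bound on how long it waits. One possible way to add a timeout is sketched below. The function name `wait_until_processed` and its parameters are hypothetical, not part of the SDK. It takes any callable that returns the current state name, so it can be tried without network access, and it assumes the state names "PROCESSING" and "FAILED" as used by the File API.

```python
import time


def wait_until_processed(fetch_state, timeout_s=300.0, interval_s=2.0, sleep=time.sleep):
    """Poll fetch_state() until it stops returning "PROCESSING".

    fetch_state is any zero-argument callable returning the current state name.
    Raises TimeoutError if the state is still "PROCESSING" once timeout_s has
    elapsed, and RuntimeError if processing ends in "FAILED".
    """
    deadline = time.monotonic() + timeout_s
    state = fetch_state()
    while state == "PROCESSING":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still processing after {timeout_s} seconds")
        sleep(interval_s)
        state = fetch_state()
    if state == "FAILED":
        raise RuntimeError("File processing failed")
    return state


# Usage with the SDK calls from this lesson (not executed here):
# wait_until_processed(lambda: genai.get_file(video_file.name).state.name)
```

Passing `sleep` as a parameter is a design choice for this sketch: it lets a test substitute a no-op instead of actually waiting.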

Summary

Small images -> PIL objects passed directly in the content list.
Large audio/video -> upload_file(), wait for processing, then pass the returned file.
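The rule of thumb above could be written as a small dispatch helper. This is only a sketch: the function name `choose_route` is hypothetical, and the 4 MB cutoff for a "small" image is an illustrative assumption, not a documented limit.

```python
import mimetypes
from pathlib import Path

# Illustrative cutoff only (an assumption for this sketch, not an API limit).
SMALL_IMAGE_LIMIT_BYTES = 4 * 1024 * 1024


def choose_route(path, small_limit=SMALL_IMAGE_LIMIT_BYTES):
    """Return "pil" for small image files and "upload" for everything else."""
    file_path = Path(path)
    mime_type, _ = mimetypes.guess_type(file_path.name)
    size = file_path.stat().st_size
    if mime_type is not None and mime_type.startswith("image/") and size <= small_limit:
        return "pil"
    return "upload"
```

A caller would open the file with PIL.Image.open() when the result is "pil" and call genai.upload_file() when it is "upload".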

In the next lesson, we discuss Combining Modalities.
