Building Agents that See: Image Reasoning and Visual Analysis

Give your agents eyes. Master the native visual intelligence of Gemini to build agents that can interpret diagrams, extract data from photos, and reason about the physical world through image inputs.

In the previous modules, we focused on text-based agents. But humans experience the world through vision, and so should our agents. A customer support agent is far more effective if it can see a photo of a broken product. A financial agent is more accurate if it can read a complex chart rather than just a CSV of numbers.

Because Gemini is natively multimodal, we don't need a separate "Vision AI" service. We can send raw images directly to our agents. In this lesson, we will explore the technical mechanics of visual input, learn how to build "Seeing Agents," and examine the power of spatial reasoning.


1. The Multimodal Request Structure

In the Gemini ADK, every message sent to the model is composed of a list of Parts. Until now, we've only used Text parts. To give our agent eyes, we simply include an Image part in that same list.

Concept: The Unified Content List

Instead of: ["What is in this image?"]
We send: ["What is in this image?", image_data]

Gemini processes these two parts simultaneously, allowing the text and the pixels to interact in its reasoning engine.
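
A minimal sketch of that idea, using the google-generativeai Python SDK (the same library used in the implementation later in this lesson). Here router.jpg is a hypothetical local file, and the API key is assumed to live in the GOOGLE_API_KEY environment variable:

import os
import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

# Text-only request: the content list holds a single text part
text_only = model.generate_content(["Describe a typical home router."])

# Multimodal request: a text part and an image part in the same list
img = Image.open("router.jpg")  # hypothetical example image
multimodal = model.generate_content(["What is in this image?", img])
print(multimodal.text)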


2. Image Processing Techniques

There are two ways to provide images to a Gemini ADK agent.

A. Inline Data (Small Images)

For small .png, .jpg, or .webp files (under 4MB), you can encode them as Base64 and send them directly in the request. This is fast and requires no extra infrastructure.
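
A quick sketch of the inline approach, assuming a small local file named photo.png. When you pass raw bytes together with a MIME type, the SDK handles the encoding for you:

import os
import pathlib
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

# Inline image part: raw bytes plus a MIME type, sent directly in the request
image_part = {
    "mime_type": "image/png",
    "data": pathlib.Path("photo.png").read_bytes(),  # hypothetical small file
}

response = model.generate_content(["Describe this image in one sentence.", image_part])
print(response.text)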

B. File API (Large/Multiple Images)

For large datasets or high-resolution photos, use the Google File API. You upload the file once, get back a URI, and pass that URI to the agent. This avoids re-sending the same bytes with every request and is far more efficient for bandwidth, especially when the same image is referenced multiple times.
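
A sketch of the File API flow, assuming a large local file named site_survey.jpg. The upload returns a file handle whose URI can be reused across requests:

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Upload once; the returned handle carries a URI you can reference in later calls
uploaded = genai.upload_file("site_survey.jpg")  # hypothetical large image
print(uploaded.uri)

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    [uploaded, "List every piece of equipment visible in this photo."]
)
print(response.text)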


3. Visual Reasoning Categories

When you build a visual agent, it performs three primary types of reasoning:

  1. Object Recognition: "What is this?" (e.g., Identifying a specific brand of router).
  2. Contextual Analysis: "What is happening here?" (e.g., Observing that the "Power" light on the router is red).
  3. Instruction Following via Vision: "Read the serial number from this sticker and use the 'check_warranty' tool."

For example, here is the flow for a visual support scenario (Mermaid source); a code sketch of the same flow follows the diagram:

graph LR
    A[Image Input] --> B[Gemini Vision Engine]
    B --> C[OCR: Extracts Serial #]
    B --> D[Visual: Sees Red Light]
    C --> E[Agent Reasoner]
    D --> E
    E --> F[Tool: check_warranty]
    F --> G[Resolution: Device is under warranty. Contact support.]
    
    style B fill:#4285F4,color:#fff
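
A sketch of that flow with the same SDK. Here check_warranty is a hypothetical stand-in tool and router_photo.jpg a hypothetical image; automatic function calling lets the model call the tool with the serial number it reads from the picture:

import os
import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

def check_warranty(serial_number: str) -> dict:
    """Hypothetical tool: look up warranty status for a device serial number."""
    return {"serial_number": serial_number, "under_warranty": True}

# Register the tool and let the SDK handle the function-calling round trip
model = genai.GenerativeModel("gemini-1.5-flash", tools=[check_warranty])
chat = model.start_chat(enable_automatic_function_calling=True)

img = Image.open("router_photo.jpg")  # hypothetical photo of the device label
response = chat.send_message([
    "Read the serial number from the sticker in this photo, "
    "then use check_warranty to see if the device is covered.",
    img,
])
print(response.text)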

4. Spatial Reasoning and Coordinates

One of Gemini's most powerful visual features is Bounding Boxes. You can ask the agent to find an object and return its location in the image.

Example Prompt: "Find the dog in this photo and provide its coordinates in [ymin, xmin, ymax, xmax] format."

  • Agent Use Case: Build a web-scraping agent that "sees" a screenshot of a website and identifies exactly where the "checkout" button is located so it can click it.
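
A sketch of the coordinate step, using a hypothetical dog_photo.jpg. Gemini typically returns box values normalized to a 0-1000 scale, so they need to be rescaled to pixels:

import json
import os
import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

img = Image.open("dog_photo.jpg")  # hypothetical example image
response = model.generate_content(
    [
        "Find the dog in this photo and return only a JSON array "
        "of its coordinates in [ymin, xmin, ymax, xmax] format.",
        img,
    ],
    generation_config={"response_mime_type": "application/json"},  # ask for bare JSON
)

# Values are typically normalized to 0-1000; convert them to pixel coordinates
ymin, xmin, ymax, xmax = json.loads(response.text)
width, height = img.size
box = (xmin / 1000 * width, ymin / 1000 * height,
       xmax / 1000 * width, ymax / 1000 * height)
print("Pixel box (left, top, right, bottom):", box)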

5. Implementation: The "Visual Support" Agent

Let's build a simple script that acts as a technical support agent capable of reading a screenshot.

import os
import google.generativeai as genai
from PIL import Image

# 0. Configure the API key (assumed to live in the GOOGLE_API_KEY environment variable)
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# 1. Load the Image
img = Image.open('dashboard_error.png')

# 2. Setup Agent
model = genai.GenerativeModel('gemini-1.5-flash')

# 3. Multimodal Inference: instructions and the screenshot in one content list
prompt = [
    "You are a Cloud Support Assistant. Analyze this screenshot of the dashboard.",
    "1. What is the error message shown in red?",
    "2. Propose a 3-step fix based on the settings visible in the image.",
    img,
]

response = model.generate_content(prompt)

print("--- AGENT ANALYSIS ---")
print(response.text)

6. Tips for High-Performance Visual Agents

  1. Resolution Matters: If you need the agent to read fine print (like a serial number), don't downscale the image too much.
  2. Prompt the Vision: If you just send an image, Gemini will summarize it. If you want a specific detail, point the agent's "attention" there. "Look at the bottom-right corner of the circuit board. Is there any evidence of a burn mark?"
  3. Combined Modalities: You can send multiple images in a single request, as sketched below. "Here is the product manual and here is a photo of the product. Is the user holding it correctly?"
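
A sketch of a multi-image request, with manual_page.png and user_photo.jpg as hypothetical local files:

import os
import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

manual_page = Image.open("manual_page.png")  # hypothetical manual excerpt
user_photo = Image.open("user_photo.jpg")    # hypothetical photo of the product in use

# Both images travel in one content list, interleaved with the instructions
response = model.generate_content([
    "The first image is a page from the product manual; the second is a customer photo.",
    manual_page,
    user_photo,
    "Is the user holding the product correctly? Point out any differences.",
])
print(response.text)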

7. Limitations of Vision

  • Fine Detail: Gemini can struggle with extremely dense text (e.g., an entire bank statement crammed into one small image). In those cases, use PDF parsing (coming in the next module).
  • Optical Illusions: Like humans, models can sometimes be tricked by perspective or lighting.
  • Speed: Processing pixels takes more compute than processing text. Expect a slight increase in latency for visual requests.

8. Summary and Exercises

Visual agents bridge the gap between digital reasoning and the physical world.

  • Native Multimodality allows for unified reasoning of text and images.
  • Part-based requests are the technical mechanism for sending images.
  • Visual Reasoning covers identification, context, and instruction following.
  • Spatial Reasoning allows agents to locate objects within a 2D space.

Exercises

  1. Document Analysis: Take a photo of a receipt or a utility bill. Ask Gemini to: "Extract the Date, the Total Amount, and the Vendor Name into a JSON object." (A starter sketch follows this list.)
  2. Comparison Flow: Send two images of your desk (one tidy, one messy). Ask Gemini: "What are the 3 biggest differences between these two photos?"
  3. Logic from Vision: Draw a simple flowchart by hand on a piece of paper. Take a photo and ask Gemini: "Convert this flowchart into Mermaid.js diagram code." (This is a classic 'Developer' agent task.)
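
If you want a starting point for the first exercise, here is a minimal sketch, assuming your photo is saved as receipt.jpg. Requesting a JSON response via response_mime_type keeps the output machine-readable:

import os
import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

receipt = Image.open("receipt.jpg")  # your own photo of a receipt or bill
response = model.generate_content(
    ["Extract the Date, the Total Amount, and the Vendor Name into a JSON object.", receipt],
    generation_config={"response_mime_type": "application/json"},  # force JSON output
)
print(response.text)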

In the next lesson, we will listen in as we explore Building Agents that Hear through audio processing.
