Embeddings, Text, and Multimodal Capabilities

To build advanced AI applications, such as Retrieval-Augmented Generation (RAG) or semantic search, you need to understand embeddings.

What is an Embedding?

Computers cannot understand the word "King" or the image of a "Crown." They only understand numbers.

An Embedding is a list of floating-point numbers (a vector) that represents the meaning of a piece of content.

  • The embedding for "Dog" will be mathematically close to the embedding for "Puppy."
  • The embedding for "Dog" will be far away from "Sandwich."
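"Close" and "far" are usually measured with cosine similarity. The vectors below are made-up 3-D stand-ins (real Gemini embeddings have hundreds of dimensions), but they illustrate the geometry:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: ~1.0 = same meaning, ~0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors for illustration only -- not real embeddings
dog      = [0.90, 0.80, 0.10]
puppy    = [0.85, 0.75, 0.15]
sandwich = [0.10, 0.20, 0.90]

print(cosine_similarity(dog, puppy))     # close to 1.0
print(cosine_similarity(dog, sandwich))  # much lower
```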

Gemini offers powerful Text Embeddings (text-embedding-004) and, via Vertex AI, Multimodal Embeddings (multimodalembedding@001).

Visualizing Vector Space

Imagine a 3D graph (in reality, it has hundreds of dimensions).

  • Concept Cluster: All "Fruit" words (Apple, Banana, Orange) are clustered together in one corner.
  • Concept Cluster: All "Tech" words (Computer, Server, AI) are in another corner.

When we search, we turn the user's query ("Red luscious fruit") into a vector and find the nearest data points (vectors) in that space.

Multimodal Embeddings: The Magic

Because Gemini is multimodal, it can map images and text to the same vector space.

If you embed:

  1. The Word: "Lion"
  2. An Image: A photo of a Lion in the Savannah.

Their vectors will be close together! This allows you to build Text-to-Image Search. You can search your photo library for "Lion" without ever tagging the photos, because the meaning of the image vector matches the meaning of the text vector.
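Here is a sketch of what that search looks like once the embeddings exist. The image vectors and the query vector below are hypothetical placeholders; in a real app you would generate them with the multimodal embedding model:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical precomputed image embeddings: filename -> vector
photo_library = {
    "IMG_001.jpg": [0.91, 0.05, 0.12],  # pretend: a lion photo
    "IMG_002.jpg": [0.10, 0.88, 0.20],  # pretend: a beach photo
    "IMG_003.jpg": [0.85, 0.10, 0.18],  # pretend: another lion photo
}

# Hypothetical text embedding for the query "Lion" (same shared space)
query_vector = [0.90, 0.08, 0.15]

# Rank photos by similarity to the text query -- no manual tagging needed
ranked = sorted(photo_library.items(),
                key=lambda kv: cosine_similarity(query_vector, kv[1]),
                reverse=True)
print([name for name, _ in ranked])
```

Because the query and the images live in one vector space, the two lion photos rank above the beach photo.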

Code Example: Generating Embeddings

import google.generativeai as genai

# Assumes genai.configure(api_key=...) has already been called.

# 1. Text Embedding
result = genai.embed_content(
    model="models/text-embedding-004",
    content="The quick brown fox jumps over the lazy dog",
    task_type="retrieval_document",
)

# Print the first 5 of the vector's 768 dimensions
print(result["embedding"][:5])
# Output: [0.012, -0.045, 0.098, ...]

Task Types

Notice the task_type parameter. Gemini embeddings are optimized for specific goals:

  • retrieval_query: for embedding the user's question at search time.
  • retrieval_document: for embedding the documents you store in your index.
  • classification: for generating features to train a classifier.

Use Case: Semantic Search

  1. Ingest: Take 1,000 support tickets. Generate embeddings for each. Store in a Vector DB (like Pinecone or Chroma).
  2. Query: User asks "How do I reset my password?".
  3. Embed Query: Generate embedding for the question.
  4. Search: Find the tickets with the closest cosine similarity.
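The four steps above can be sketched end to end. A minimal in-memory stand-in for the vector DB, with the embedding function injected so the pipeline runs offline; in production you would plug in a genai.embed_content call (retrieval_document at ingest, retrieval_query at search) and a real vector store:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticIndex:
    """Minimal in-memory stand-in for a vector DB like Pinecone or Chroma."""

    def __init__(self, embed):
        self.embed = embed    # function: (text, task_type) -> vector
        self.entries = []     # list of (text, vector) pairs

    def ingest(self, documents):
        # Step 1: embed and store each document
        for doc in documents:
            self.entries.append((doc, self.embed(doc, "retrieval_document")))

    def search(self, query, top_k=3):
        # Steps 2-4: embed the query, rank stored vectors by cosine similarity
        qvec = self.embed(query, "retrieval_query")
        scored = sorted(self.entries,
                        key=lambda e: cosine_similarity(qvec, e[1]),
                        reverse=True)
        return [text for text, _ in scored[:top_k]]

# In production, the injected embedder might look like this (hedged sketch):
# def gemini_embed(text, task_type):
#     return genai.embed_content(model="models/text-embedding-004",
#                                content=text, task_type=task_type)["embedding"]
```

Injecting the embedder keeps the ranking logic testable without API calls, and swapping the list for a real vector DB changes only ingest and search internals.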

Summary

Embeddings are the bridge between human concepts and machine math. Gemini's ability to embed both text and images into a shared space unlocks powerful search, recommendation, and clustering applications.

In the next lesson, we will uncover how Gemini counts and reads data: Tokenization.
