
Embedding and Indexing Layer
Convert preprocessed content into vector embeddings and store them efficiently for retrieval.
This layer converts content into searchable vector representations.
Embedding Process
graph TD
A[Clean Content] --> B{Content Type}
B -->|Text| C[Text Embedding Model]
B -->|Image| D[Vision Embedding Model]
B -->|Both| E[Multimodal Embedding]
C & D & E --> F[Vector Embeddings]
F --> G[Vector Database]
G --> H[Indexed & Searchable]
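The routing step in the diagram can be sketched as a simple dispatcher. This is a minimal illustration: the `embed_text` and `embed_image` helpers are hypothetical stand-ins for calls to real embedding models, returning placeholder vectors.

```python
# Minimal sketch of the content-type routing above; the embed_* helpers
# are hypothetical stand-ins for calls to real embedding models.
def embed_text(text):
    return [float(len(text)), 0.0]  # placeholder vector

def embed_image(image_bytes):
    return [0.0, float(len(image_bytes))]  # placeholder vector

def embed(content, content_type):
    """Route content to the appropriate embedding model by type."""
    if content_type == "text":
        return embed_text(content)
    if content_type == "image":
        return embed_image(content)
    if content_type == "both":
        # Multimodal path: here, naive concatenation of both modalities
        return embed_text(content["text"]) + embed_image(content["image"])
    raise ValueError(f"Unknown content type: {content_type}")
```

In production the branches would call real text, vision, and multimodal models; the dispatch structure stays the same.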
Embedding Models
Text Embeddings
# OpenAI text-embedding-ada-002 (openai >= 1.0 client)
from openai import OpenAI
client = OpenAI()
resp = client.embeddings.create(model="text-embedding-ada-002", input="This is a document about cats")
embedding = resp.data[0].embedding
# Returns: [0.023, -0.104, 0.445, ...] (1536 dimensions)

# Cohere Embed v3 (input_type is required for v3 models)
import cohere
co = cohere.Client()
embedding = co.embed(texts=["Document text"], model="embed-english-v3.0", input_type="search_document").embeddings[0]
Image Embeddings
# CLIP (openai/CLIP package): load the model, then encode each modality
import clip, torch
model, preprocess = clip.load("ViT-B/32")
image_emb = model.encode_image(preprocess(image).unsqueeze(0))
text_emb = model.encode_text(clip.tokenize(["a photo of a cat"]))
# Both live in the same vector space, so they can be compared directly!
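Because image and text embeddings share one vector space, "how well does this caption match this image?" reduces to a similarity score. A dependency-free cosine similarity sketch (the example vectors are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# A text embedding pointing in nearly the same direction as the
# image embedding scores close to 1.0 (toy 3-d vectors, not real CLIP output).
image_emb = [0.9, 0.1, 0.4]
text_emb = [0.8, 0.2, 0.5]
score = cosine_similarity(image_emb, text_emb)
```

Real systems typically L2-normalize embeddings once at index time, after which cosine similarity is just a dot product.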
Indexing Strategy
# Conceptual vector DB indexing (Chroma)
import chromadb
client = chromadb.Client()  # in-memory client; use PersistentClient for disk
collection = client.create_collection("documents")
collection.add(
embeddings=[embedding_vector],
documents=[text_content],
metadatas=[{'source': 'doc.pdf', 'page': 1}],
ids=['doc-1-page-1']
)
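The same add-then-query pattern can be shown without a database dependency. Below is a toy in-memory index (a hypothetical sketch, not Chroma's internals) that stores vectors with documents and metadata and returns the nearest matches by cosine similarity:

```python
import math

class TinyVectorIndex:
    """Toy in-memory index mirroring the add/query shape of a vector DB."""

    def __init__(self):
        self.rows = []  # (id, vector, document, metadata)

    def add(self, ids, embeddings, documents, metadatas):
        self.rows.extend(zip(ids, embeddings, documents, metadatas))

    def query(self, query_embedding, n_results=3):
        def sim(v):
            dot = sum(x * y for x, y in zip(query_embedding, v))
            return dot / (math.sqrt(sum(x * x for x in query_embedding))
                          * math.sqrt(sum(x * x for x in v)))
        # Brute-force scan; real vector DBs use ANN indexes (e.g. HNSW) instead
        ranked = sorted(self.rows, key=lambda r: sim(r[1]), reverse=True)
        return ranked[:n_results]

index = TinyVectorIndex()
index.add(ids=["doc-1"], embeddings=[[0.1, 0.9]], documents=["about cats"], metadatas=[{"page": 1}])
index.add(ids=["doc-2"], embeddings=[[0.9, 0.1]], documents=["about cars"], metadatas=[{"page": 2}])
top = index.query([0.2, 0.8], n_results=1)  # nearest neighbor of the query vector
```

The brute-force scan is O(n) per query; the point of a real vector database is to replace it with an approximate nearest-neighbor index while keeping this same interface.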
Index Optimization
- Chunking: Split large documents into 500-1000 token chunks so each embedding covers a focused span
- Overlap: Keep 10-20% overlap between adjacent chunks so context is not lost at chunk boundaries
- Metadata: Store source, page, and section with each chunk to enable filtered queries
- IDs: Assign stable unique identifiers so chunks can be updated or deleted later
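The chunking and overlap guidance above can be sketched as follows. For simplicity this splits on a pre-tokenized list (whitespace words stand in for real model tokens, which is an approximation):

```python
def chunk_tokens(tokens, chunk_size=500, overlap=75):
    """Split a token list into fixed-size chunks with overlapping boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance less than chunk_size to create overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk already reached the end of the document
    return chunks

# Tiny demonstration with chunk_size=4 and overlap=1:
tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
# chunks -> [['the', 'quick', 'brown', 'fox'],
#            ['fox', 'jumps', 'over', 'the'],
#            ['the', 'lazy', 'dog']]
```

Each chunk would then be embedded and added to the collection with its own ID and metadata (e.g. source and chunk index).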
Next: Retrieval and ranking.