
Pinecone Index Configuration: Optimizing the Schema for Search
Learn how to tune your Pinecone index settings. Explore metadata indexing configurations, choosing the right distance metric, and the impact of pod types on performance.
Pinecone Index Configuration
When you create an index in Pinecone, you aren't just giving it a name. You are defining the Physical and Logical Constraints of your search system. In a managed environment, many settings are hidden, but the ones you can control are critical for performance and cost.
In this lesson, we will go deep into index configuration. We will learn which distance metric to choose for which AI model, how to use Metadata Config to save on storage, and how to scale your index using the Pinecone API.
1. Choosing the Right Distance Metric
Pinecone supports three metrics: cosine, dotproduct, and euclidean. As we learned in Module 2, the choice depends entirely on your Embedding Model.
- cosine (Recommended for most text models): Use this for OpenAI (text-embedding-3), Cohere, and HuggingFace models. It measures the direction of the vector and is robust to varying document lengths.
- dotproduct (For Maximum Performance): If your model outputs "normalized" vectors (length of 1.0), use dot product. It is mathematically simpler and slightly faster for the Pinecone engine to process.
- euclidean: Use this for non-text vectors like image features (CLIP) or raw scientific data, where the physical distance between points is meaningful.
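The metric is part of the index definition and cannot be changed after creation. A minimal sketch with the Python client (index name and region are placeholders):

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# The metric is fixed once the index exists.
pc.create_index(
    name="text-search",    # placeholder name
    dimension=1536,        # must match your embedding model
    metric="cosine",       # or "dotproduct" / "euclidean"
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)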
2. Dimensionality: The Point of No Return
Every Pinecone index has a fixed dimension.
- OpenAI: 1536
- Cohere: 1024 or 4096
- Llama 3 Embed: 4096
Engineering Warning: If you realize halfway through your project that you want to switch from a 1536D model to a 1024D model, you cannot just "adjust" the settings. You must:
- Delete the index (or create a new one).
- Re-embed every single document in your source database.
- UPSERT millions of vectors again.
Lesson: Finalize your model choice before you start a bulk ingestion.
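One way to avoid a painful surprise is to verify the index dimension before a bulk run. A small sketch using describe_index (the index name is a placeholder):

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

# Fail fast if the index dimension doesn't match the embedding model.
description = pc.describe_index("my-index")
assert description.dimension == 1536, (
    f"Index expects {description.dimension}-dimensional vectors"
)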
3. Metadata Configuration (Selective Indexing)
By default, Pinecone indexes every metadata field you provide. If you have 10 fields (e.g., author, date, category, raw_text), Pinecone builds 10 separate metadata indexes.
This is convenient but expensive. Each metadata index consumes storage and can slow down your queries.
The Solution: metadata_config
In your index configuration, you can specify exactly which fields you want to be "Filterable."
# Conceptual configuration during creation. Selective metadata indexing
# is a pod-based feature, so this example uses PodSpec, not ServerlessSpec.
from pinecone import PodSpec

pc.create_index(
    name="optimized-index",  # index names must be lowercase, with hyphens
    dimension=1536,
    metric="cosine",
    spec=PodSpec(
        environment="us-east-1-aws",
        # Only index these two fields for filtering.
        # Other metadata fields will be STORED but not INDEXED.
        metadata_config={"indexed": ["user_id", "is_published"]},
    ),
)
Rule of Thumb: Don't index large text blobs (like the actual content of a page) in your metadata. Keep those fields unindexed so they are still returned in search results, but don't waste memory trying to "filter" by them.
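At query time, only the indexed fields can appear in a filter; everything else still comes back in the result metadata. A sketch, assuming query_embedding already holds the embedded user query:

# Filter on the indexed fields; unindexed fields are returned in
# metadata but cannot be used inside the filter expression.
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"user_id": {"$eq": "u-123"}, "is_published": {"$eq": True}},
    include_metadata=True,
)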
4. Upsert Limits and Batching
When configuring your ingestion script, you need to understand Pinecone's Upsert Pipeline.
- You cannot send 1 million vectors in a single API call.
- The standard batch size is 100 to 200 vectors per request.
Optimal Ingestion Pattern:
def batch_upsert(index, data, batch_size=100):
    # Slice the payload into API-sized chunks and send them sequentially.
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        index.upsert(vectors=batch)
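A possible call site, where embed() and chunks stand in for your own embedding function and source texts:

# (id, values, metadata) tuples are one of the formats upsert accepts.
vectors = [
    (f"doc-{i}", embed(chunk), {"source_id": str(i)})  # embed() is hypothetical
    for i, chunk in enumerate(chunks)
]
batch_upsert(index, vectors)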
5. Scaling Pods (The Multi-Pod Strategy)
If you are using pod-based indexes, you can scale them on the fly without downtime.
- Scaling Horizontally (More Shards): Used when you run out of space for vectors.
- Scaling Vertically (Higher Pod Type): Used when search latency is too high.
- Scaling Replicas: Used when you have too many simultaneous users (high QPS).
In the Pinecone Dashboard or API, you can increase replicas instantly:
pc.configure_index("my-index", replicas=5)
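The same call covers vertical scaling, for example stepping up to a larger pod size within the same family (values shown are illustrative):

# Vertical scaling: move to a bigger pod size (e.g., p1.x1 -> p1.x2).
pc.configure_index("my-index", pod_type="p1.x2")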
6. The "Source ID" Pattern
Pinecone does not store your "Raw Files." It stores Vectors. A common configuration mistake is trying to cram 20MB of text into the Pinecone metadata, which is capped at roughly 40 KB per vector.
The Best Practice:
Store a source_id (like a database UUID or a S3 URL) in the metadata. When your search returns the source_id, your application fetches the full content from your primary database (SQL/NoSQL). This keeps your Pinecone index lean, fast, and cheap.
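A sketch of the pattern at query time, where fetch_from_primary_db is a hypothetical lookup against your own database:

results = index.query(vector=query_embedding, top_k=5, include_metadata=True)

for match in results.matches:
    doc_id = match.metadata["source_id"]
    # Hypothetical helper: fetch the full document from SQL/NoSQL by ID.
    full_text = fetch_from_primary_db(doc_id)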
Summary and Key Takeaways
Index configuration is the bridge between your code and the cloud hardware.
- Pick the Metric early: usually cosine for text.
- Dimension is Locked: choose your embedding model carefully.
- Selective Metadata Indexing is the best way to save money and improve speed.
- Batch your Upserts: aim for batches of 100 for maximum reliability.
- Reference, don't Duplicate: store IDs to your main DB in metadata, not giant text blobs.
In the next lesson, we will look at Namespaces and Metadata Filtering, the two primary ways to organize data within a single Pinecone index.
Exercise: Schema Review
You are building an "E-mail Search AI." You have 10 million emails with the following metadata fields:
- sender_email
- subject
- email_body (full text)
- received_at (timestamp)
- is_spam (Boolean)

Questions:
- Which distance metric would you choose?
- Which metadata fields would you include in metadata_config["indexed"]?
- Should you store the email_body in the Pinecone metadata? What is the alternative?