Horizontal Scaling: Growing with your Data

Learn how to scale your vector database across multiple machines. Master the principles of distributed search and load balancing.

Eventually, your vector database will outgrow a single server. Whether you hit a CPU bottleneck (too many queries) or a RAM bottleneck (too many vectors), you need to Scale Horizontally. This means adding more machines to your cluster rather than just making one machine bigger.

In this lesson, we explore the architecture of distributed vector search.


1. Scaling the Search (Throughput)

If your database can handle 100 queries per second (QPS) but you have 1,000 users, you need Replicas.

  • The Concept: Copy the entire index to multiple servers.
  • The Load Balancer: Distributes incoming queries across these copies.
  • Result: You can serve more traffic roughly in proportion to the number of "Read" servers you add (see the sketch below).
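
To make this concrete, here is a minimal sketch of round-robin load balancing across replicas in Python. The replica endpoints are hypothetical placeholders, and route_query only returns the routing decision rather than issuing a real network call.

```python
import itertools

# Hypothetical replica endpoints -- each one holds a full copy of the index.
REPLICAS = [
    "http://replica-1:6333",
    "http://replica-2:6333",
    "http://replica-3:6333",
]

# Round-robin cycle: each incoming query goes to the next replica in turn.
_replica_cycle = itertools.cycle(REPLICAS)

def route_query(query_vector):
    """Pick the next replica for this query."""
    endpoint = next(_replica_cycle)
    # A real load balancer would forward the query to this endpoint over
    # HTTP/gRPC; here we just return the routing decision.
    return endpoint, query_vector
```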

2. Scaling the Storage (Capacity)

If your index is 500GB but your server only has 256GB of RAM, replicas won't help. You need Sharding (or Partitioning).

  • The Concept: Split the 500GB index into four 125GB "Shards."
  • The Cluster: Store each shard on a different machine.
  • Result: You can store billions of vectors by spreading the data across a cluster.
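
Here is a minimal sketch of how a coordinator might decide which shard owns a given vector, using a stable hash of the vector's ID. The shard count and document IDs are illustrative and not tied to any particular database.

```python
import hashlib

NUM_SHARDS = 4  # e.g. four 125GB shards for a 500GB index

def shard_for(vector_id: str) -> int:
    """Map a vector ID to a shard with a stable hash.

    A stable hash (not Python's built-in hash(), which is salted per
    process) guarantees every node agrees on where a vector lives.
    """
    digest = hashlib.md5(vector_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Example: spread a few document IDs across the cluster.
for doc_id in ["doc-1", "doc-2", "doc-3", "doc-4"]:
    print(doc_id, "-> shard", shard_for(doc_id))
```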

3. The "Scatter-Gather" Pattern

In a horizontally scaled system, a single query follows this flow:

  1. Scatter: The "Coordinator" node sends the query to every shard in the cluster.
  2. Local Search: Each shard finds its own local Top-K results.
  3. Gather: The Coordinator collects the results from everyone.
  4. Merge: The Coordinator re-sorts the results and sends the global Top-K back to the user.

```mermaid
graph TD
    U[User Query] --> C[Coordinator Node]
    C --> S1[Shard A]
    C --> S2[Shard B]
    C --> S3[Shard C]
    S1 --> G[Merge & Top-K]
    S2 --> G
    S3 --> G
    G --> R[Final Result]
```
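
In code, the coordinator's side of scatter-gather can be sketched as below. It assumes each shard client exposes a search(query, k) method returning (score, id) pairs where higher scores rank better; real clients differ, and a real coordinator would issue the shard calls in parallel.

```python
import heapq

def scatter_gather(query_vector, shards, k=5):
    """Coordinator logic: query every shard, then merge the local Top-K lists."""
    # Scatter + local search: each shard returns its own Top-K candidates.
    local_results = [shard.search(query_vector, k) for shard in shards]

    # Gather + merge: keep the global Top-K across all shards' candidates.
    all_hits = (hit for hits in local_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])
```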

4. Managed vs. Self-Hosted Scaling

Scaling a vector database is complex.

  • Managed services (Pinecone, managed Milvus): Scaling is usually "Automatic." You move a slider or change your "Pod Count," and the database handles the sharding and replication for you.
  • Elasticsearch/OpenSearch: You must configure the index's number_of_shards and number_of_replicas settings yourself (see the sketch below).
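
As a rough illustration of the self-hosted case, the sketch below creates an index with explicit shard and replica counts via the official Elasticsearch Python client. It assumes an Elasticsearch 8.x cluster at localhost:9200; the index name, vector dimensions, and similarity metric are placeholder choices.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local 8.x cluster

es.indices.create(
    index="my-vectors",                # placeholder index name
    settings={
        "number_of_shards": 4,         # fixed at index creation -- plan ahead
        "number_of_replicas": 1,       # can be raised later for read capacity
    },
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 768,           # must match your embedding model
                "index": True,
                "similarity": "cosine",
            }
        }
    },
)
```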

5. Summary and Key Takeaways

  1. Replicas = Throughput: Use replicas to scale the number of simultaneous queries you can serve.
  2. Shards = Capacity: Use sharding to scale the total number of vectors you can store.
  3. Overhead: Horizontal scaling adds network latency, because the coordinator must wait for the slowest shard before it can merge results.
  4. Consistency: Ensure the same embedding model is used on every node in the cluster; otherwise query vectors and stored vectors won't be comparable.

In the next lesson, we’ll dive deeper into the mechanics of Sharding Strategies.


Congratulations on completing Module 15 Lesson 1! You are now thinking in distributed systems.
