Database Clusters: Scaling your Data

You have 10 web servers, and they are all perfectly synced. But they all talk to a Single Database Server.

This is a Disaster waiting to happen.

If your database server crashes, your entire web cluster goes dark. Furthermore, if your website becomes famous, one database server won't be fast enough to handle 10,000 visitors at once. In this lesson, we will learn how to build a Database Cluster.

1. Primary-Replica (Master-Slave) Architecture

This is the most common setup for 90% of the world's websites.

The Primary: Every "Write" (New user, new comment) goes here.
The Replica: Every "Read" (Viewing a post, searching) goes here.

How it works:

The Primary server keeps a "Diary" of every change it makes (called the Binary Log or Binlog).
The Replica "Shadows" the Primary, constantly reading the diary and making the exact same changes to its own local data.

graph TD
    App[Web Application] -- Writes --> M[Primary DB]
    App -- Reads --> R1[Replica DB 1]
    App -- Reads --> R2[Replica DB 2]
    M -- Binlog Stream --> R1
    M -- Binlog Stream --> R2

2. Multi-Primary (Master-Master) Architecture

This is for high-end applications where you need to be able to write to Any server in the cluster.

Used by: Galera Cluster (MariaDB) and Vitess.
Benefit: If Server A dies, you can just keep writing to Server B without changing anything.
Drawback: Massive network complexity. If two people update the same record on two different servers at the same time, the "Conflict Resolution" is very difficult.

3. The "Read-Write Split" Logic

How does your code know to send "SELECT" queries to the Replica and "INSERT" queries to the Primary?

In the Code: You manually define two database connections in your app.
In the Load Balancer: You use a tool like ProxySQL or pgpool-II. The app thinks it's talking to one database, but the proxy reads the SQL:
- Does the SQL start with "SELECT"? -> Send to Replica.
- Does it start with "UPDATE"? -> Send to Primary.

4. Practical: Configuring a MySQL Replica

On Primary: Enable the binlog and set a server-id=1.
On Replica: Set server-id=2.
The Handshake: Give the Replica a username and password with "REPLICATION SLAVE" permissions.
The Start: Run CHANGE MASTER TO... on the replica and point it at the Primary's current "Diary Position."

5. Identifying Replication Lag

Sometimes, the Primary is doing so much work that the Replica can't keep up. The Replica might be 5 or 10 minutes "behind" the Primary. To the user, it looks like their new post has "Disappeared" right after they saved it!

-- In MySQL, check if your replica is healthy
SHOW REPLICA STATUS\G
-- Look for 'Seconds_Behind_Master: 0'

6. Example: A Replication Watchdog (Python)

You need to be alerted the moment your database mirror stops working. Here is a Python script that connects to your replica and alerts you if the "Lag" becomes dangerous.

import mysql.connector

def check_db_health(host, user, password):
    """
    Checks for replication health and lag.
    """
    print(f"--- DB Health Audit: {host} ---")
    
    try:
        db = mysql.connector.connect(host=host, user=user, password=password)
        cursor = db.cursor(dictionary=True)
        
        cursor.execute("SHOW REPLICA STATUS")
        status = cursor.fetchone()
        
        if not status:
            print("[!!!] ALERT: This server is not configured as a replica!")
            return

        lag = status['Seconds_Behind_Master']
        io_running = status['Replica_IO_Running']
        sql_running = status['Replica_SQL_Running']
        
        if io_running == "Yes" and sql_running == "Yes":
            print(f"[OK] Replication is running. Current Lag: {lag} seconds.")
            if int(lag) > 30:
                print("[WA] Warning: High replication lag detected!")
        else:
            print("[!!!] CRITICAL: REPLICATION HAS STOPPED!")
            
    except Exception as e:
        print(f"Could not connect to database: {e}")

if __name__ == "__main__":
    # check_db_health("10.0.0.50", "admin", "secret")
    pass

7. Professional Tip: Use 'GTID'

Older database systems used "File Positions" (e.g., "Replica, look at file log.001 at line 500"). If a crash happened, it was a nightmare to find the right spot again. Modern databases use GTID (Global Transaction ID). Every change has a unique ID, making it much easier to recover a failed cluster.

8. Summary

Database replication is the ultimate protection for your data.

Primary-Replica scales your "Read" performance.
Multi-Primary scales your "Write" performance and availability.
Replication Lag is your #1 enemy.
The Binary Log is the "Source of Truth" for mirroring.
ProxySQL is the brain that manages the traffic split.

In the final lesson of this module, we will learn how to handle the "Outer Layer" of HA: Dynamic DNS and Geo-Routing.

Quiz Questions

Why can't you just point your web app's "Write" queries to a Replica server?
What is "Replication Lag" and what causes it to increase?
What is the benefit of a "Read-Write Split" for a high-traffic website?

Continue to Lesson 6: Global HA—Dynamic DNS and Geo-Routing.

The Distributed Brain: Database Replication