Cloud-native backup and restore (Velero)

Master the insurance policy. Learn how to use Velero to back up your cluster's metadata and persistent volumes, enabling disaster recovery and painless migrations between cloud providers.

Velero: The Cluster Insurance Policy

What happens if you accidentally delete your production namespace? Or if an entire AWS region goes offline? Or if a ransomware attack encrypts your etcd database?

If you don't have a backup, you have to rebuild everything from scratch. You lose your ConfigMaps, your Secrets, your deployment history, and worst of all, you lose the data in your Persistent Volumes (databases, AI training logs, user uploads).

Standard "cloud backups" of your underlying virtual machines are not enough for Kubernetes. You need a Kubernetes-native backup tool. In this lesson, we will master Velero, the industry standard for backing up both the cluster "Brain" (metadata) and the cluster "Body" (persistent data) to an external cloud vault (S3, GCS, or Azure Blob).


1. Why standard backups fail in K8s

In a traditional server, you just "Snapshot the Disk." In Kubernetes, your state is split:

  • Metadata: Lives in etcd. On managed services (EKS, GKE, AKS) the control plane is hidden from you, so you can't simply snapshot it.
  • Data: Lives in cloud-native disks (EBS, PD, Azure Disk) attached to specific nodes.

If you restore the disk without the metadata, the pods won't know the disk exists. If you restore the metadata without the disk, the pods will crash. Velero solves this by synchronizing the two.


2. How Velero Works: The Backup Lifecycle

Velero is an Operator (Module 12.2) that runs in your cluster.

  1. Request: You run velero backup create my-backup.
  2. Metadata Capture: Velero queries the API Server for every resource in the cluster (or a specific namespace) and saves the JSON to a file.
  3. Data Capture: Velero talks to the cloud provider (AWS/GCP/Azure) and tells it to "Snapshot" every Persistent Volume Claim (PVC) associated with those resources.
  4. Export: Both the JSON files and the snapshot references are bundled into a "Backup Package" and uploaded to a Cloud Storage Bucket.
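Under the hood, velero backup create simply creates a Backup custom resource that the Velero operator acts on. A roughly equivalent manifest might look like the sketch below (the production namespace and 30-day TTL are assumptions for illustration; field names follow Velero's Backup API):

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: my-backup
  namespace: velero          # Velero's own install namespace
spec:
  includedNamespaces:
    - production             # assumed: back up only this namespace
  snapshotVolumes: true      # step 3: trigger cloud snapshots of PVs
  ttl: 720h0m0s              # keep this backup for 30 days
```

Because it is just another Kubernetes object, you can manage backups with kubectl, GitOps, or the CLI interchangeably.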

3. Disaster Recovery: The Restore Workflow

Imagine the worst has happened. Your cluster is gone.

  1. Step 1: Spin up a brand-new, empty Kubernetes cluster.
  2. Step 2: Install Velero on the new cluster.
  3. Step 3: Point Velero to the same storage bucket.
  4. Step 4: Run velero restore create --from-backup my-backup.

Velero will recreate the Namespaces, then the Secrets, then the PVCs. It will tell the cloud provider to "Rehydrate" the disks from the snapshots. Finally, it creates the Pods, which wake up and find their data exactly where they left it.
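The four steps above look roughly like this at the command line. This is a sketch, not a copy-paste recipe: it assumes AWS, a bucket named my-velero-vault, a credentials file at ./credentials-velero, and a plugin version current at the time of writing.

```shell
# Step 2–3: install Velero on the new cluster, pointed at the same bucket
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket my-velero-vault \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero

# Velero syncs existing backups from the bucket automatically; verify first
velero backup get

# Step 4: restore, then watch for warnings or partial failures
velero restore create --from-backup my-backup
velero restore describe <restore-name>
```

Note that Velero discovers the old backups by reading the bucket; you never have to copy anything into the new cluster by hand.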


4. Visualizing the Velero Safety Net

graph TD
    subgraph "Live Kubernetes Cluster"
        P["Pods / Services"]
        PV["Persistent Volumes (Data)"]
        V["Velero Agent"]
    end
    
    subgraph "External Cloud Vault"
        Bucket["S3 / GCS Bucket"]
        Snap["Cloud Snapshots"]
    end
    
    V -- "1. Scrape Metadata" --> P
    V -- "2. Trigger Snapshot" --> PV
    V -- "3. Upload Package" --> Bucket
    PV -- "Image Data" --> Snap
    
    style Bucket fill:#f96,stroke:#333
    style Snap fill:#9cf,stroke:#333

5. Migrating Between Clouds

Because Velero stores your metadata in a standardized format, it is the #1 tool for Cloud Migration.

  • Step 1: Backup your production cluster in AWS (EKS).
  • Step 2: Restore that backup into a new cluster in Google Cloud (GKE).

Be aware that native disk snapshots are not portable between clouds: an EBS snapshot means nothing to GCP. For cross-cloud moves, you configure Velero's file-system backup (restic/Kopia), which copies the volume contents into the object-storage bucket itself so they can be restored onto any provider's disks (GCP Persistent Disks, in this case).
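A migration-friendly backup can be taken with the file-system backup flag (available in recent Velero releases; the namespace name here is an assumption):

```shell
# Copy volume contents into the bucket via file-system backup (Kopia/restic)
# instead of provider-native snapshots, so GKE can restore them.
velero backup create eks-to-gke-migration \
  --include-namespaces production \
  --default-volumes-to-fs-backup
```

File-system backup is slower than native snapshots, but the data lands in neutral object storage, which is exactly what portability requires.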


6. Practical Example: Scheduling Automated Backups

You shouldn't run backups manually. Use a Schedule (Cron job for Velero).

# Create a backup that runs every 24 hours and is kept for 30 days
velero schedule create daily-backup \
  --schedule="0 1 * * *" \
  --ttl 720h0m0s
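The CLI command above creates a Schedule custom resource. If you prefer to keep it in Git, a roughly equivalent manifest (field names per Velero's Schedule API) looks like:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 1 * * *"      # standard cron syntax: every day at 01:00
  template:                  # same fields as a Backup spec
    ttl: 720h0m0s            # keep each backup for 30 days
```

Each trigger stamps out a new Backup object, so old backups age out automatically once their TTL expires.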

7. AI Implementation: Protecting "Non-Git" Data

In AI development, some things are in Git (code), but some are not:

  • Training Progress: Checkpoints of a partially trained LLM.
  • Vector Database: 50 GB of embeddings in a self-hosted store like Milvus, living on a Persistent Volume.
  • Dataset Cache: Pre-processed images that took 48 hours to generate.

The AI Insurance Policy:

If your cluster crashes, you don't want to lose 48 hours of expensive GPU compute.

  1. Use Velero to back up your Vector DB namespace.
  2. Run the backup every 4 hours.
  3. In the event of a crash, you lose a maximum of 4 hours of work, instead of 48.
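The three steps above translate into a single schedule. This sketch assumes the vector database lives in a namespace called vector-db and that 72 hours of history is enough:

```shell
# Every 4 hours, on the hour; keep 72h of history (18 backups on disk)
velero schedule create vector-db-4h \
  --schedule="0 */4 * * *" \
  --include-namespaces vector-db \
  --ttl 72h0m0s
```

The 4-hour cadence is your Recovery Point Objective (RPO): the maximum amount of GPU work you are willing to redo after a crash.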

8. Summary and Key Takeaways

  • Velero: The essential disaster recovery tool for K8s.
  • Metadata + Data: Velero backs up both the YAML and the physical disks.
  • Cloud-Native: Integrates directly with S3, EBS, Azure Blob, etc.
  • Restoration: Allows for the total recreation of a cluster from a simple bucket.
  • Mobility: Enables easy migration between clouds (EKS to GKE).
  • Disaster Recovery Drills: Always test your restore process! A backup you haven't restored is just "Stored Garbage."

Congratulations!

You have completed Module 13: Kubernetes on Cloud Platforms. You are now a master of the "Real World." You can deploy to AWS, GCP, and Azure, manage global fleets, and ensure your data is safe from any disaster.

Next Stop: In Module 14: Real-World Projects, we will put everything together to build a Production-Grade AI Inference Pipeline.


9. SEO Metadata & Keywords

Focus Keywords: Kubernetes disaster recovery with Velero, backing up K8s persistent volumes AWS S3, Velero restore tutorial, migrate EKS to GKE with Velero, K8s backup vs etcd snapshot, scheduling Velero backups.

Meta Description: Don't let a cloud outage kill your business. Learn how to use Velero, the industry-standard backup tool, to protect your Kubernetes metadata and data, enabling instant disaster recovery and seamless migrations for your AI and web applications.
