
Kubernetes for AI: Orchestrating GPU Clusters
Master the deployment of AI at scale. Learn how to use Kubernetes to manage GPU resources, scale agentic workloads, and ensure high availability for self-hosted LLM services.
If you choose the "Self-Hosted" path (Module 8), you can't just run your AI model on one server and hope for the best. To handle thousands of users, you need Kubernetes (K8s).
Kubernetes is the "Operating System" for the data center. For an LLM Engineer, Kubernetes is the tool that ensures your GPUs are utilized efficiently and your AI agents never go offline.
1. Why Kubernetes for AI?
- GPU Scheduling: K8s knows which of your servers has a free GPU and can "Schedule" your model to run there.
- Auto-Scaling: If traffic spikes, K8s can automatically spin up more instances of your model (see the autoscaler sketch after this list).
- Self-Healing: If a GPU server crashes, K8s automatically restarts your model on a different server.
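To make the auto-scaling point concrete, here is a minimal sketch of a Horizontal Pod Autoscaler. The target name matches the llama-server Deployment shown later in this lesson; the replica bounds and the CPU threshold are illustrative assumptions, and real GPU services are often scaled on custom metrics such as queue depth or tokens per second instead.
# Minimal HPA sketch (autoscaling/v2). Names and thresholds are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-server        # the vLLM Deployment defined later in this lesson
  minReplicas: 2
  maxReplicas: 8              # capped by how many free GPUs the cluster actually has
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Note that the HPA can only add pods; if every GPU is already claimed, the new pods stay Pending until more hardware appears (the exercise at the end of this lesson returns to this point).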
2. The AI Deployment Stack on K8s
A professional AI deployment on K8s involves three specific components:
A. NVIDIA Device Plugin
Standard Kubernetes doesn't know what a GPU is. You must install the NVIDIA Device Plugin so that each node advertises its GPUs as a schedulable resource (nvidia.com/gpu) that pods can request.
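Once the plugin is running (it is usually deployed as a DaemonSet on every GPU node), the GPUs show up in each node's resource inventory. An illustrative excerpt of what a GPU node might report:
# Trimmed, illustrative excerpt of `kubectl get node <gpu-node> -o yaml`.
# The GPU count is an example; your nodes will report their own values.
status:
  capacity:
    nvidia.com/gpu: "4"     # four physical GPUs visible to the scheduler
  allocatable:
    nvidia.com/gpu: "4"     # all four can currently be requested by pods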
B. Deployment with vLLM
You package your model serving engine (like vLLM) in a Docker Container and ask K8s to run multiple copies (replicas) of it.
C. Resource Limits
In K8s, you must be explicit about your resources.
resources:
  limits:
    nvidia.com/gpu: 1   # request exactly 1 GPU (for GPUs, the request defaults to the limit)
  requests:
    cpu: "4"
    memory: "16Gi"
3. Handling "State" in Agentic K8s
AI Agents are "Stateful" (they have memory). If a user is in the middle of a complex 10-step reasoning loop and the K8s pod restarts, the user loses their progress.
The Solution: Use Redis as an external "State Store." Every time an agent completes a step in the graph, it writes the state to Redis. If the K8s pod dies, the new pod reads from Redis and continues exactly where the old one left off.
graph TD
A[User Request] --> B[Inbound Load Balancer]
B --> C[K8s Pod 1: Agent]
B --> D[K8s Pod 2: Agent]
C --> E[(Global State: Redis)]
D --> E
C --> F((GPU Node 1))
D --> G((GPU Node 2))
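The Kubernetes side of the diagram above is straightforward to sketch: run Redis behind a Service and hand every agent pod the store's address through an environment variable. Everything below is a minimal illustration, not a production recipe; the names (agent-state-redis, agent-worker, REDIS_HOST) and the agent image are assumptions, and a real deployment would add persistence and authentication for Redis.
# Minimal sketch: in-cluster Redis as the agents' external state store.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-state-redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: agent-state-redis
  template:
    metadata:
      labels:
        app: agent-state-redis
    spec:
      containers:
        - name: redis
          image: redis:7
          ports:
            - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: agent-state-redis
spec:
  selector:
    app: agent-state-redis
  ports:
    - port: 6379
---
# Agent pods read the store's address from an environment variable, so a restarted
# pod can reload the last checkpoint and resume the reasoning loop where it stopped.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: agent-worker
  template:
    metadata:
      labels:
        app: agent-worker
    spec:
      containers:
        - name: agent
          image: registry.example.com/agent:latest   # hypothetical agent image
          env:
            - name: REDIS_HOST
              value: agent-state-redis               # resolves to the Service above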
4. Serving Different Models via One Cluster
You can run KServe or Seldon Core on top of Kubernetes. These tools let you host multiple models (e.g., Llama 3 for chat and BERT for classification) on the same cluster, each behind its own inference endpoint, and route each request to the right model based on the endpoint or model name it targets.
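As a rough illustration, a KServe InferenceService wrapping the same vLLM image might look like the sketch below. The resource name is made up, and the fields follow KServe's v1beta1 custom-predictor convention; check the documentation for the KServe version you install before relying on it.
# Rough sketch of a KServe InferenceService with a custom vLLM predictor.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-chat
spec:
  predictor:
    containers:
      - name: kserve-container
        image: vllm/vllm-openai:latest
        args: ["--model", "meta-llama/Meta-Llama-3-8B"]
        resources:
          limits:
            nvidia.com/gpu: 1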
Code Concept: A Kubernetes Deployment for vLLM
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llama-server
  template:
    metadata:
      labels:
        app: llama-server       # must match the selector above
    spec:
      containers:
        - name: vllm-engine
          image: vllm/vllm-openai:latest
          args: ["--model", "meta-llama/Meta-Llama-3-8B"]
          ports:
            - containerPort: 8000   # vLLM's OpenAI-compatible API
          resources:
            limits:
              nvidia.com/gpu: 1     # each replica pins one full GPU
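To actually receive traffic, the Deployment is normally paired with a Service that load-balances requests across the three replicas. A minimal sketch (the Service name and cluster-facing port are assumptions consistent with the manifest above):
apiVersion: v1
kind: Service
metadata:
  name: llama-server
spec:
  type: ClusterIP
  selector:
    app: llama-server     # matches the pod labels in the Deployment above
  ports:
    - port: 80            # cluster-facing port
      targetPort: 8000    # vLLM's OpenAI-compatible API port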
Summary
- Kubernetes is the infrastructure layer for self-hosted production AI.
- GPU Scheduling places pods onto nodes with free GPUs, so your expensive hardware doesn't sit idle.
- External State (Redis) is required to make agentic workflows resilient to pod restarts.
- Container images (built with Docker) are how you package your model-serving logic so K8s can run it anywhere.
In the next lesson, we will look at the opposite of Kubernetes: Serverless AI Functions, the lightest way to add AI to your app.
Exercise: The Scaling Crisis
Your K8s cluster has 4 GPUs. You are running 4 pods of Llama 3 (each pod uses 1 GPU). A massive spike of 1,000 new users arrives.
- Can your cluster automatically scale to handle these new users?
- What is the physical bottleneck? (CPU, RAM, or the number of GPUs?)
Answer Logic:
- No, not by adding pods alone: all 4 GPUs are already claimed, so new pods would sit in Pending. You need "Cluster Autoscaling" enabled so the cluster can rent additional GPU nodes from the cloud provider.
- GPUs. You cannot "Split" a physical GPU between more pods if the VRAM is already full. You need more physical hardware.