Scaling & Optimization: Handling the Load

How to survive Black Friday. Learn about Autoscaling, GPU Inference, TF-TRT, and optimizing latency for high-throughput serving.

Performance Engineering

Getting the model into a "Running" state is Step 1. Getting it to run fast and cheap is Step 2.

The two metrics you fight are:

  1. Latency: "Time per request" (e.g., 50ms).
  2. Throughput: "Requests per second" (e.g., 10,000 QPS).
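
A quick back-of-the-envelope calculation shows how the two interact when you plan capacity. This is a rough sketch only: the concurrency-per-replica figure is an assumption you would normally measure with a load test, and the other numbers simply reuse the examples above.

```python
# Rough capacity planning: how many replicas to hit a QPS target?
# Assumptions (illustrative): 50ms per request, 8 concurrent requests
# per replica, 10,000 QPS target.
import math

latency_s = 0.050            # time per request (Latency)
concurrency_per_replica = 8  # assumed parallel requests one replica can serve
target_qps = 10_000          # desired Throughput

# Little's Law style estimate: one replica sustains concurrency / latency QPS.
qps_per_replica = concurrency_per_replica / latency_s          # = 160 QPS
replicas_needed = math.ceil(target_qps / qps_per_replica)      # = 63

print(f"~{qps_per_replica:.0f} QPS per replica, {replicas_needed} replicas needed")
```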

1. Autoscaling Strategies

Vertex AI Prediction scales based on CPU/GPU Utilization.

  • Target Utilization: Defaults to 60%.
  • If CPU load > 60%, add a node.
  • If CPU load < 60%, remove a node.

The Cold Start Problem: It takes ~2-3 minutes to spin up a new node.

  • Risk: If traffic spikes instantly (Black Friday start), the scaler is too slow, and users see errors.
  • Fix: Min Replica Count. Set min_replica_count=10 before the event starts to pre-warm the fleet.
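
Here is what pre-warming might look like with the Vertex AI Python SDK. A minimal sketch: the project, model ID, and machine type are placeholders, and the autoscaling keyword arguments are those exposed by google-cloud-aiplatform's Model.deploy; verify them against your SDK version.

```python
# Sketch: deploy with a pre-warmed fleet before a known traffic spike.
# PROJECT, MODEL_ID, and the machine type are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="PROJECT", location="us-central1")

model = aiplatform.Model("projects/PROJECT/locations/us-central1/models/MODEL_ID")

endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=10,                    # pre-warm: never scale below 10 nodes
    max_replica_count=50,                    # headroom for the spike
    autoscaling_target_cpu_utilization=60,   # the default 60% target, made explicit
)
```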

2. Hardware Acceleration for Serving

Do you need a GPU for Serving?

  • Recommendation Models (Tabular): No. Use CPU. It's IO-bound, not compute-bound.
  • ResNet/BERT (Vision/Text): Yes. Use T4 GPUs.
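
If the model does need a GPU, the deployment call looks like the autoscaling sketch above, just with an accelerator attached. The machine type and replica counts below are illustrative, and autoscaling_target_accelerator_duty_cycle is the SDK's GPU counterpart to the CPU target; again, check it against your SDK version.

```python
# Sketch: attach a T4 GPU and autoscale on GPU duty cycle instead of CPU.
# Continues from the sketch above; `model` is the same aiplatform.Model object.
endpoint = model.deploy(
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    min_replica_count=2,
    max_replica_count=10,
    autoscaling_target_accelerator_duty_cycle=60,  # scale on GPU load, not CPU
)
```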

NVIDIA TensorRT (TF-TRT): TF-TRT integrates NVIDIA's TensorRT compiler into TensorFlow, optimizing the graph specifically for NVIDIA GPUs. It fuses layers and optimizes memory usage.

  • Impact: Can reduce latency by 2x-4x.
  • Action: Convert your SavedModel using TF-TRT before deploying.
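
Here is what that conversion might look like, as a minimal sketch: the paths are placeholders, FP16 is an assumed precision choice (check it against your accuracy budget), and the exact converter arguments vary across TensorFlow versions.

```python
# Sketch: convert a SavedModel with TF-TRT before deploying to a T4 endpoint.
# Requires a TensorFlow build with TensorRT support and an NVIDIA GPU.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="saved_model/original",      # placeholder path
    precision_mode=trt.TrtPrecisionMode.FP16,          # assumed precision choice
)
converter.convert()                                    # fuse layers, pick TRT kernels
converter.save("saved_model/trt_fp16")                 # upload and deploy this directory
```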

3. Server-Side Batching

This is counter-intuitive.

  • Scenario: You receive 100 requests per second.
  • Naive Way: Process them 1 at a time. This thrashes the GPU memory bandwidth.
  • Smart Way: Wait 5ms, collect 5 requests into a batch, and process them all at once.

  • Result: Throughput increases massively. Latency per user increases slightly (by 5ms), but the system doesn't crash.
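
Serving frameworks such as TensorFlow Serving can do this batching for you via configuration, but the core idea fits in a short sketch. This is a toy illustration, not production code: run_model_on_batch is a hypothetical stand-in for your batched predict call, and the 5ms window and batch size of 5 mirror the numbers above.

```python
# Toy server-side batcher: wait up to 5ms, collect up to 5 requests, run once.
import queue
import threading
import time

MAX_BATCH_SIZE = 5
MAX_WAIT_S = 0.005  # the 5ms batching window

requests_q = queue.Queue()  # holds (input, reply_queue) tuples

def run_model_on_batch(inputs):
    # Hypothetical stand-in for a single batched model.predict(...) call.
    return [f"prediction for {x}" for x in inputs]

def batching_loop():
    while True:
        batch = [requests_q.get()]                 # block until the first request
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests_q.get(timeout=remaining))
            except queue.Empty:
                break
        predictions = run_model_on_batch([x for x, _ in batch])
        for (_, reply_q), pred in zip(batch, predictions):
            reply_q.put(pred)                      # hand each caller its own result

def predict(x):
    """Called per request; blocks until the batched result for x is ready."""
    reply_q = queue.Queue(maxsize=1)
    requests_q.put((x, reply_q))
    return reply_q.get()

threading.Thread(target=batching_loop, daemon=True).start()
print(predict("user_42_features"))  # -> "prediction for user_42_features"
```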

4. Visualizing Scalability

graph TD
    Traffic[User Requests] --> LB[Load Balancer]
    
    subgraph "Vertex AI Endpoint"
    LB --> Replica1["Replica 1 (CPU < 60%)"]
    LB --> Replica2["Replica 2 (CPU < 60%)"]
    
    Replica1 -.->|Load > 60%| AutoScaler
    AutoScaler -.->|Events| Replica3[Provision Replica 3]
    end
    
    style AutoScaler fill:#EA4335,stroke:#fff,stroke-width:2px,color:#fff

5. Summary

  • Autoscaling saves money but has a lag time.
  • Min Replicas protects against traffic spikes.
  • TensorRT optimizes Deep Learning models for inference speed.
  • Server-Side Batching improves Throughput at the cost of slightly higher Latency.

In the next lesson, we connect the dots. We move from "Manual Steps" to Pipelines.


Knowledge Check

Your model endpoint crashes every day at 9:00 AM when traffic spikes from 0 to 10,000 users in 1 minute. The autoscaler processes the load eventually, but the first 5 minutes are full of errors. What is the fix?
