
Scaling & Optimization: Handling the Load

How to survive Black Friday. Learn about Autoscaling, GPU Inference, TF-TRT, and optimizing latency for high-throughput serving.

Performance Engineering

Getting the model to a "Running" state is Step 1. Getting it to run fast and cheap is Step 2.

The two metrics you fight are:

  1. Latency: "Time per request" (e.g., 50ms).
  2. Throughput: "Requests per second" (e.g., 10,000 QPS).
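These two metrics trade off through the number of replicas you run. A quick back-of-envelope sizing sketch in Python (the worker count and traffic numbers are illustrative, and it assumes each worker handles one request at a time):

import math

# Rough capacity planning: how many replicas does a latency/throughput target imply?
latency_s = 0.050          # 50ms per request
workers_per_replica = 8    # hypothetical number of serving workers per node
peak_qps = 10_000          # target throughput at peak

throughput_per_replica = workers_per_replica / latency_s       # ~160 QPS per node
replicas_needed = math.ceil(peak_qps / throughput_per_replica)
print(replicas_needed)                                          # ~63 replicas at peak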

1. Autoscaling Strategies

Vertex AI Prediction scales based on CPU/GPU Utilization.

  • Target Utilization: Defaults to 60%.
  • If CPU load > 60%, add a node.
  • If CPU load < 60%, remove a node.

The Cold Start Problem: It takes ~2-3 minutes to spin up a new node.

  • Risk: If traffic spikes instantly (Black Friday start), the scaler is too slow, and users see errors.
  • Fix: Min Replica Count. Set min_replica_count=10 before the event starts to pre-warm the fleet (see the sketch after this list).
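A minimal deployment sketch, assuming the google-cloud-aiplatform Python SDK and a model already uploaded to the Vertex AI Model Registry (the project, region, and model ID below are placeholders):

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")   # placeholder project/region

model = aiplatform.Model("projects/my-project/locations/us-central1/models/123")  # placeholder ID
endpoint = aiplatform.Endpoint.create(display_name="black-friday-endpoint")

model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",
    min_replica_count=10,                    # pre-warm the fleet before the spike
    max_replica_count=50,                    # ceiling so the bill stays bounded
    autoscaling_target_cpu_utilization=60,   # the default target, shown explicitly
)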

2. Hardware Acceleration for Serving

Do you need a GPU for Serving?

  • Recommendation Models (Tabular): No. Use CPU. They are I/O-bound, not compute-bound.
  • ResNet/BERT (Vision/Text): Yes. Use T4 GPUs (see the GPU deploy sketch below).
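For the deep-learning case, the same deploy call can attach an accelerator and scale on GPU duty cycle instead of CPU. A hedged sketch, reusing the endpoint from the previous section (machine type and replica counts are illustrative):

model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",             # one T4 per replica
    accelerator_count=1,
    min_replica_count=2,
    max_replica_count=20,
    autoscaling_target_accelerator_duty_cycle=60,   # scale on GPU utilization, not CPU
)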

NVIDIA TensorRT (TF-TRT): TF-TRT is TensorFlow's integration with NVIDIA's TensorRT compiler. It optimizes TensorFlow graphs for NVIDIA GPUs by fusing layers and optimizing memory usage.

  • Impact: Can reduce latency by 2x-4x.
  • Action: Convert your SavedModel using TF-TRT before deploying (see the sketch below).
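A minimal conversion sketch, assuming TensorFlow 2.x with TensorRT support installed; the input and output paths are placeholders:

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert a SavedModel into a TF-TRT optimized SavedModel (FP16 is a common serving precision).
params = trt.TrtConversionParams(precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="exported_model/",    # placeholder: the original SavedModel
    conversion_params=params,
)
converter.convert()
converter.save("exported_model_trt/")           # deploy this directory instead of the original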

3. Server-Side Batching

This is counter-intuitive. Scenario: you receive 100 requests per second. Naive Way: process them one at a time. This underutilizes the GPU and thrashes its memory bandwidth. Smart Way: wait up to 5ms, collect the requests that arrive in that window into a single batch, and process them all in one forward pass.

  • Result: Throughput increases massively. Latency per user increases slightly (by up to the 5ms batching window), but the system doesn't crash. A sketch of the mechanics follows below.
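Managed serving stacks typically provide this for you (TensorFlow Serving, for example, ships with a built-in batching option), but the mechanics are easy to sketch. A hypothetical micro-batcher using Python asyncio; predict_fn stands in for whatever batched model call you actually use:

import asyncio

MAX_WAIT_S = 0.005    # wait at most 5ms to fill a batch
MAX_BATCH = 32        # never batch more than this many requests together

queue: asyncio.Queue = asyncio.Queue()

async def handle_request(payload):
    # Called once per incoming request; resolves when the batched prediction is ready.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((payload, fut))
    return await fut

async def batcher(predict_fn):
    # Background task: drain the queue every few milliseconds and make ONE model call per batch.
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]                  # block until at least one request arrives
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        outputs = predict_fn([p for p, _ in batch])  # one GPU call for the whole batch
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)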

4. Visualizing Scalability

graph TD
    Traffic[User Requests] --> LB[Load Balancer]
    
    subgraph "Vertex AI Endpoint"
    LB --> Replica1["Replica 1 (CPU < 60%)"]
    LB --> Replica2["Replica 2 (CPU < 60%)"]
    
    Replica1 -.->|"Load > 60%"| AutoScaler[Autoscaler]
    AutoScaler -.->|Events| Replica3[Provision Replica 3]
    end
    
    style AutoScaler fill:#EA4335,stroke:#fff,stroke-width:2px,color:#fff

5. Summary

  • Autoscaling saves money but has a lag time.
  • Min Replicas protects against traffic spikes.
  • TensorRT optimizes Deep Learning models for inference speed.
  • Server-Side Batching improves Throughput at the cost of a slight increase in Latency.

In the next lesson, we connect the dots. We move from "Manual Steps" to Pipelines.

