
Scaling & Optimization: Handling the Load
How to survive Black Friday. Learn about Autoscaling, GPU Inference, TF-TRT, and optimizing latency for high-throughput serving.
Performance Engineering
Getting the model into a "Running" state is Step 1. Getting it to run fast and cheap is Step 2.
The two metrics you fight are:
- Latency: "Time per request" (e.g., 50ms).
- Throughput: "Requests per second" (e.g., 10,000 QPS).
1. Autoscaling Strategies
Vertex AI Prediction autoscales based on CPU utilization (or GPU duty cycle for GPU-backed replicas).
- Target Utilization: Defaults to 60%.
- If CPU load > 60%, add a node.
- If CPU load < 60%, remove a node.
The Cold Start Problem: It takes ~2-3 minutes to spin up a new node.
- Risk: If traffic spikes instantly (Black Friday start), the scaler is too slow, and users see errors.
- Fix: Min Replica Count. Set `min_replica_count=10` before the event starts to pre-warm the fleet.
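As a rough sketch, these knobs map onto the `Model.deploy()` call in the `google-cloud-aiplatform` SDK. The project, model resource name, and machine type below are placeholders, and the exact parameter set can vary by SDK version:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder project/region

# Placeholder model resource name -- replace with your own.
model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234")

endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=10,                    # pre-warm the fleet before the traffic spike
    max_replica_count=50,                    # ceiling for the autoscaler
    autoscaling_target_cpu_utilization=60,   # the 60% target discussed above
)
```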
2. Hardware Acceleration for Serving
Do you need a GPU for Serving?
- Recommendation Models (Tabular): No. Use CPU. They are typically I/O-bound, not compute-bound.
- ResNet/BERT (Vision/Text): Yes. Use T4 GPUs.
NVIDIA TensorRT, used through the TensorFlow-TensorRT integration (TF-TRT): a compiler that optimizes TensorFlow graphs specifically for NVIDIA GPUs. It fuses layers and optimizes memory usage.
- Impact: Can reduce latency by 2x-4x.
- Action: Convert your `SavedModel` using TF-TRT before deploying.
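A minimal conversion sketch using TensorFlow's TF-TRT converter. The input/output paths are placeholders, and the exact converter arguments vary by TensorFlow version (older releases take a `conversion_params` object instead of `precision_mode` directly):

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert an exported SavedModel into a TensorRT-optimized SavedModel.
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="exported_model/",    # placeholder path to your SavedModel
    precision_mode=trt.TrtPrecisionMode.FP16,   # FP16 is a common choice for T4 GPUs
)
converter.convert()                   # fuse layers and build TensorRT segments
converter.save("exported_model_trt/") # deploy this directory instead of the original
```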
3. Server-Side Batching
This is counter-intuitive. Scenario: you receive 100 requests per second.
- Naive way: process one request at a time. This thrashes GPU memory bandwidth.
- Smart way: wait 5ms, collect 5 requests into a Batch, and process them all at once.
- Result: Throughput increases massively. Latency per user increases slightly (by 5ms), but the system doesn't crash.
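In production you would enable this through your serving framework's batching options rather than writing it yourself, but a toy sketch of the idea (with a hypothetical `predict_fn` and a standard queue) makes the trade-off concrete:

```python
import queue
import time

def micro_batcher(request_queue, predict_fn, max_batch=5, max_wait_s=0.005):
    """Collect up to max_batch requests, waiting at most max_wait_s, then run one batched call."""
    while True:
        batch = [request_queue.get()]          # block until the first request arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                          # hit the 5 ms window: send what we have
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        predict_fn(batch)                      # one GPU call for the whole batch
```

Each user waits at most a few extra milliseconds, but the GPU now sees batches instead of single examples, which is what drives the throughput gain.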
4. Visualizing Scalability
```mermaid
graph TD
    Traffic[User Requests] --> LB[Load Balancer]
    subgraph "Vertex AI Endpoint"
        LB --> Replica1["Replica 1 (CPU < 60%)"]
        LB --> Replica2["Replica 2 (CPU < 60%)"]
        Replica1 -.->|"Load > 60%"| AutoScaler
        AutoScaler -.->|Events| Replica3["Provision Replica 3"]
    end
    style AutoScaler fill:#EA4335,stroke:#fff,stroke-width:2px,color:#fff
```
5. Summary
- Autoscaling saves money but has a lag time.
- Min Replicas protects against traffic spikes.
- TensorRT optimizes Deep Learning models for inference speed.
- Server-Side Batching improves Throughput at the cost of slightly higher Latency.
In the next lesson, we connect the dots. We move from "Manual Steps" to Pipelines.
Knowledge Check
Your model endpoint crashes every day at 9:00 AM when traffic spikes from 0 to 10,000 users in 1 minute. The autoscaler processes the load eventually, but the first 5 minutes are full of errors. What is the fix?