
Scaling & Optimization: Handling the Load
How to survive Black Friday. Learn about Autoscaling, GPU Inference, TF-TRT, and optimizing latency for high-throughput serving.
Performance Engineering
Getting the model into a "Running" state is Step 1. Getting it to run fast and cheap is Step 2.
The two metrics you fight are:
- Latency: "Time per request" (e.g., 50ms).
- Throughput: "Requests per second" (e.g., 10,000 QPS).
1. Autoscaling Strategies
Vertex AI Prediction autoscales based on CPU utilization (or GPU duty cycle for GPU-backed replicas).
- Target Utilization: Defaults to 60%.
- If CPU load > 60%, add a node.
- If CPU load < 60%, remove a node.
The Cold Start Problem: It takes ~2-3 minutes to spin up a new node.
- Risk: If traffic spikes instantly (Black Friday start), the scaler is too slow, and users see errors.
- Fix: Min Replica Count. Set `min_replica_count=10` before the event starts to pre-warm the fleet.
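As a rough sketch, these knobs map onto the `Model.deploy()` call in the `google-cloud-aiplatform` SDK. The project, model resource name, and machine type below are placeholders, and the exact parameter set can vary by SDK version:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder project/region

# Placeholder model resource name -- replace with your own.
model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234")

endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=10,                    # pre-warm the fleet before the traffic spike
    max_replica_count=50,                    # ceiling for the autoscaler
    autoscaling_target_cpu_utilization=60,   # the 60% target discussed above
)
```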
2. Hardware Acceleration for Serving
Do you need a GPU for Serving?
- Recommendation Models (Tabular): No. Use CPU. They are typically I/O-bound, not compute-bound.
- ResNet/BERT (Vision/Text): Yes. Use T4 GPUs.
NVIDIA TensorRT, used through the TensorFlow-TensorRT integration (TF-TRT): a compiler that optimizes TensorFlow graphs specifically for NVIDIA GPUs. It fuses layers and optimizes memory usage.
- Impact: Can reduce latency by 2x-4x.
- Action: Convert your `SavedModel` using TF-TRT before deploying.
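A minimal conversion sketch using TensorFlow's TF-TRT converter. The input/output paths are placeholders, and the exact converter arguments vary by TensorFlow version (older releases take a `conversion_params` object instead of `precision_mode` directly):

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert an exported SavedModel into a TensorRT-optimized SavedModel.
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="exported_model/",    # placeholder path to your SavedModel
    precision_mode=trt.TrtPrecisionMode.FP16,   # FP16 is a common choice for T4 GPUs
)
converter.convert()                   # fuse layers and build TensorRT segments
converter.save("exported_model_trt/") # deploy this directory instead of the original
```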
3. Server-Side Batching
This is counter-intuitive. Scenario: you receive 100 requests per second.
- Naive way: process one request at a time. This thrashes GPU memory bandwidth.
- Smart way: wait 5ms, collect 5 requests into a Batch, and process them all at once.
- Result: Throughput increases massively. Latency per user increases slightly (by 5ms), but the system doesn't crash.
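In production you would enable this through your serving framework's batching options rather than writing it yourself, but a toy sketch of the idea (with a hypothetical `predict_fn` and a standard queue) makes the trade-off concrete:

```python
import queue
import time

def micro_batcher(request_queue, predict_fn, max_batch=5, max_wait_s=0.005):
    """Collect up to max_batch requests, waiting at most max_wait_s, then run one batched call."""
    while True:
        batch = [request_queue.get()]          # block until the first request arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                          # hit the 5 ms window: send what we have
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        predict_fn(batch)                      # one GPU call for the whole batch
```

Each user waits at most a few extra milliseconds, but the GPU now sees batches instead of single examples, which is what drives the throughput gain.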
4. Visualizing Scalability
```mermaid
graph TD
    Traffic[User Requests] --> LB[Load Balancer]
    subgraph "Vertex AI Endpoint"
        LB --> Replica1["Replica 1 (CPU < 60%)"]
        LB --> Replica2["Replica 2 (CPU < 60%)"]
        Replica1 -.->|"Load > 60%"| AutoScaler
        AutoScaler -.->|Events| Replica3["Provision Replica 3"]
    end
    style AutoScaler fill:#EA4335,stroke:#fff,stroke-width:2px,color:#fff
```
5. Summary
- Autoscaling saves money but has a lag time.
- Min Replicas protects against traffic spikes.
- TensorRT optimizes Deep Learning models for inference speed.
- Server-Side Batching improves Throughput at the cost of slightly higher Latency.
In the next lesson, we connect the dots. We move from "Manual Steps" to Pipelines.
Knowledge Check
Your model endpoint crashes every day at 9:00 AM when traffic spikes from 0 to 10,000 users in 1 minute. The autoscaler processes the load eventually, but the first 5 minutes are full of errors. What is the fix?