
Model Serving: Vertex AI Prediction
Batch vs. Online Prediction. How to deploy models to endpoints, manage versions, and optimize for latency.
The "Production" Moment
Training is hard. Serving is harder. If training fails, you try again tomorrow. If serving fails, the website goes down and customers leave.
Vertex AI Prediction is the managed service for serving. It handles:
- HTTP/gRPC Endpoints.
- Autoscaling (add or remove replicas as traffic changes; online endpoints keep at least one replica running).
- Monitoring.
1. Batch vs. Online Prediction
| Feature | Online Prediction | Batch Prediction |
|---|---|---|
| Latency | Milliseconds | Hours/Days |
| Input | JSON/HTTP Request | GCS Files / BigQuery Table |
| Output | JSON Response | GCS Files / BigQuery Table |
| Use Case | Fraud Detection, Chatbots, Search | Weekly Sales Forecast, Sentiment analysis of archived emails |
| Cost | 24/7 Server Cost | Pay only for the duration of the job |
Exam Tip: If the scenario says "Immediate response" -> Online. If it says "Process overnight" or "Cost optimized" -> Batch.
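The cost row in the table is worth making concrete. The sketch below compares an always-on online endpoint against a nightly batch job; the hourly rate is a hypothetical placeholder, not a real Vertex AI price, so substitute your region's actual rates.

```python
# Back-of-the-envelope cost comparison: always-on online endpoint vs. a
# nightly batch job. HOURLY_NODE_RATE is a HYPOTHETICAL placeholder, not
# a real Vertex AI price.
HOURLY_NODE_RATE = 0.20  # $/node-hour (assumed)

def online_monthly_cost(nodes: int, rate: float = HOURLY_NODE_RATE) -> float:
    """Online endpoints bill for every hour the replicas stay up: 24 * 30."""
    return nodes * rate * 24 * 30

def batch_monthly_cost(job_hours: float, nodes: int,
                       runs_per_month: int,
                       rate: float = HOURLY_NODE_RATE) -> float:
    """Batch jobs bill only while the job is actually running."""
    return job_hours * nodes * runs_per_month * rate

always_on = online_monthly_cost(nodes=1)                          # 1 replica, 24/7
nightly = batch_monthly_cost(job_hours=2, nodes=4, runs_per_month=30)

print(f"Online 24/7:   ${always_on:.2f}/month")
print(f"Nightly batch: ${nightly:.2f}/month")
```

Even with four nodes per run, the nightly batch job costs a fraction of keeping one replica up around the clock, which is why "process overnight" scenarios point to Batch.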
2. Pre-built vs. Custom Containers
Just like Training, Serving supports:
- Pre-built Containers: Upload your `saved_model.pb`; Vertex AI serves it using a standard TensorFlow Serving image. (Easy.)
- Custom Containers: Package Flask/FastAPI inside Docker. (Flexible; essential if you have custom C++ dependencies.)
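A custom container only has to honor a simple HTTP contract: respond to a health probe, and answer `{"instances": [...]}` POSTs with `{"predictions": [...]}` (Vertex AI passes the port and routes via the `AIP_HTTP_PORT`, `AIP_PREDICT_ROUTE`, and `AIP_HEALTH_ROUTE` environment variables). The stdlib sketch below illustrates that contract with a stand-in "model" that just sums the input features; a real container would load your artifact at startup.

```python
# Minimal sketch of the HTTP contract a custom serving container must meet.
# The "model" here is a placeholder (sums each instance's features).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(instances):
    # Placeholder inference: replace with real model code.
    return [sum(features) for features in instances]

class PredictHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Health route: the platform probes this before sending traffic.
        self.send_response(200 if self.path == "/health" else 404)
        self.end_headers()

    def do_POST(self):
        if self.path != "/predict":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        payload = json.dumps({"predictions": predict(body["instances"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep request logging quiet for the sketch

def serve(port: int = 8080):
    # Vertex AI supplies the real port via AIP_HTTP_PORT (default 8080).
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Call `serve()` as the container's entrypoint; the route names shown are assumptions for the sketch and would normally be read from the `AIP_*` variables.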
3. Traffic Splitting (Canary Deployment)
You have Model v1 serving 100% traffic. You trained Model v2.
Do not just swap them. If v2 is broken, you crash the system.
Vertex AI Endpoints allow Traffic Splitting.
- Route 90% to `v1`.
- Route 10% to `v2`.
- Monitor `v2` errors.
- If good, roll out to 100%.
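The effect of a 90/10 split can be simulated in a few lines. This is an illustrative sketch of weighted routing, not Vertex AI's actual router: each request is independently assigned to a version with probability proportional to its weight, just as an endpoint's `traffic_split` distributes requests.

```python
# Illustrative weighted routing: each request independently lands on a
# version with probability proportional to its weight. This mimics an
# Endpoint's traffic_split behaviour; it is NOT Vertex AI internals.
import random

def route(traffic_split: dict, rng: random.Random) -> str:
    """Pick a model version with probability proportional to its weight."""
    versions = list(traffic_split)
    weights = [traffic_split[v] for v in versions]
    return rng.choices(versions, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded for reproducibility
split = {"v1": 90, "v2": 10}
counts = {"v1": 0, "v2": 0}
for _ in range(10_000):
    counts[route(split, rng)] += 1

print(counts)  # roughly 9000 for v1 and 1000 for v2
```

Because only ~10% of requests hit `v2`, a broken `v2` degrades a tenth of traffic instead of all of it, which is the whole point of the canary.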
4. Code Example: Deploying a Model
```python
from google.cloud import aiplatform

# 1. Upload the model artifact
model = aiplatform.Model.upload(
    display_name="my-churn-model",
    artifact_uri="gs://my-bucket/model-output/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest",
)

# 2. Create an endpoint
endpoint = aiplatform.Endpoint.create(display_name="churn-endpoint")

# 3. Deploy the model to the endpoint (with traffic split)
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=10,  # autoscale up to 10 replicas
    traffic_percentage=100,
)

print(f"Endpoint resource name: {endpoint.resource_name}")
```
5. Summary
- Batch: Cheap, slow, bulk.
- Online: Fast, expensive, real-time.
- Endpoints: The abstraction that allows Traffic Splitting between multiple Model Versions.
In the next lesson, we tackle scalability: what happens on Black Friday? Scaling Online Serving.