
Model Serving: Vertex AI Prediction
Batch vs. Online Prediction. How to deploy models to endpoints, manage versions, and optimize for latency.
The "Production" Moment
Training is hard. Serving is harder. If training fails, you try again tomorrow. If serving fails, the website goes down and customers leave.
Vertex AI Prediction is the managed service for serving. It handles:
- HTTP/gRPC Endpoints.
- Autoscaling (add or remove replicas as traffic changes; online endpoints keep at least one replica running).
- Monitoring.
1. Batch vs. Online Prediction
| Feature | Online Prediction | Batch Prediction |
|---|---|---|
| Latency | Milliseconds | Hours/Days |
| Input | JSON/HTTP Request | GCS Files / BigQuery Table |
| Output | JSON Response | GCS Files / BigQuery Table |
| Use Case | Fraud Detection, Chatbots, Search | Weekly Sales Forecast, Sentiment analysis of archived emails |
| Cost | 24/7 Server Cost | Pay only for the duration of the job |
Exam Tip: If the scenario says "Immediate response" -> Online. If it says "Process overnight" or "Cost optimized" -> Batch.
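The cost row in the table is worth making concrete. The sketch below compares an always-on online endpoint against a nightly batch job; the hourly rate is a hypothetical placeholder, not a real Vertex AI price, so substitute your region's actual rates.

```python
# Back-of-the-envelope cost comparison: always-on online endpoint vs. a
# nightly batch job. HOURLY_NODE_RATE is a HYPOTHETICAL placeholder, not
# a real Vertex AI price.
HOURLY_NODE_RATE = 0.20  # $/node-hour (assumed)

def online_monthly_cost(nodes: int, rate: float = HOURLY_NODE_RATE) -> float:
    """Online endpoints bill for every hour the replicas stay up: 24 * 30."""
    return nodes * rate * 24 * 30

def batch_monthly_cost(job_hours: float, nodes: int,
                       runs_per_month: int,
                       rate: float = HOURLY_NODE_RATE) -> float:
    """Batch jobs bill only while the job is actually running."""
    return job_hours * nodes * runs_per_month * rate

always_on = online_monthly_cost(nodes=1)                          # 1 replica, 24/7
nightly = batch_monthly_cost(job_hours=2, nodes=4, runs_per_month=30)

print(f"Online 24/7:   ${always_on:.2f}/month")
print(f"Nightly batch: ${nightly:.2f}/month")
```

Even with four nodes per run, the nightly batch job costs a fraction of keeping one replica up around the clock, which is why "process overnight" scenarios point to Batch.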
2. Pre-built vs. Custom Containers
Just like Training, Serving supports:
- Pre-built Containers: Upload your `saved_model.pb`; Vertex AI serves it using a standard TensorFlow Serving image. (Easy.)
- Custom Containers: Package Flask/FastAPI inside Docker. (Flexible; essential if you have custom C++ dependencies.)
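A custom container only has to honor a simple HTTP contract: respond to a health probe, and answer `{"instances": [...]}` POSTs with `{"predictions": [...]}` (Vertex AI passes the port and routes via the `AIP_HTTP_PORT`, `AIP_PREDICT_ROUTE`, and `AIP_HEALTH_ROUTE` environment variables). The stdlib sketch below illustrates that contract with a stand-in "model" that just sums the input features; a real container would load your artifact at startup.

```python
# Minimal sketch of the HTTP contract a custom serving container must meet.
# The "model" here is a placeholder (sums each instance's features).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(instances):
    # Placeholder inference: replace with real model code.
    return [sum(features) for features in instances]

class PredictHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Health route: the platform probes this before sending traffic.
        self.send_response(200 if self.path == "/health" else 404)
        self.end_headers()

    def do_POST(self):
        if self.path != "/predict":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        payload = json.dumps({"predictions": predict(body["instances"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep request logging quiet for the sketch

def serve(port: int = 8080):
    # Vertex AI supplies the real port via AIP_HTTP_PORT (default 8080).
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Call `serve()` as the container's entrypoint; the route names shown are assumptions for the sketch and would normally be read from the `AIP_*` variables.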
3. Traffic Splitting (Canary Deployment)
You have Model v1 serving 100% traffic. You trained Model v2.
Do not just swap them. If v2 is broken, you crash the system.
Vertex AI Endpoints allow Traffic Splitting.
- Route 90% to `v1`.
- Route 10% to `v2`.
- Monitor `v2` errors.
- If good, roll out to 100%.
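The effect of a 90/10 split can be simulated in a few lines. This is an illustrative sketch of weighted routing, not Vertex AI's actual router: each request is independently assigned to a version with probability proportional to its weight, just as an endpoint's `traffic_split` distributes requests.

```python
# Illustrative weighted routing: each request independently lands on a
# version with probability proportional to its weight. This mimics an
# Endpoint's traffic_split behaviour; it is NOT Vertex AI internals.
import random

def route(traffic_split: dict, rng: random.Random) -> str:
    """Pick a model version with probability proportional to its weight."""
    versions = list(traffic_split)
    weights = [traffic_split[v] for v in versions]
    return rng.choices(versions, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded for reproducibility
split = {"v1": 90, "v2": 10}
counts = {"v1": 0, "v2": 0}
for _ in range(10_000):
    counts[route(split, rng)] += 1

print(counts)  # roughly 9000 for v1 and 1000 for v2
```

Because only ~10% of requests hit `v2`, a broken `v2` degrades a tenth of traffic instead of all of it, which is the whole point of the canary.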
4. Code Example: Deploying a Model
```python
from google.cloud import aiplatform

# 1. Upload the model artifact
model = aiplatform.Model.upload(
    display_name="my-churn-model",
    artifact_uri="gs://my-bucket/model-output/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest",
)

# 2. Create an endpoint
endpoint = aiplatform.Endpoint.create(display_name="churn-endpoint")

# 3. Deploy the model to the endpoint (with traffic split)
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=10,  # autoscale up to 10 replicas
    traffic_percentage=100,
)

print(f"Endpoint resource name: {endpoint.resource_name}")
```
5. Summary
- Batch: Cheap, slow, bulk.
- Online: Fast, expensive, real-time.
- Endpoints: The abstraction that allows Traffic Splitting between multiple Model Versions.
In the next lesson, we tackle scalability: what happens on Black Friday? Scaling Online Serving.