Model Serving: Vertex AI Prediction


Batch vs. online prediction: how to deploy models to endpoints, manage versions, and optimize for latency.

The "Production" Moment

Training is hard. Serving is harder. If training fails, you try again tomorrow. If serving fails, the website goes down and customers leave.

Vertex AI Prediction is the managed service for serving. It handles:

  1. HTTP/gRPC Endpoints.
  2. Autoscaling (replicas scale up and down with traffic; note that an online endpoint keeps at least one replica running, so it does not scale to zero).
  3. Monitoring.

1. Batch vs. Online Prediction

| Feature  | Online Prediction                 | Batch Prediction                                             |
| -------- | --------------------------------- | ------------------------------------------------------------ |
| Latency  | Milliseconds                      | Hours/days                                                    |
| Input    | JSON/HTTP request                 | GCS files / BigQuery table                                    |
| Output   | JSON response                     | GCS files / BigQuery table                                    |
| Use case | Fraud detection, chatbots, search | Weekly sales forecast, sentiment analysis of archived emails  |
| Cost     | 24/7 server cost                  | Pay only for the duration of the job                          |

Exam Tip: If the scenario says "Immediate response" -> Online. If it says "Process overnight" or "Cost optimized" -> Batch.
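A batch job is the "process overnight" side of that tip. As a sketch (the bucket paths and display name here are placeholders, not real resources), a batch prediction job reads instances from GCS and writes results back to GCS:

```python
def choose_prediction_mode(needs_immediate_response: bool) -> str:
    """The exam-tip heuristic: immediate response -> online, else batch."""
    return "online" if needs_immediate_response else "batch"


def run_batch_scoring(model):
    """Submit a batch prediction job for an uploaded aiplatform.Model.

    Paths are hypothetical placeholders -- substitute your own bucket.
    """
    return model.batch_predict(
        job_display_name="churn-batch-scoring",
        gcs_source="gs://my-bucket/inputs/*.jsonl",
        gcs_destination_prefix="gs://my-bucket/outputs/",
        machine_type="n1-standard-4",
    )
```

No endpoint is created here: the job spins up workers, scores the files, writes the output, and tears everything down, which is exactly why batch is the cost-optimized option.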


2. Pre-built vs. Custom Containers

Just like Training, Serving supports:

  • Pre-built Containers: Upload your saved_model.pb and Vertex AI serves it using a standard TensorFlow Serving image. (Easy.)
  • Custom Containers: Package your own server (e.g. Flask/FastAPI) inside Docker. (Flexible; essential if you have custom C++ dependencies or preprocessing logic.)
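The contract for a custom container is simple: listen on the port Vertex AI passes in `AIP_HTTP_PORT`, answer health checks on `AIP_HEALTH_ROUTE`, and accept POSTed `{"instances": [...]}` payloads on `AIP_PREDICT_ROUTE`. A minimal sketch using only the standard library (the docs typically show Flask/FastAPI; the `score` "model" here is a stand-in for your real artifact):

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Vertex AI injects these environment variables into custom containers.
HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")
PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")
PORT = int(os.environ.get("AIP_HTTP_PORT", "8080"))


def score(instances):
    """Placeholder model: replace with your loaded artifact's predict call."""
    return [sum(row) for row in instances]


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Health check: Vertex AI polls this route before sending traffic.
        if self.path == HEALTH_ROUTE:
            self._reply(200, {"status": "ok"})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        if self.path != PREDICT_ROUTE:
            self._reply(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        # Vertex AI wraps inputs in an "instances" list and expects
        # a "predictions" list back.
        self._reply(200, {"predictions": score(body["instances"])})

    def _reply(self, code, payload):
        data = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)


def serve():
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()
```

Build this into a Docker image, point `serving_container_image_uri` at it when uploading the model, and Vertex AI handles the rest.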

3. Traffic Splitting (Canary Deployment)

You have Model v1 serving 100% traffic. You trained Model v2. Do not just swap them. If v2 is broken, you crash the system.

Vertex AI Endpoints allow Traffic Splitting.

  • Route 90% to v1.
  • Route 10% to v2.
  • Monitor v2 errors.
  • If good, roll out to 100%.
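The canary step above can be sketched with the SDK. `traffic_percentage` on `deploy` routes that share of requests to the newly deployed model and leaves the rest on the existing one; the helper below just encodes the invariant that an endpoint's traffic split must total 100 (model names and machine type are illustrative):

```python
def valid_traffic_split(split):
    """An endpoint's traffic split maps deployed-model IDs to integer
    percentages that must total exactly 100."""
    return (
        all(isinstance(v, int) and 0 <= v <= 100 for v in split.values())
        and sum(split.values()) == 100
    )


def canary_deploy(model_v2, endpoint):
    """Deploy v2 alongside v1, sending it 10% of traffic.

    model_v2 is an aiplatform.Model; endpoint already serves v1.
    """
    model_v2.deploy(
        endpoint=endpoint,
        machine_type="n1-standard-4",
        min_replica_count=1,
        traffic_percentage=10,  # remaining 90% stays on v1
    )
```

If v2's error rate looks healthy, you shift the split toward 100% for v2 and undeploy v1.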

4. Code Example: Deploying a Model

from google.cloud import aiplatform

# Initialize the SDK (project/region can also come from the environment)
aiplatform.init(project="my-project", location="us-central1")

# 1. Upload the Model Artifact to the Model Registry
model = aiplatform.Model.upload(
    display_name="my-churn-model",
    artifact_uri="gs://my-bucket/model-output/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest",
)

# 2. Create an Endpoint
endpoint = aiplatform.Endpoint.create(display_name="churn-endpoint")

# 3. Deploy Model to Endpoint (100% of traffic)
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=10,  # autoscaling upper bound
    traffic_percentage=100,
)

print(f"Endpoint resource name: {endpoint.resource_name}")
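Once deployed, clients call the endpoint with `endpoint.predict(instances=...)`. A sketch of the request side (the feature names and ordering helper are hypothetical; your model's input signature defines the real shape):

```python
def to_instances(feature_dicts, feature_order):
    """Hypothetical helper: flatten feature dicts into the ordered
    lists a tabular model typically expects as instances."""
    return [[d[name] for name in feature_order] for d in feature_dicts]


def request_prediction(endpoint, instances):
    """Call an aiplatform.Endpoint; each instance is one row to score."""
    response = endpoint.predict(instances=instances)
    return response.predictions


# Example payload construction (no network call):
rows = to_instances(
    [{"tenure": 12, "monthly_spend": 49.0}, {"tenure": 2, "monthly_spend": 9.0}],
    feature_order=["tenure", "monthly_spend"],
)
```

The endpoint wraps this list in the same `{"instances": [...]}` JSON body you would send by hand over HTTP.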

5. Summary

  • Batch: Cheap, slow, bulk.
  • Online: Fast, expensive, real-time.
  • Endpoints: The abstraction that allows Traffic Splitting between multiple Model Versions.

In the next lesson, we tackle scalability. What happens on Black Friday? Scaling Online Serving.


