
Online vs Batch: Choosing the Pattern
The architecture decision: when to use HTTP prediction versus batch jobs, and how to weigh the cost/latency trade-offs.
The Latency Equation
The first question in serving is: "Does the user need the answer NOW?"
1. Online Prediction (Synchronous)
- Experience: User clicks "Search"; results appear ~100 ms later.
- Tech: Vertex AI Endpoints (REST/gRPC); see the sketch after this list.
- Cost: You pay for the serving node around the clock while the model stays deployed (a dedicated endpoint keeps at least one replica warm unless it supports scale-to-zero).
- Format: JSON request/response payloads.
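A minimal sketch of an online call with the Vertex AI Python SDK, assuming a model is already deployed to an endpoint. The project, region, endpoint ID, and instance fields are placeholders; the instance schema depends on your model.

```python
# pip install google-cloud-aiplatform
from google.cloud import aiplatform

# Placeholders: substitute your own project, region, and endpoint ID.
aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint(endpoint_name="1234567890")

# One synchronous request; the caller blocks until the deployed
# model returns. Instance schema depends on your model.
response = endpoint.predict(
    instances=[{"query": "running shoes", "user_id": "u-42"}]
)
print(response.predictions)
```

The caller eats the full round-trip latency, which is why this pattern only makes sense when the user is actually waiting on the result.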
2. Batch Prediction (Asynchronous)
- Experience: Marketing team wants to "Score all 10 million users for Churn Risk" every Sunday night.
- Tech: Vertex AI Batch Prediction Job (see the sketch after this list).
- Cost: You pay only for the minutes the job runs.
- Format: GCS files (CSV/JSONL) or BigQuery tables.
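A minimal sketch of that Sunday-night scoring job, assuming the churn model is already uploaded as a Vertex AI Model resource and a BigQuery snapshot table holds yesterday's features. The model ID, project, and table names are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Placeholder model resource ID; replace with your uploaded Model's ID.
model = aiplatform.Model(model_name="4567890123")

# The job provisions workers, scores every row, writes the results to
# BigQuery, and tears everything down; you pay only for the run.
job = model.batch_predict(
    job_display_name="weekly-churn-scoring",
    bigquery_source="bq://my-project.analytics.users_snapshot",
    bigquery_destination_prefix="bq://my-project.analytics",
    instances_format="bigquery",
    predictions_format="bigquery",
    machine_type="n1-standard-4",
    starting_replica_count=1,
    max_replica_count=10,
    sync=False,  # return immediately; poll job.state for completion
)
```

Because the workers only exist for the duration of the job, scoring 10 million rows once a week costs a fraction of keeping an endpoint warm all week.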
3. Streaming Prediction (The Hybrid)
- Experience: An IoT sensor streams readings over Pub/Sub; we need to detect anomalies within seconds of arrival.
- Tech: Dataflow + Vertex AI Endpoint.
- Pattern: Dataflow reads from Pub/Sub, groups records into micro-batches, calls the Vertex AI endpoint, and writes the results to BigQuery (sketched below).
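A minimal Apache Beam sketch of that pattern, assuming JSON-encoded sensor messages and a pre-existing BigQuery table. The subscription, endpoint ID, and table names are placeholders, and the prediction schema depends on your model.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import aiplatform


class PredictBatch(beam.DoFn):
    """Scores a micro-batch of records against a Vertex AI endpoint."""

    def setup(self):
        # Placeholders: substitute your project, region, and endpoint ID.
        aiplatform.init(project="my-project", location="us-central1")
        self.endpoint = aiplatform.Endpoint(endpoint_name="1234567890")

    def process(self, batch):
        instances = [json.loads(record) for record in batch]
        response = self.endpoint.predict(instances=instances)
        for instance, score in zip(instances, response.predictions):
            yield {**instance, "anomaly_score": score}


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/sensor-sub")
        | "Decode" >> beam.Map(lambda raw: raw.decode("utf-8"))
        # Micro-batching amortizes the per-call overhead of the endpoint.
        | "MicroBatch" >> beam.BatchElements(min_batch_size=10,
                                             max_batch_size=64)
        | "Predict" >> beam.ParDo(PredictBatch())
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "my-project:iot.anomaly_scores",  # table assumed to exist
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```

The micro-batch size is the key tuning knob: larger batches cut per-call overhead, smaller batches cut end-to-end detection latency.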
Knowledge Check
A retail company wants to recommend products to users. They have 100 million users. The recommendations only need to be updated once a day based on yesterday's browsing history. Which serving pattern minimizes cost?