The Fuel of ML

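You cannot train a reliable model on "messy" data. Typical problems include:

Missing Values: a null is not the same as 0.
Outliers: implausible values such as Age = 200.
Format: inconsistent inputs, such as images in mixed JPEG and PNG formats with different sizes.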
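
To make the first two problems concrete, here is a minimal sketch in plain Python. The function name `clean_age` and the cutoff of 120 years are assumptions chosen for illustration, not part of any Google Cloud tool. It keeps a missing value as `None` instead of silently turning it into 0, and rejects implausible ages.

```python
from typing import Optional


def clean_age(raw: Optional[str], max_age: int = 120) -> Optional[int]:
    """Return a valid age, or None when the value is missing or implausible.

    max_age=120 is an illustrative threshold, not a standard.
    """
    # Missing value: keep it as None. Coercing it to 0 would invent a real age.
    if raw is None or raw.strip() == "":
        return None
    try:
        age = int(raw)
    except ValueError:
        return None  # not a number at all
    # Outlier: a negative age or Age = 200 is treated as invalid.
    if age < 0 or age > max_age:
        return None
    return age


rows = ["34", "", None, "0", "200"]
cleaned = [clean_age(r) for r in rows]
print(cleaned)  # [34, None, None, 0, None]
```

Note that the genuine value "0" survives as 0, while the empty and null entries stay `None`, so downstream code can still tell the two cases apart.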

For the exam, you need to know which tool to use to clean data based on the size and type of the data.

1. Tool Selection Matrix

Tool	Data Size	Data Type	Skill Required	Exam Keywords
BigQuery	Petabytes	Structured (SQL)	SQL	"Pre-processing in database", "Batch"
Dataflow	Petabytes	Streaming / Unstructured	Java/Python (Apache Beam)	"Windowing", "Real-time", "Streaming"
Dataprep	Gigabytes	Structured (CSV/JSON)	UI (No Code)	"Visual extraction", "Business Analyst"
Spark (Dataproc)	Petabytes	Legacy MapReduce	Python/Scala	"Lift and shift", "Hadoop migration"

Rule: If it's streaming -> Dataflow. If it's huge SQL data -> BigQuery.
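
As a study aid, the matrix and rule above can be written as a small decision function. This is only a sketch: the function `pick_tool`, its parameters, and the order in which the checks run are assumptions made for illustration, and real exam questions may combine several keywords.

```python
def pick_tool(
    streaming: bool,
    sql_structured: bool,
    no_code: bool = False,
    hadoop_migration: bool = False,
) -> str:
    """Map the keywords from the matrix to a tool name (hypothetical helper)."""
    if streaming:
        return "Dataflow"  # "Windowing", "Real-time", "Streaming"
    if hadoop_migration:
        return "Spark (Dataproc)"  # "Lift and shift", "Hadoop migration"
    if no_code:
        return "Dataprep"  # "Visual extraction", "Business Analyst"
    if sql_structured:
        return "BigQuery"  # "Pre-processing in database", "Batch"
    # Assumed fallback: batch work that is not SQL-shaped goes to Dataflow.
    return "Dataflow"


print(pick_tool(streaming=True, sql_structured=True))
print(pick_tool(streaming=False, sql_structured=True))
```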

2. Apache Beam & Dataflow

Dataflow is Google's fully managed service for running Apache Beam pipelines. It allows you to write a pipeline once and run it in Batch or Streaming mode.

Key Concept: Windowing

When processing an unbounded stream (for example, IoT sensor readings), an aggregate such as an "Average" is undefined until you specify the time window it covers.

Fixed Window: "Average per minute."
Sliding Window: "Average of the last 5 minutes, updated every minute."
Session Window: "Average during a user's active session (until they stop clicking)."
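
The three window types can be illustrated without any streaming framework. The sketch below is plain Python, not Beam code: the helper names and the sample events are assumptions made for illustration. In Apache Beam itself, the corresponding window functions are `FixedWindows`, `SlidingWindows`, and `Sessions` in `apache_beam.transforms.window`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sensor readings as (timestamp_in_seconds, value).
events = [(5, 10.0), (50, 20.0), (65, 30.0), (130, 50.0)]


def fixed_window_averages(events, size):
    """Each event falls into exactly one window [start, start + size)."""
    buckets = defaultdict(list)
    for ts, value in events:
        start = (ts // size) * size
        buckets[start].append(value)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}


def sliding_window_averages(events, size, period):
    """Windows of length `size` start every `period` seconds and overlap."""
    buckets = defaultdict(list)
    for ts, value in events:
        start = (ts // period) * period  # latest window start at or before ts
        while start > ts - size:  # window [start, start + size) contains ts
            buckets[start].append(value)
            start -= period
    return {start: mean(vals) for start, vals in sorted(buckets.items())}


def session_averages(events, gap):
    """A new session starts after `gap` or more seconds without an event."""
    sessions, current, last_ts = [], [], None
    for ts, value in sorted(events):
        if last_ts is not None and ts - last_ts >= gap:
            sessions.append(current)
            current = []
        current.append(value)
        last_ts = ts
    if current:
        sessions.append(current)
    return [mean(s) for s in sessions]


fixed = fixed_window_averages(events, size=60)
sliding = sliding_window_averages(events, size=120, period=60)
sessions = session_averages(events, gap=60)
print(fixed)     # {0: 15.0, 60: 30.0, 120: 50.0}
print(sliding)   # {-60: 15.0, 0: 20.0, 60: 40.0, 120: 50.0}
print(sessions)  # [20.0, 50.0]
```

The same four events produce three different answers, which is why the window definition has to be part of the feature definition.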

Warning: Features built in a streaming pipeline are a common source of Training-Serving Skew. You must ensure the logic used to create features in Dataflow at training time is exactly the same as the logic used at prediction time.
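
One common way to reduce this risk is to keep the feature logic in a single function that both paths call. The sketch below assumes hypothetical feature names (`log_amount`, `is_weekend`) and hypothetical wrapper functions. It shows the pattern only, not a Dataflow pipeline.

```python
import math


def make_features(raw: dict) -> dict:
    """Single source of truth for feature logic (hypothetical features)."""
    return {
        "log_amount": math.log1p(raw["amount"]),
        # Assumes day_of_week uses 0 = Monday ... 6 = Sunday.
        "is_weekend": 1 if raw["day_of_week"] in (5, 6) else 0,
    }


def build_training_rows(history: list) -> list:
    """Training path: apply the shared logic to every historical record."""
    return [make_features(record) for record in history]


def build_serving_row(request: dict) -> dict:
    """Serving path: apply the very same logic to one live request."""
    return make_features(request)


record = {"amount": 99.0, "day_of_week": 6}
train_row = build_training_rows([record])[0]
serve_row = build_serving_row(record)
print(train_row == serve_row)  # True
```

If the two paths instead reimplemented the logic separately (say, SQL for training and application code for serving), a small difference such as a different weekend definition would silently change the model's inputs.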

3. Vertex AI Feature Store

This is a critical exam topic. Problem:

Team A calculates "Average Purchase Value" for their model using Method X.
Team B calculates "Average Purchase Value" using Method Y.
Result: Confusion and duplicated compute cost.

Solution: The Feature Store.

It is a centralized repository for ML features.
Online Store: Low latency (Redis-like) for serving real-time predictions.
Offline Store: High capacity (BigQuery-like) for fetching historical training batches.

Point-in-Time Correctness

The Feature Store helps prevent "Data Leakage" by letting you ask: "What was the user's churn risk value AS OF last Tuesday?" It returns each feature value as it was at that timestamp, so training examples never see information from the future.
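
The idea behind an "as of" lookup can be sketched in a few lines of plain Python. This is not the Vertex AI Feature Store API: the `history` data and the helper `value_as_of` are assumptions made for illustration. The helper returns the latest value recorded at or before the requested timestamp and ignores anything written later.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical churn-risk history for one user, sorted by timestamp.
history = [
    (datetime(2024, 1, 1), 0.10),
    (datetime(2024, 1, 8), 0.35),
    (datetime(2024, 1, 15), 0.80),
]


def value_as_of(history, as_of):
    """Return the latest value with timestamp <= as_of, or None if none exists."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None  # nothing was known yet at that time
    return history[idx - 1][1]


print(value_as_of(history, datetime(2024, 1, 9)))    # 0.35
print(value_as_of(history, datetime(2023, 12, 31)))  # None
```

A training example labelled on January 9 gets 0.35, not the later 0.80. Using the newest value would leak information the model could not have had at prediction time.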

4. Code Example: TensorFlow Transform (TFT)

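For deep learning, you often preprocess inside the model graph. TensorFlow Transform (tft) runs as an Apache Beam pipeline (for example, on Dataflow): it first makes a full pass over the data to analyze it (find the min/max, build vocabularies) and then applies the resulting transformations.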

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Callback that TFT calls with a dict of input feature tensors."""
    outputs = {}

    # Scale the numerical feature to [0, 1] using the min/max found during analysis
    outputs['scaled_age'] = tft.scale_to_0_1(inputs['age'])

    # Build a vocabulary for the categorical feature and map each value to an integer ID
    outputs['city_id'] = tft.compute_and_apply_vocabulary(inputs['city'])

    return outputs

Why use TFT? Because the "artifacts" (like the vocabulary list or the min/max values) are saved with the model. When you deploy the model, it remembers, for example, that "Age 50" scales to "0.5" if the training ages ranged from 0 to 100. You don't need to manually scale inputs at prediction time.
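
The analyze-then-apply idea can be sketched without TensorFlow. The code below is a simplified stand-in, not how tft is implemented: the functions `analyze` and `transform` and the JSON artifact are assumptions made for illustration. It computes the min/max once on training data, saves them, and reuses the saved values at prediction time.

```python
import json


def analyze(ages):
    """Full pass over the training data: compute the statistics once."""
    return {"min": min(ages), "max": max(ages)}


def transform(age, artifacts):
    """Apply min-max scaling using the saved statistics, not fresh ones."""
    span = artifacts["max"] - artifacts["min"]
    if span == 0:
        return 0.0  # assumed convention for a constant column
    return (age - artifacts["min"]) / span


training_ages = [0, 25, 50, 100]
artifacts = analyze(training_ages)

# Persist the artifacts next to the model, then reload them for serving.
saved = json.dumps(artifacts)
loaded = json.loads(saved)

scaled = transform(50, loaded)
print(scaled)  # 0.5
```

At serving time only `transform` and the loaded artifacts are needed. The training data never has to be scanned again, and the scaling cannot drift from what the model saw during training.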

5. Summary

Dataflow is the heavy lifter for streaming and complex ETL.
Dataprep is for visual cleaning (small scale).
Feature Store ensures consistency between Training (Offline) and Serving (Online).
TF Transform prevents skew by baking preprocessing logic into the model graph.

In the next lesson, we start writing code in Vertex AI Workbench.


Data Preparation at Scale: Dataflow & Vertex AI