Data Preparation at Scale: Dataflow & Vertex AI

Data Preparation at Scale: Dataflow & Vertex AI

Data is 80% of ML. Learn how to execute ETL pipelines using BigQuery and Dataflow, and how to manage features using Vertex AI Feature Store.

The Fuel of ML

You cannot train a model on "messy" data.

  • Missing Values: (Null vs 0).
  • Outliers: (Age = 200).
  • Format: (Images in JPEG, PNG, inconsistent sizes).

For the exam, you need to know which tool to use to clean data based on the size and type of the data.


1. Tool Selection Matrix

ToolData SizeData TypeSkill RequiredExam Keywords
BigQueryPetabytesStructured (SQL)SQL"Pre-processing in database", "Batch"
DataflowPetabytesStreaming / UnstructuredJava/Python (Apache Beam)"Windowing", "Real-time", "Streaming"
DataprepGigabytesStructured (CSV/JSON)UI (No Code)"Visual extraction", "Business Analyst"
Spark (Dataproc)PetabytesLegacy MapReducePython/Scala"Lift and shift", "Hadoop migration"

Rule: If it's streaming -> Dataflow. If it's huge SQL data -> BigQuery.


2. Apache Beam & Dataflow

Dataflow is Google's fully managed service for running Apache Beam pipelines. It allows you to write a pipeline once and run it in Batch or Streaming mode.

Key Concept: Windowing

When processing a stream of infinite data (IoT sensors), you can't calculate an "Average" unless you define a time window.

  • Fixed Window: "Average per minute."
  • Sliding Window: "Average of the last 5 minutes, updated every minute."
  • Session Window: "Average during a user's active session (until they stop clicking)."

Warning: Training models on streaming data usually requires a Training-Serving Skew check. You must ensure the logic used to create features in Dataflow (Training) is exactly the same as the logic used in prediction.


3. Vertex AI Feature Store

This is a critical exam topic. Problem:

  • Team A calculates "Average Purchase Value" for their model using Method X.
  • Team B calculates "Average Purchase Value" using Method Y.
  • Result: Confusion and duplicated compute cost.

Solution: The Feature Store.

  • It is a centralized repository for ML features.
  • Online Store: Low latency (Redis-like) for serving real-time predictions.
  • Offline Store: High capacity (BigQuery-like) for fetching historical training batches.

Point-in-Time Correctness

The Feature Store solves "Data Leakage" by letting you ask: "What was the user's churn risk value AS OF last Tuesday?" It reconstructs the past state perfectly.


4. Code Example: TensorFlow Transform (TFT)

For deep learning, you often preprocess inside the graph. tft works with Dataflow to analyze the data (find min/max) and then apply it.

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Callback function for TFT"""
    outputs = {}
    
    # Scale numerical feature to 0-1
    outputs['scaled_age'] = tft.scale_to_0_1(inputs['age'])
    
    # Create a vocabulary for categorical feature
    outputs['city_id'] = tft.compute_and_apply_vocabulary(inputs['city'])
    
    return outputs

Why usage TFT? Because the "artifacts" (like the vocabulary list or the min/max values) are saved with the model. When you deploy the model, it remembers that "Age 50" scales to "0.5". You don't need to manually scale inputs at prediction time.


5. Summary

  • Dataflow is the heavy lifter for streaming and complex ETL.
  • Dataprep is for visual cleaning (small scale).
  • Feature Store ensures consistency between Training (Offline) and Serving (Online).
  • TF Transform prevents skew by baking preprocessing logic into the model graph.

In the next lesson, we start writing code. Vertex AI Workbench.


Knowledge Check

?Knowledge Check

You are designing a fraud detection system. You need to verify if a credit card transactionAmount is significantly higher than the user's average spending over the last 10 minutes. The transaction stream arrives via Pub/Sub. Which tool should you use to calculate this 'rolling average' feature in real-time?

Subscribe to our newsletter

Get the latest posts delivered right to your inbox.

Subscribe on LinkedIn