
Data Transformation: Cleaning & TF Transform
Dataflow is the engine, but what logic goes inside? Learn the difference between Instance-Level vs Full-Pass transformations and how to use TensorFlow Transform (TFT) to prevent skew.
Cleaning the Mess
Raw data is never ready for ML.
- Images: Different sizes.
- Text: HTML tags, casing.
- Numbers: Different scales (0-1 vs 0-1000).
1. Instance-Level vs. Full-Pass
This distinction determines which tool you use.
Instance-Level Transformation (The Easy One)
- Definition: You can process this row without seeing any other row.
- Examples: Resize an image, lowercase text, cast a string to a float.
- Tool: Dataflow `Map` function, BigQuery SQL.
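As a toy illustration (the `city` and `age` field names are made up here), instance-level logic is just a per-row function; in a Dataflow pipeline the same function would run inside `beam.Map`:

```python
def clean_row(row):
    # Instance-level: each row transforms independently.
    # No statistics from any other row are needed.
    return {
        'city': row['city'].lower(),   # lowercase text
        'age': float(row['age']),      # cast string to float
    }

# In Dataflow this would be: pcollection | beam.Map(clean_row)
print(clean_row({'city': 'Paris', 'age': '42'}))
```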
Full-Pass Transformation (The Hard One)
- Definition: You need to calculate statistics across the entire dataset first.
- Examples: Standard scaling (needs the mean/std of all rows), vocabulary generation (needs the list of all unique words).
- Tool: TensorFlow Transform (TFT).
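A minimal sketch (toy numbers, plain Python) of why standard scaling is full-pass: the mean and standard deviation must be computed over the entire column before any single row can be transformed:

```python
import statistics

ages = [20, 30, 40, 50, 60]  # toy training column

# Full pass: global stats over the WHOLE dataset come first
mean = statistics.mean(ages)    # 40
std = statistics.pstdev(ages)   # population std

def z_score(x):
    # Only now can an individual row be transformed
    return (x - mean) / std

scaled = [z_score(a) for a in ages]
```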
2. The Problem with Full-Pass & Skew
If you calculate the Mean of the training set as 50, you must use 50 to normalize the Serving data forever.
If you recalculate the Mean at serving time (where a request is just 1 row), the "mean" is simply that row's own value, so the normalized feature collapses to zero and the model's predictions are wrecked. This mismatch is training-serving skew.
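A tiny sketch of that failure mode, with made-up frozen stats: reusing the training mean/std yields a meaningful z-score, while "recomputing" the mean on a one-row request always yields zero:

```python
train_mean, train_std = 50.0, 10.0  # frozen from the training full pass

def serve_correct(x):
    # Correct: reuse the frozen training stats at serving time
    return (x - train_mean) / train_std

def serve_skewed(x):
    row_mean = x  # the "mean" of a single-row dataset is the row itself
    return (x - row_mean) / train_std  # always 0.0: the feature loses all signal

print(serve_correct(70.0))  # 2.0
print(serve_skewed(70.0))   # 0.0
```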
TensorFlow Transform (TFT) solves this.
- Analyze Phase: It runs a Dataflow job to calculate the global stats (Mean, Vocab).
- Transform Phase: It applies the stats to create the training data.
- Graph Surgery: It saves the stats (as constants) into the TensorFlow Graph.
- Result: The exported model expects Raw Data. It does the normalization internally using the frozen stats.
3. Code Example: TFT
```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    outputs = {}

    # Instance-level: needs only the current row
    outputs['lowercase_city'] = tf.strings.lower(inputs['city'])

    # Full-pass: tft.scale_to_z_score computes the dataset-wide
    # mean/std during the Analyze phase, then applies them here
    outputs['normalized_age'] = tft.scale_to_z_score(inputs['age'])

    # Full-pass: builds the vocabulary over all rows, then maps
    # each city string to its integer id
    outputs['city_id'] = tft.compute_and_apply_vocabulary(inputs['city'])

    return outputs
```
4. Visual Inspection (Dataprep)
If you don't know what's wrong with your data, use Dataprep by Trifacta.
- Visual: Click-and-drag UI.
- Discovery: It auto-detects outliers ("Hey, this Date format is different").
- Output: It compiles your clicks into a Dataflow job.
Exam Tip: Use Dataprep for Exploration and Ad-hoc cleaning. Use Dataflow/TFT for Automated Pipelines.
Knowledge Check
You are building a model that requires Z-Score normalization `(x - mean) / std`. You decide to calculate the mean and std in a Python script and hard-code them into your serving application. What is the risk?