Data Transformation: Cleaning & TF Transform

Dataflow is the engine, but what logic goes inside? Learn the difference between Instance-Level and Full-Pass transformations, and how to use TensorFlow Transform (TFT) to prevent training-serving skew.

Cleaning the Mess

Raw data is never ready for ML.

  • Images: Different sizes.
  • Text: HTML tags, casing.
  • Numbers: Different scales (0-1 vs 0-1000).

1. Instance-Level vs. Full-Pass

This distinction determines which tool you use.

Instance-Level Transformation (The Easy One)

  • Definition: You can process this row without seeing any other row.
  • Examples: Resize Image, Lowercase Text, Cast String to Float.
  • Tool: Dataflow Map function, BigQuery SQL.
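The defining property above can be shown in a few lines. This is a minimal pure-Python sketch (the `clean_row` helper and sample rows are invented for illustration); in a real pipeline the same logic would sit inside a Dataflow `Map` step, but nothing about it changes, because each row is processed without looking at any other row.

```python
def clean_row(row):
    """Instance-level transforms: lowercase text, cast string to float.
    No statistics, no other rows needed -- a plain map() is enough."""
    return {
        "city": row["city"].strip().lower(),
        "age": float(row["age"]),
    }

rows = [{"city": " Paris ", "age": "31"}, {"city": "TOKYO", "age": "45"}]
cleaned = [clean_row(r) for r in rows]
```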

Full-Pass Transformation (The Hard One)

  • Definition: You need to calculate statistics across the entire dataset first.
  • Examples: Standard Scaling (Needs Mean/Std of all rows), Vocabulary (Needs list of all unique words).
  • Tool: TensorFlow Transform (TFT).
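To see why full-pass is "the hard one", here is a hand-rolled z-score in plain Python (the `analyze`/`transform` function names are made up for this sketch; TFT's `tft.scale_to_z_score` does this for you at scale). Note the two distinct phases: you cannot scale even a single row until a full pass over the dataset has produced the mean and standard deviation.

```python
import math

def analyze(ages):
    """Full pass: mean and std require visiting every row in the dataset."""
    mean = sum(ages) / len(ages)
    std = math.sqrt(sum((a - mean) ** 2 for a in ages) / len(ages))
    return mean, std

def transform(age, mean, std):
    """Only after analysis can an individual row be scaled."""
    return (age - mean) / std

ages = [20.0, 40.0, 60.0]
mean, std = analyze(ages)           # phase 1: full pass
scaled = [transform(a, mean, std) for a in ages]  # phase 2: per-row
```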

2. The Problem with Full-Pass & Skew

If the training set's Mean is 50, you must use 50 to normalize the Serving data forever. If you instead recalculate the Mean on the Serving request (a batch of just 1 row), the "mean" is the row itself, so every input normalizes to zero, wrecking the model.

TensorFlow Transform (TFT) solves this.

  1. Analyze Phase: It runs a Dataflow job to calculate the global stats (Mean, Vocab).
  2. Transform Phase: It applies the stats to create the training data.
  3. Graph Surgery: It saves the stats (as constants) into the TensorFlow Graph.
  4. Result: The exported model expects Raw Data. It does the normalization internally using the frozen stats.
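The four steps above can be caricatured in plain Python (this is a conceptual sketch, not the TFT API; the `export_model` name is invented). The analysis result is baked into the serving function as a frozen constant, which is exactly what TFT's graph surgery does inside the TensorFlow graph, so the caller sends raw data:

```python
def export_model(train_ages):
    # Analyze phase: a full pass computes the global statistic
    # (in real TFT this is a Dataflow job).
    mean = sum(train_ages) / len(train_ages)

    # Graph surgery: the statistic is captured as a constant
    # inside the exported serving function.
    def serve(raw_age):
        return raw_age - mean  # model normalizes internally

    return serve

model = export_model([10.0, 20.0, 30.0])  # exported model carries mean=20.0
```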

3. Code Example: TFT

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    outputs = {}
    
    # Instance Level
    outputs['lowercase_city'] = tf.strings.lower(inputs['city'])
    
    # Full Pass (Requires Analysis of whole dataset)
    # tft.scale_to_z_score calculates Mean/Std automatically
    outputs['normalized_age'] = tft.scale_to_z_score(inputs['age'])
    
    # Full Pass
    outputs['city_id'] = tft.compute_and_apply_vocabulary(inputs['city'])
    
    return outputs

4. Visual Inspection (Dataprep)

If you don't know what's wrong with your data, use Dataprep by Trifacta.

  • Visual: Click-and-drag UI.
  • Discovery: It auto-detects outliers ("Hey, this Date format is different").
  • Output: It compiles your clicks into a Dataflow job.

Exam Tip: Use Dataprep for Exploration and Ad-hoc cleaning. Use Dataflow/TFT for Automated Pipelines.


Knowledge Check

You are building a model that requires Z-Score normalization `(x - mean) / std`. You decide to calculate the mean and std in a Python script and hard-code them into your serving application. What is the risk?
