
BigQuery ML: Feature Engineering
How to preprocess data using SQL. Learn to use the TRANSFORM clause, ML.Bucketing, ML.Scaling, and One-Hot Encoding directly in BigQuery.
Feature Engineering in SQL?
Yes.
Typically, you do feature engineering in Python (Pandas/Scikit-Learn).
But if your data is 1 Petabyte, Pandas crashes.
BigQuery ML allows you to do "Preprocessing at Scale" using the TRANSFORM clause.
1. The TRANSFORM Clause
This is the most critical BQML concept for the exam.
If you do SELECT log(income) FROM data outside the model, you have to remember to log the income again when you do prediction. This leads to Training-Serving Skew.
If you use TRANSFORM, the logic becomes part of the model artifact. When you predict, you just send raw income, and the model logs it automatically.
CREATE MODEL `dataset.model`
TRANSFORM(
ML.STANDARD_SCALER(income) OVER() AS scaled_income,
ML.BUCKETIZE(age, [10, 20, 30, 40]) OVER() AS age_bucket,
state -- Categorical passed through
)
OPTIONS(...)
AS SELECT income, age, state, label FROM `dataset.table`
2. Automatic vs Manual Preprocessing
Automatic (Default)
If you do nothing, BQML automatically:
- Standardizes numeric columns (Mean=0, Std=1).
- One-Hot Encodes string columns.
- Imputes NULL values (Mean for numeric, "NULL_CATEGORY" for strings).
Manual (Supported Functions)
ML.POLYNOMIAL_EXPAND: Create interaction terms (x1 * x2).ML.FEATURE_CROSS: Create crossed categories ("Country_City").ML.NGRAMS: Extract n-grams from text.ML.HASH_BUCKETIZE: Hash strings for high-cardinality features.
3. Visualizing Feature Crosses
A "Linear Model" cannot learn XOR problems. By crossing features, you allow a linear model to learn non-linear patterns.
Example:
- Lat: 30, Lon: 90 -> "Austin"
- Feature Cross:
CONCAT(Lat, "_", Lon) - The model learns that "30_90" is a high-rent area, which it couldn't learn from just "30" alone.
TRANSFORM(
ML.FEATURE_CROSS(STRUCT(state, zipcode)) as state_zip
)
4. Summary
- Use
TRANSFORMto bake preprocessing into the model. - BQML handles massive scalability for feature engineering.
- Feature Crosses unleash the power of linear models.
Knowledge Check
?Knowledge Check
You are building a BQML model. You used a SQL query to normalize the 'Price' column (`(Price - Mean)/Std`) BEFORE passing it to the `CREATE MODEL` statement. You deploy the model. Six months later, you start getting bad predictions. What happened?