
BigQuery ML: Machine Learning with SQL
Why move data when you can bring the model to the data? Learn to build Classification, Regression, and Time-Series models directly within BigQuery using standard SQL.
The Power of "In-Database" ML
In traditional ML workflows, you have to:
- Query the database.
- Export to CSV.
- Load into Python/Pandas.
- Train the model.
- Upload the model to a server.
This is slow, insecure, and limits you to the RAM on your laptop.
BigQuery ML (BQML) democratizes machine learning by allowing you to create and execute models using standard SQL queries. It runs serverlessly on BigQuery's massive compute cluster.
For the exam, BQML is almost always the right answer if the data is tabular, stored in BigQuery, and the use case is standard (Classification, Regression, Forecasting, Recommendation).
1. Supported Model Types
You don't just get linear regression. BQML is powerful.
| Model Type | SQL Keyword | Use Case |
|---|---|---|
| Linear Regression | LINEAR_REG | Predicting a number (House Price). |
| Logistic Regression | LOGISTIC_REG | Predicting a class (Spam vs. Not Spam). |
| K-Means Clustering | KMEANS | Customer Segmentation (Unsupervised). |
| Matrix Factorization | MATRIX_FACTORIZATION | Product Recommendations. |
| Time Series (ARIMA) | ARIMA_PLUS | Forecasting sales/traffic (handles seasonality). |
| Deep Neural Network | DNN_CLASSIFIER | Complex tabular data requiring hidden layers. |
| AutoML Tables | AUTOML_CLASSIFIER | Best possible accuracy (spins up backend AutoML). |
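Since forecasting appears constantly on the exam, here is a minimal `ARIMA_PLUS` sketch. The table and column names (`daily_sales`, `sale_date`, `total_sales`) are hypothetical placeholders:

```sql
-- Train a forecasting model (handles trend + seasonality automatically)
CREATE OR REPLACE MODEL `project.dataset.sales_forecast`
OPTIONS(
  model_type='ARIMA_PLUS',
  time_series_timestamp_col='sale_date',   -- the time axis
  time_series_data_col='total_sales'       -- the value to forecast
) AS
SELECT sale_date, total_sales
FROM `project.dataset.daily_sales`;

-- Generate a 30-day forecast with a 90% prediction interval
SELECT *
FROM ML.FORECAST(MODEL `project.dataset.sales_forecast`,
                 STRUCT(30 AS horizon, 0.9 AS confidence_level));
```

Note that forecasting uses `ML.FORECAST` rather than `ML.PREDICT`, since the model generates future values instead of scoring input rows.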
2. The Syntax: Create, Evaluate, Predict
The lifecycle of a BQML model follows three SQL keywords.
Step 1: CREATE MODEL (Training)
You select your features and label directly from the table.
-- Train a model to predict if a transaction is fraud
CREATE OR REPLACE MODEL `project.dataset.fraud_model`
OPTIONS(
model_type='LOGISTIC_REG',
input_label_cols=['is_fraud']
) AS
SELECT
amount,
merchant_id,
transaction_time,
is_fraud -- This is the label
FROM
`project.dataset.transactions`
WHERE
transaction_time < '2024-01-01' -- Train data
Step 2: ML.EVALUATE (Validation)
Check how well it works.
SELECT
*
FROM
ML.EVALUATE(MODEL `project.dataset.fraud_model`, (
SELECT
amount,
merchant_id,
transaction_time,
is_fraud
FROM
`project.dataset.transactions`
WHERE
transaction_time >= '2024-01-01' -- Test data
))
Output: Returns metrics like precision, recall, accuracy, roc_auc.
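For binary classifiers, `ML.EVALUATE` accepts an optional threshold argument (the default decision threshold is 0.5). A sketch of evaluating at a custom threshold:

```sql
-- Evaluate the same model, but classify as fraud only above 0.7 probability
SELECT
  *
FROM
  ML.EVALUATE(MODEL `project.dataset.fraud_model`, (
    SELECT amount, merchant_id, transaction_time, is_fraud
    FROM `project.dataset.transactions`
    WHERE transaction_time >= '2024-01-01'
  ), STRUCT(0.7 AS threshold))
```

Raising the threshold typically trades recall for precision, which matters when false positives (blocked legitimate transactions) are costly.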
Step 3: ML.PREDICT (Serving/Inference)
Generate predictions on new data.
SELECT
*
FROM
ML.PREDICT(MODEL `project.dataset.fraud_model`, (
SELECT
amount,
merchant_id,
transaction_time
FROM
`project.dataset.new_transactions` -- Live data
))
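`ML.PREDICT` returns your input columns plus new columns prefixed with `predicted_`. Assuming `is_fraud` is a BOOL label (an assumption; the column type comes from your source table), you could filter for flagged transactions like this:

```sql
-- Keep only rows the model flags as fraud
SELECT
  amount,
  merchant_id,
  predicted_is_fraud  -- BQML prefixes the label column with predicted_
FROM
  ML.PREDICT(MODEL `project.dataset.fraud_model`, (
    SELECT amount, merchant_id, transaction_time
    FROM `project.dataset.new_transactions`
  ))
WHERE predicted_is_fraud = TRUE
```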
3. Feature Engineering in BQML
Common exam trick: "How do you handle categorical variables in BQML?" Answer: You don't have to do anything. BQML automatically one-hot encodes string columns.
However, you can perform manual preprocessing using the TRANSFORM clause inside the CREATE MODEL statement. This ensures the same logic is applied during Training and Prediction (preventing Training-Serving Skew).
CREATE OR REPLACE MODEL `project.dataset.house_model`
TRANSFORM(
-- Scale a numeric feature (never the label)
ML.STANDARD_SCALER(square_feet) OVER() AS scaled_square_feet,
-- Bucketize a high-cardinality numeric feature
ML.QUANTILE_BUCKETIZE(zip_code, 50) OVER() AS bucketized_zip_code,
price -- The label must also pass through the TRANSFORM clause
)
OPTIONS(...)
AS SELECT ...
4. Visualizing the BQML Flow
graph LR
Data[(BigQuery Tables)] -->|SQL: CREATE MODEL| Engine[BQ Compute Engine]
Engine -->|Train| ModelArtifact[Saved Model Object]
NewData[(New Data)] -->|SQL: ML.PREDICT| ModelArtifact
ModelArtifact -->|Inference| Predictions[(Prediction Table)]
style Engine fill:#4285F4,stroke:#fff,stroke-width:2px,color:#fff
style ModelArtifact fill:#34A853,stroke:#fff,stroke-width:2px,color:#fff
5. Exporting Models
Sometimes you train in BQML but want to serve the model on a website (low latency) or on a mobile device. You can Export BQML models to TensorFlow SavedModel format.
# Terminal command to export
bq extract -m project:dataset.my_model gs://my-bucket/model_dir
- Then you can deploy this artifact to Vertex AI Prediction.
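One possible deployment path is uploading the exported SavedModel to the Vertex AI Model Registry with a prebuilt TensorFlow serving container. This is a sketch: the bucket path, display name, region, and container image tag are all placeholders you would adapt:

```shell
# Upload the exported SavedModel to Vertex AI (names are illustrative)
gcloud ai models upload \
  --region=us-central1 \
  --display-name=fraud-model \
  --artifact-uri=gs://my-bucket/model_dir \
  --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-11:latest
```

After upload, you would deploy the model to an endpoint to get a low-latency HTTPS prediction service.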
6. Summary
- No Data Movement: BQML brings the compute to the storage.
- Speed: Iteration is as fast as running a SQL query.
- Automatic Preprocessing: Handles one-hot encoding, null imputation, and scaling automatically (mostly).
- Versatile: Supports simple regression up to DNNs and ARIMA time series.
In the next lesson, we will look at Pre-trained APIs. What if you don't have any training data?
Knowledge Check
You are using BQML to create a sales forecast for the next 30 days based on historical sales data. The data contains seasonality (sales spike in December). Which model_type should you specify in the OPTIONS clause?