BigQuery ML: Machine Learning with SQL

Why move data when you can bring the model to the data? Learn to build Classification, Regression, and Time-Series models directly within BigQuery using standard SQL.

The Power of "In-Database" ML

In traditional ML workflows, you have to:

  1. Query the database.
  2. Export to CSV.
  3. Load into Python/Pandas.
  4. Train the model.
  5. Upload the model to a server.

This is slow, it moves sensitive data out of the warehouse, and it limits you to the RAM on your laptop.

BigQuery ML (BQML) democratizes machine learning by allowing you to create and execute models using standard SQL queries. It runs serverlessly on BigQuery's massive compute cluster.

For the exam, BQML is almost always the right answer if the data is tabular, stored in BigQuery, and the use case is standard (Classification, Regression, Forecasting, Recommendation).


1. Supported Model Types

You don't just get linear regression; BQML supports a wide range of model types out of the box.

Model Type | SQL Keyword | Use Case
Linear Regression | LINEAR_REG | Predicting a number (House Price).
Logistic Regression | LOGISTIC_REG | Predicting a class (Spam vs. Not Spam).
K-Means Clustering | KMEANS | Customer Segmentation (Unsupervised).
Matrix Factorization | MATRIX_FACTORIZATION | Product Recommendations.
Time Series (ARIMA) | ARIMA_PLUS | Forecasting sales/traffic (handles seasonality).
Deep Neural Network | DNN_CLASSIFIER | Complex tabular data requiring hidden layers.
AutoML Tables | AUTOML_CLASSIFIER | Best possible accuracy (spins up backend AutoML).
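
The keyword goes in the OPTIONS clause when you create the model. As a minimal sketch of the ARIMA_PLUS case (the `project.dataset.daily_sales` table and its sale_date / daily_sales columns are illustrative, not from a real dataset):

-- Forecast the next 30 days of daily sales (illustrative table/column names)
CREATE OR REPLACE MODEL `project.dataset.sales_forecast`
OPTIONS(
  model_type='ARIMA_PLUS',
  time_series_timestamp_col='sale_date',
  time_series_data_col='daily_sales',
  horizon=30
) AS
SELECT
  sale_date,
  daily_sales
FROM
  `project.dataset.daily_sales`

Once trained, ML.FORECAST(MODEL `project.dataset.sales_forecast`, STRUCT(30 AS horizon)) returns the forecast along with confidence intervals.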

2. The Syntax: Create, Evaluate, Predict

The lifecycle of a BQML model follows three SQL keywords.

Step 1: CREATE MODEL (Training)

You select your features and label directly from the table.

-- Train a model to predict if a transaction is fraud
CREATE OR REPLACE MODEL `project.dataset.fraud_model`
OPTIONS(
  model_type='LOGISTIC_REG',
  input_label_cols=['is_fraud']
) AS
SELECT
  amount,
  merchant_id,
  transaction_time,
  is_fraud  -- This is the label
FROM
  `project.dataset.transactions`
WHERE
  transaction_time < '2024-01-01' -- Train data
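
After training finishes, you can inspect how it went with ML.TRAINING_INFO, which returns one row per iteration (loss, learning rate, duration) for the model created above:

SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `project.dataset.fraud_model`)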

Step 2: ML.EVALUATE (Validation)

Check how well the model performs on held-out data.

SELECT
  *
FROM
  ML.EVALUATE(MODEL `project.dataset.fraud_model`, (
    SELECT
      amount,
      merchant_id,
      transaction_time,
      is_fraud
    FROM
      `project.dataset.transactions`
    WHERE
      transaction_time >= '2024-01-01' -- Test data
  ))

Output: Returns metrics like precision, recall, accuracy, roc_auc.
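
For a classifier you can also break the errors down with ML.CONFUSION_MATRIX, optionally passing a decision threshold (0.5 here) as the third argument:

SELECT
  *
FROM
  ML.CONFUSION_MATRIX(MODEL `project.dataset.fraud_model`, (
    SELECT
      amount,
      merchant_id,
      transaction_time,
      is_fraud
    FROM
      `project.dataset.transactions`
    WHERE
      transaction_time >= '2024-01-01'
  ), STRUCT(0.5 AS threshold))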

Step 3: ML.PREDICT (Serving/Inference)

Generate predictions on new data.

SELECT
  *
FROM
  ML.PREDICT(MODEL `project.dataset.fraud_model`, (
    SELECT
      amount,
      merchant_id,
      transaction_time
    FROM
      `project.dataset.new_transactions` -- Live data
  ))
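
The output includes the input columns plus predicted_is_fraud and predicted_is_fraud_probs. For a binary logistic regression model you can also override the default 0.5 decision threshold, for example:

SELECT
  predicted_is_fraud,
  predicted_is_fraud_probs
FROM
  ML.PREDICT(MODEL `project.dataset.fraud_model`, (
    SELECT
      amount,
      merchant_id,
      transaction_time
    FROM
      `project.dataset.new_transactions`
  ), STRUCT(0.7 AS threshold))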

3. Feature Engineering in BQML

Common exam trick: "How do you handle categorical variables in BQML?" Answer: Usually, you don't have to do anything. BQML automatically one-hot encodes string (categorical) columns for you.

However, you can perform manual preprocessing using the TRANSFORM clause inside the CREATE MODEL statement. This ensures the same logic is applied during Training and Prediction (preventing Training-Serving Skew).

CREATE OR REPLACE MODEL `project.dataset.house_model`
TRANSFORM(
  -- Preprocessing baked into the model, applied at training AND prediction time
  ML.STANDARD_SCALER(square_feet) OVER() AS scaled_square_feet,
  ML.QUANTILE_BUCKETIZE(zip_code, 10) OVER() AS zip_code_bucket,
  price  -- The label passes through untouched
)
OPTIONS(
  model_type='LINEAR_REG',
  input_label_cols=['price']
) AS
SELECT
  square_feet,
  zip_code,
  price
FROM
  `project.dataset.houses`
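
At prediction time you pass the raw columns; BQML re-applies the TRANSFORM logic before scoring, which is what prevents the skew. A sketch against a hypothetical `new_listings` table:

SELECT
  *
FROM
  ML.PREDICT(MODEL `project.dataset.house_model`, (
    SELECT
      square_feet,
      zip_code
    FROM
      `project.dataset.new_listings`
  ))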

4. Visualizing the BQML Flow

graph LR
    Data[(BigQuery Tables)] -->|SQL: CREATE MODEL| Engine[BQ Compute Engine]
    Engine -->|Train| ModelArtifact[Saved Model Object]
    
    NewData[(New Data)] -->|SQL: ML.PREDICT| ModelArtifact
    ModelArtifact -->|Inference| Predictions[(Prediction Table)]
    
    style Engine fill:#4285F4,stroke:#fff,stroke-width:2px,color:#fff
    style ModelArtifact fill:#34A853,stroke:#fff,stroke-width:2px,color:#fff

5. Exporting Models

Sometimes you train in BQML but want to serve the model on a website (low latency) or on a mobile device. You can Export BQML models to TensorFlow SavedModel format.

# Terminal command to export
bq extract -m project:dataset.my_model gs://my-bucket/model_dir

  • Then you can deploy this artifact to Vertex AI Prediction for low-latency online serving.
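
One way to register the exported artifact is via gcloud, using one of the prebuilt TensorFlow serving containers (the region, display name, and exact container tag below are illustrative and should be adjusted):

# Register the exported SavedModel with Vertex AI (illustrative values)
gcloud ai models upload \
  --region=us-central1 \
  --display-name=bqml-fraud-model \
  --artifact-uri=gs://my-bucket/model_dir \
  --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest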

6. Summary

  • No Data Movement: BQML brings the compute to the storage.
  • Speed: Iteration is as fast as running a SQL query.
  • Automatic Preprocessing: Handles one-hot encoding, null imputation, and scaling automatically (mostly).
  • Versatile: Supports simple regression up to DNNs and ARIMA time series.

In the next lesson, we will look at Pre-trained APIs. What if you don't have any training data?


Knowledge Check

You are using BQML to create a sales forecast for the next 30 days based on historical sales data. The data contains seasonality (sales spike in December). Which model_type should you specify in the OPTIONS clause?
