
BigQuery ML: Machine Learning with SQL
Why move data when you can bring the model to the data? Learn to build Classification, Regression, and Time-Series models directly within BigQuery using standard SQL.
The Power of "In-Database" ML
In traditional ML workflows, you have to:
- Query the database.
- Export to CSV.
- Load into Python/Pandas.
- Train the model.
- Upload the model to a server.
This is slow, insecure, and limits you to the RAM on your laptop.
BigQuery ML (BQML) democratizes machine learning by allowing you to create and execute models using standard SQL queries. It runs serverlessly on BigQuery's massive compute cluster.
For the exam, BQML is almost always the right answer if the data is tabular, stored in BigQuery, and the use case is standard (Classification, Regression, Forecasting, Recommendation).
1. Supported Model Types
You don't just get linear regression. BQML is powerful.
| Model Type | SQL Keyword | Use Case |
|---|---|---|
| Linear Regression | LINEAR_REG | Predicting a number (House Price). |
| Logistic Regression | LOGISTIC_REG | Predicting a class (Spam vs. Not Spam). |
| K-Means Clustering | KMEANS | Customer Segmentation (Unsupervised). |
| Matrix Factorization | MATRIX_FACTORIZATION | Product Recommendations. |
| Time Series (ARIMA) | ARIMA_PLUS | Forecasting sales/traffic (handles seasonality). |
| Deep Neural Network | DNN_CLASSIFIER | Complex tabular data requiring hidden layers. |
| AutoML Tables | AUTOML_CLASSIFIER | Best possible accuracy (spins up backend AutoML). |
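Since forecasting appears constantly on the exam, here is a minimal `ARIMA_PLUS` sketch. The table and column names (`daily_sales`, `sale_date`, `total_sales`) are hypothetical placeholders:

```sql
-- Train a forecasting model (handles trend + seasonality automatically)
CREATE OR REPLACE MODEL `project.dataset.sales_forecast`
OPTIONS(
  model_type='ARIMA_PLUS',
  time_series_timestamp_col='sale_date',   -- the time axis
  time_series_data_col='total_sales'       -- the value to forecast
) AS
SELECT sale_date, total_sales
FROM `project.dataset.daily_sales`;

-- Generate a 30-day forecast with a 90% prediction interval
SELECT *
FROM ML.FORECAST(MODEL `project.dataset.sales_forecast`,
                 STRUCT(30 AS horizon, 0.9 AS confidence_level));
```

Note that forecasting uses `ML.FORECAST` rather than `ML.PREDICT`, since the model generates future values instead of scoring input rows.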
2. The Syntax: Create, Evaluate, Predict
The lifecycle of a BQML model follows three SQL keywords.
Step 1: CREATE MODEL (Training)
You select your features and label directly from the table.
-- Train a model to predict if a transaction is fraud
CREATE OR REPLACE MODEL `project.dataset.fraud_model`
OPTIONS(
model_type='LOGISTIC_REG',
input_label_cols=['is_fraud']
) AS
SELECT
amount,
merchant_id,
transaction_time,
is_fraud -- This is the label
FROM
`project.dataset.transactions`
WHERE
transaction_time < '2024-01-01' -- Train data
Step 2: ML.EVALUATE (Validation)
Check how well it works.
SELECT
*
FROM
ML.EVALUATE(MODEL `project.dataset.fraud_model`, (
SELECT
amount,
merchant_id,
transaction_time,
is_fraud
FROM
`project.dataset.transactions`
WHERE
transaction_time >= '2024-01-01' -- Test data
))
Output: Returns metrics like precision, recall, accuracy, roc_auc.
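For binary classifiers, `ML.EVALUATE` accepts an optional threshold argument (the default decision threshold is 0.5). A sketch of evaluating at a custom threshold:

```sql
-- Evaluate the same model, but classify as fraud only above 0.7 probability
SELECT
  *
FROM
  ML.EVALUATE(MODEL `project.dataset.fraud_model`, (
    SELECT amount, merchant_id, transaction_time, is_fraud
    FROM `project.dataset.transactions`
    WHERE transaction_time >= '2024-01-01'
  ), STRUCT(0.7 AS threshold))
```

Raising the threshold typically trades recall for precision, which matters when false positives (blocked legitimate transactions) are costly.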
Step 3: ML.PREDICT (Serving/Inference)
Generate predictions on new data.
SELECT
*
FROM
ML.PREDICT(MODEL `project.dataset.fraud_model`, (
SELECT
amount,
merchant_id,
transaction_time
FROM
`project.dataset.new_transactions` -- Live data
))
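`ML.PREDICT` returns your input columns plus new columns prefixed with `predicted_`. Assuming `is_fraud` is a BOOL label (an assumption; the column type comes from your source table), you could filter for flagged transactions like this:

```sql
-- Keep only rows the model flags as fraud
SELECT
  amount,
  merchant_id,
  predicted_is_fraud  -- BQML prefixes the label column with predicted_
FROM
  ML.PREDICT(MODEL `project.dataset.fraud_model`, (
    SELECT amount, merchant_id, transaction_time
    FROM `project.dataset.new_transactions`
  ))
WHERE predicted_is_fraud = TRUE
```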
3. Feature Engineering in BQML
Common exam trick: "How do you handle categorical variables in BQML?" Answer: You don't have to do anything. BQML automatically one-hot encodes string columns.
However, you can perform manual preprocessing using the TRANSFORM clause inside the CREATE MODEL statement. This ensures the same logic is applied during Training and Prediction (preventing Training-Serving Skew).
CREATE OR REPLACE MODEL `project.dataset.house_model`
TRANSFORM(
-- Scale a numeric feature (never the label)
ML.STANDARD_SCALER(square_feet) OVER() AS scaled_square_feet,
-- Bucketize a high-cardinality numeric feature
ML.QUANTILE_BUCKETIZE(zip_code, 50) OVER() AS bucketized_zip_code,
price -- The label must also pass through the TRANSFORM clause
)
OPTIONS(...)
AS SELECT ...
4. Visualizing the BQML Flow
graph LR
Data[(BigQuery Tables)] -->|SQL: CREATE MODEL| Engine[BQ Compute Engine]
Engine -->|Train| ModelArtifact[Saved Model Object]
NewData[(New Data)] -->|SQL: ML.PREDICT| ModelArtifact
ModelArtifact -->|Inference| Predictions[(Prediction Table)]
style Engine fill:#4285F4,stroke:#fff,stroke-width:2px,color:#fff
style ModelArtifact fill:#34A853,stroke:#fff,stroke-width:2px,color:#fff
5. Exporting Models
Sometimes you train in BQML but want to serve the model on a website (low latency) or on a mobile device. You can Export BQML models to TensorFlow SavedModel format.
# Terminal command to export
bq extract -m project:dataset.my_model gs://my-bucket/model_dir
- Then you can deploy this artifact to Vertex AI Prediction.
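One possible deployment path is uploading the exported SavedModel to the Vertex AI Model Registry with a prebuilt TensorFlow serving container. This is a sketch: the bucket path, display name, region, and container image tag are all placeholders you would adapt:

```shell
# Upload the exported SavedModel to Vertex AI (names are illustrative)
gcloud ai models upload \
  --region=us-central1 \
  --display-name=fraud-model \
  --artifact-uri=gs://my-bucket/model_dir \
  --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-11:latest
```

After upload, you would deploy the model to an endpoint to get a low-latency HTTPS prediction service.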
6. Summary
- No Data Movement: BQML brings the compute to the storage.
- Speed: Iteration is as fast as running a SQL query.
- Automatic Preprocessing: Handles one-hot encoding, null imputation, and scaling automatically (mostly).
- Versatile: Supports simple regression up to DNNs and ARIMA time series.
In the next lesson, we will look at Pre-trained APIs. What if you don't have any training data?
Knowledge Check
You are using BQML to create a sales forecast for the next 30 days based on historical sales data. The data contains seasonality (sales spike in December). Which model_type should you specify in the OPTIONS clause?