
AutoML: High Quality, Low Code
How to train custom models without writing training loops. We cover AutoML for Vision, Tables, and Text, and how to prepare your data for success.
The "Middle Ground" Framework
We have seen standard APIs (Lesson 2.2) and SQL-based ML (Lesson 2.1). AutoML is the solution when:
- APIs fail: The pre-trained model doesn't know your specific data (e.g., distinguishing between a "Healthy Leaf" and a "Diseased Leaf").
- Custom Code is too hard: You don't have a team of PhDs to tweak hyperparameters for a month.
AutoML (now part of Vertex AI) uses Google's Neural Architecture Search (NAS) to automatically find the best model architecture for your data.
1. Supported Data Types in Vertex AI AutoML
| Data Type | Task | Example |
|---|---|---|
| Image | Classification, Object Detection | Finding defects in manufacturing parts. |
| Video | Classification, Object Tracking | Identifying a specific player in a soccer game. |
| Text | Classification, Entity Extraction, Sentiment | Classifying legal contracts by "Jurisdiction". |
| Tabular | Classification, Regression, Forecasting | Predicting customer churn (Rows & Columns). |
Note: AutoML Tabular is famously powerful. It has performed competitively against human data scientists in Kaggle competitions because it automatically ensembles multiple model families (gradient-boosted decision trees, deep networks, cross-networks).
2. The AutoML Workflow
The exam tests you on the process, particularly the data requirements.
```mermaid
graph TD
    Data[Raw Data] --> Label["Labeling (Ground Truth)"]
    Label --> Import["Dataset Import (Vertex AI)"]
    Import --> Train[AutoML Training Job]
    Train --> Eval[Evaluation Metrics]
    Eval --> Deploy[Deploy to Endpoint]
    style Train fill:#4285F4,stroke:#fff,stroke-width:2px,color:#fff
```
Step 1: Data Preparation & Labeling
- Split: Vertex AI automatically splits your data (80% Train, 10% Validate, 10% Test) unless you specify the splits yourself, e.g., via a manual split column (see the import-file sketch after this list).
- Minimums:
- Vision: 100 images per label (minimum), 1000+ (recommended).
- Tabular: 1000 rows minimum.
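To make the import step concrete, here is a minimal sketch for an image classification dataset. The bucket, file names, and labels are placeholders; the optional first CSV column (ML_USE) is how you override the automatic split.

```python
# data.csv: one row per image. The optional first column (ML_USE)
# pins a row to TRAIN / VALIDATION / TEST instead of the automatic split:
#
#   TRAIN,gs://my-bucket/leaves/img_001.jpg,healthy
#   TEST,gs://my-bucket/leaves/img_002.jpg,diseased

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Create the Vertex AI dataset from the import file above.
dataset = aiplatform.ImageDataset.create(
    display_name="leaf-disease-dataset",
    gcs_source="gs://my-bucket/data.csv",
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
)
```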
Step 2: Training Budget
You don't choose a "Learning Rate." You choose "Node Hours."
- You tell Vertex AI: "Run for a maximum of 2 node hours."
- Vertex AI will try hundreds of architectures. If it finds a great one in 1 hour, it stops early (Early Stopping) to save you money. The snippet below shows how this budget is expressed in the SDK.
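One gotcha: the SDK expresses this budget in milli node hours. A minimal sketch of the conversion; recent SDK versions also accept a `disable_early_stopping` flag on the training job's `run()` method if you want to force the full budget to be spent.

```python
# The SDK expresses budgets in milli node hours:
# 1 node hour = 1,000 milli node hours.
hours = 2
budget_milli_node_hours = hours * 1000  # "run for a maximum of 2 node hours"
```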
Step 3: Evaluation
Vertex AI generates a dashboard with the following (also retrievable via the SDK, as sketched after this list):
- Confusion Matrix: Where is the model getting confused? (e.g., confusing "Dog" with "Wolf").
- Feature Importance: Which columns mattered most? (Tabular models.)
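A minimal sketch for pulling these metrics programmatically, assuming an already-trained model resource; `list_model_evaluations()` exists on `aiplatform.Model`, though the exact shape of the metrics payload varies by model type:

```python
from google.cloud import aiplatform

# Placeholder: the resource name or ID of a trained AutoML model.
model = aiplatform.Model("MODEL_RESOURCE_NAME")

# AutoML attaches at least one evaluation to each trained model.
for evaluation in model.list_model_evaluations():
    # For classification models this payload includes fields such as
    # the confusion matrix and precision/recall.
    print(evaluation.metrics)
```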
3. Best Practices for High Accuracy
The exam loves asking why an AutoML model failed.
- Unbalanced Data: If 99% of your images are "Healthy" and 1% are "Defect", AutoML can score 99% accuracy by predicting "Healthy" every time, while catching zero defects.
- Fix: Add more "Defect" images or use class weighting.
- Data Leakage: Do not include "Future Information" in your training data.
- Example: Trying to predict "Will Purchase?", but including "Delivery Date" as a feature. If there is a delivery date, they already purchased!
- Split Method: For time series, never use a random split. You must use a chronological split (train on Jan-Nov, test on Dec), as sketched below.
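A minimal pandas sketch of the last two fixes, using a hypothetical orders table with an `order_date` column, a leaky `delivery_date` column, and 2024 as the assumed year:

```python
import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Fix data leakage: delivery_date only exists for customers who already
# purchased, so it cannot be used as a feature.
df = df.drop(columns=["delivery_date"])

# Chronological split: train on Jan-Nov, test on Dec.
# Never shuffle time-series data into a random split.
train_df = df[df["order_date"] < "2024-12-01"]
test_df = df[df["order_date"] >= "2024-12-01"]
```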
4. Code Example: Launching an AutoML Job
While you usually use the UI, you can automate this via the SDK.
```python
from google.cloud import aiplatform

def create_automl_job(project_id, display_name, dataset_id):
    aiplatform.init(project=project_id)

    # job.run() expects a Dataset object, not a raw ID string.
    dataset = aiplatform.ImageDataset(dataset_id)

    job = aiplatform.AutoMLImageTrainingJob(
        display_name=display_name,
        prediction_type="classification",
        multi_label=False,
    )

    model = job.run(
        dataset=dataset,
        model_display_name="my-flower-model",
        budget_milli_node_hours=2000,  # 2 node hours
    )

    print(f"Model training finished: {model.resource_name}")
    return model
```
5. Summary
- AutoML trades Compute Cost for Human Time: it spends a lot of machine compute to save you from manual tuning.
- Tabular holds the crown for structured data performance.
- Data Quality is the bottleneck. AutoML cannot fix bad labels ("Garbage In, Garbage Out").
In the next lesson, we start the "hard" stuff: MLOps. How do we convert raw data into features at scale? Enter Dataflow and Dataprep.
Knowledge Check
You are training an AutoML Vision model to detect 5 different species of birds. Your dataset has 1000 images of Sparrows, but only 10 images of Eagles. Additional Eagle images are impossible to obtain. What is the best strategy to improve model performance for Eagles?