
AutoML: High Quality, Low Code
How to train custom models without writing training loops. We cover AutoML for Vision, Tables, and Text, and how to prepare your data for success.
The "Middle Ground" Framework
We have seen standard APIs (Lesson 2.2) and SQL-based ML (Lesson 2.1). AutoML is the solution when:
- APIs fail: The pre-trained model doesn't know your specific data (e.g., distinguishing between a "Healthy Leaf" and a "Diseased Leaf").
- Custom Code is too hard: You don't have a team of PhDs to tweak hyperparameters for a month.
AutoML (now part of Vertex AI) uses Google's Neural Architecture Search (NAS) to automatically find the best model architecture for your data.
1. Supported Data Types in Vertex AI AutoML
| Data Type | Task | Example |
|---|---|---|
| Image | Classification, Object Detection | Finding defects in manufacturing parts. |
| Video | Classification, Object Tracking | Identifying a specific player in a soccer game. |
| Text | Classification, Entity Extraction, Sentiment | Classifying legal contracts by "Jurisdiction". |
| Tabular | Classification, Regression, Forecasting | Predicting customer churn (Rows & Columns). |
Note: AutoML Tabular is famously powerful. It has performed competitively against human data scientists in Kaggle competitions because it automatically ensembles multiple model families (gradient-boosted decision trees, deep networks, cross-networks).
2. The AutoML Workflow
The exam tests you on the process, particularly the data requirements.
```mermaid
graph TD
    Data[Raw Data] --> Label["Labeling (Ground Truth)"]
    Label --> Import["Dataset Import (Vertex AI)"]
    Import --> Train[AutoML Training Job]
    Train --> Eval[Evaluation Metrics]
    Eval --> Deploy[Deploy to Endpoint]
    style Train fill:#4285F4,stroke:#fff,stroke-width:2px,color:#fff
```
Step 1: Data Preparation & Labeling
- Split: Vertex AI automatically splits your data (80% Train, 10% Validate, 10% Test) unless you specify the splits yourself, e.g., via a manual split column (see the import-file sketch after this list).
- Minimums:
- Vision: 100 images per label (minimum), 1000+ (recommended).
- Tabular: 1000 rows minimum.
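To make the import step concrete, here is a minimal sketch for an image classification dataset. The bucket, file names, and labels are placeholders; the optional first CSV column (ML_USE) is how you override the automatic split.

```python
# data.csv: one row per image. The optional first column (ML_USE)
# pins a row to TRAIN / VALIDATION / TEST instead of the automatic split:
#
#   TRAIN,gs://my-bucket/leaves/img_001.jpg,healthy
#   TEST,gs://my-bucket/leaves/img_002.jpg,diseased

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Create the Vertex AI dataset from the import file above.
dataset = aiplatform.ImageDataset.create(
    display_name="leaf-disease-dataset",
    gcs_source="gs://my-bucket/data.csv",
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
)
```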
Step 2: Training Budget
You don't choose a "Learning Rate." You choose "Node Hours."
- You tell Vertex AI: "Run for a maximum of 2 node hours."
- Vertex AI will try hundreds of architectures. If it finds a great one in 1 hour, it stops early (Early Stopping) to save you money. The snippet below shows how this budget is expressed in the SDK.
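One gotcha: the SDK expresses this budget in milli node hours. A minimal sketch of the conversion; recent SDK versions also accept a `disable_early_stopping` flag on the training job's `run()` method if you want to force the full budget to be spent.

```python
# The SDK expresses budgets in milli node hours:
# 1 node hour = 1,000 milli node hours.
hours = 2
budget_milli_node_hours = hours * 1000  # "run for a maximum of 2 node hours"
```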
Step 3: Evaluation
Vertex AI generates a dashboard with the following (also retrievable via the SDK, as sketched after this list):
- Confusion Matrix: Where is the model getting confused? (e.g., confusing "Dog" with "Wolf").
- Feature Importance: Which columns mattered most? (Tabular models.)
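A minimal sketch for pulling these metrics programmatically, assuming an already-trained model resource; `list_model_evaluations()` exists on `aiplatform.Model`, though the exact shape of the metrics payload varies by model type:

```python
from google.cloud import aiplatform

# Placeholder: the resource name or ID of a trained AutoML model.
model = aiplatform.Model("MODEL_RESOURCE_NAME")

# AutoML attaches at least one evaluation to each trained model.
for evaluation in model.list_model_evaluations():
    # For classification models this payload includes fields such as
    # the confusion matrix and precision/recall.
    print(evaluation.metrics)
```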
3. Best Practices for High Accuracy
The exam loves asking why an AutoML model failed.
- Unbalanced Data: If 99% of your images are "Healthy" and 1% are "Defect", AutoML can score 99% accuracy by predicting "Healthy" every time, while catching zero defects.
- Fix: Add more "Defect" images or use class weighting.
- Data Leakage: Do not include "Future Information" in your training data.
- Example: Trying to predict "Will Purchase?", but including "Delivery Date" as a feature. If there is a delivery date, they already purchased!
- Split Method: For time series, never use a random split. You must use a chronological split (train on Jan-Nov, test on Dec), as sketched below.
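A minimal pandas sketch of the last two fixes, using a hypothetical orders table with an `order_date` column, a leaky `delivery_date` column, and 2024 as the assumed year:

```python
import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Fix data leakage: delivery_date only exists for customers who already
# purchased, so it cannot be used as a feature.
df = df.drop(columns=["delivery_date"])

# Chronological split: train on Jan-Nov, test on Dec.
# Never shuffle time-series data into a random split.
train_df = df[df["order_date"] < "2024-12-01"]
test_df = df[df["order_date"] >= "2024-12-01"]
```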
4. Code Example: Launching an AutoML Job
While you usually use the UI, you can automate this via the SDK.
```python
from google.cloud import aiplatform

def create_automl_job(project_id, display_name, dataset_id):
    aiplatform.init(project=project_id)

    # job.run() expects a Dataset object, not a raw ID string.
    dataset = aiplatform.ImageDataset(dataset_id)

    job = aiplatform.AutoMLImageTrainingJob(
        display_name=display_name,
        prediction_type="classification",
        multi_label=False,
    )

    model = job.run(
        dataset=dataset,
        model_display_name="my-flower-model",
        budget_milli_node_hours=2000,  # 2 node hours
    )

    print(f"Model training finished: {model.resource_name}")
    return model
```
5. Summary
- AutoML trades Compute Cost for Human Time: it spends a lot of machine compute to save you from manual tuning.
- Tabular holds the crown for structured data performance.
- Data Quality is the bottleneck. AutoML cannot fix bad labels ("Garbage In, Garbage Out").
In the next lesson, we start the "hard" stuff: MLOps. How do we convert raw data into features at scale? Enter Dataflow and Dataprep.
Knowledge Check
You are training an AutoML Vision model to detect 5 different species of birds. Your dataset has 1000 images of Sparrows, but only 10 images of Eagles. Additional Eagle images are impossible to obtain. What is the best strategy to improve model performance for Eagles?