
AutoML: Evaluation & Debugging
Your AutoML model is trained. Is it good? This lesson covers interpreting Confusion Matrices, Precision/Recall curves, and Feature Importance to fix underperforming models.
Reading the Dashboard
AutoML is "Black Box" training, but "Glass Box" evaluation. Vertex AI provides a rich dashboard to inspect the model.
1. Classification Metrics
For Multiclass Classification (e.g., Cat, Dog, Mouse), the most important tool is the Confusion Matrix.
| | Predicted Cat | Predicted Dog | Predicted Mouse |
|---|---|---|---|
| Actual Cat | 50 (Correct) | 5 (Error) | 0 |
| Actual Dog | 2 | 48 (Correct) | 0 |
| Actual Mouse | 0 | 20 | 30 (Poor) |
- Diagnosis: The model confuses Mice with Dogs 40% of the time.
- Action: Look at the "Mouse" images. Are they blurry? Do they look like small dogs? Add more distinct Mouse examples.
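The diagnosis above comes from reading the matrix row by row. A minimal sketch (using the toy numbers from the table, not the Vertex AI API) of turning a raw confusion matrix into per-class recall, which surfaces exactly this kind of weakness:

```python
# Rows = actual class, columns = predicted class, as in the table above.
labels = ["Cat", "Dog", "Mouse"]
matrix = [
    [50, 5, 0],   # Actual Cat
    [2, 48, 0],   # Actual Dog
    [0, 20, 30],  # Actual Mouse
]

def per_class_recall(matrix, labels):
    """Return {label: fraction of that class predicted correctly}."""
    recalls = {}
    for i, row in enumerate(matrix):
        total = sum(row)
        recalls[labels[i]] = row[i] / total if total else 0.0
    return recalls

print(per_class_recall(matrix, labels))
# Mouse recall is 30/50 = 0.6, exposing the 40% Mouse -> Dog confusion
```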
Precision vs. Recall Threshold
You can adjust the Confidence Threshold slider in the UI.
- High Threshold (0.9): High Precision. The model only makes a prediction when it is very confident. (Use for: Medical Diagnosis, where false positives are dangerous).
- Low Threshold (0.1): High Recall. The model flags everything, even at the cost of false alarms. (Use for: A security camera trying to find intruders).
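The slider trades one metric for the other. A toy sketch of that trade-off (the scores and labels here are made up for illustration; this is not the Vertex AI SDK):

```python
# Each pair is (model confidence score, is the example actually positive?).
predictions = [(0.95, True), (0.85, True), (0.70, False),
               (0.40, True), (0.20, False), (0.10, False)]

def precision_recall(predictions, threshold):
    """Compute (precision, recall) at a given confidence threshold."""
    predicted_pos = [p for p in predictions if p[0] >= threshold]
    true_pos = sum(1 for score, actual in predicted_pos if actual)
    actual_pos = sum(1 for score, actual in predictions if actual)
    precision = true_pos / len(predicted_pos) if predicted_pos else 1.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

print(precision_recall(predictions, 0.9))  # high threshold: fewer, surer calls
print(precision_recall(predictions, 0.1))  # low threshold: catches everything
```

Raising the threshold shrinks the set of positive calls (precision up, recall down); lowering it does the reverse.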
2. Regression Metrics
For predicting continuous values (e.g., house prices):
- RMSE (Root Mean Square Error): Penalizes outliers heavily. Use if big mistakes are bad.
- MAE (Mean Absolute Error): Average dollar error. Easier to explain to business ("We are off by $5k on average").
- MAPE (Percentage Error): "We are off by 5%." Good for comparing across different value scales.
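The three metrics can be computed by hand to see how they differ. A sketch on assumed house-price numbers (chosen so one prediction has a large miss):

```python
import math

# Toy data: one prediction is off by $50k, the rest by $10k.
actual = [300_000, 450_000, 250_000, 500_000]
predicted = [310_000, 440_000, 260_000, 450_000]

errors = [p - a for a, p in zip(actual, predicted)]

# RMSE squares each error first, so the $50k miss dominates.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
# MAE averages absolute errors: the plain "off by $X on average" number.
mae = sum(abs(e) for e in errors) / len(errors)
# MAPE expresses each error relative to the actual value, in percent.
mape = sum(abs(e) / a for a, e in zip(actual, errors)) / len(errors) * 100

print(f"RMSE: ${rmse:,.0f}  MAE: ${mae:,.0f}  MAPE: {mape:.1f}%")
```

Note how RMSE comes out well above MAE on the same data: that gap is the outlier penalty in action.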
3. Feature Importance (Tabular)
AutoML tells you which columns drove the prediction.
- Global Importance: "Overall, 'Income' is the #1 predictor."
- Local Importance: "For this specific customer, 'Age' was the #1 predictor."
Debugging Tip:
If the #1 feature is something like User_ID or Transaction_ID (Random Unique Identifiers), you have Data Leakage. The model memorized the IDs. Remove that column and retrain.
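One cheap pre-training check for this failure mode: an ID-like column has (nearly) as many unique values as the dataset has rows. A hypothetical sketch (the column names and threshold are illustrative, not from any AutoML API):

```python
# Toy tabular rows; User_ID is unique per row and therefore a leakage suspect.
rows = [
    {"User_ID": "u1", "Income": 50_000},
    {"User_ID": "u2", "Income": 60_000},
    {"User_ID": "u3", "Income": 50_000},
]

def suspect_id_columns(rows, uniqueness_threshold=0.95):
    """Flag columns whose unique-value ratio is suspiciously close to 1.0."""
    suspects = []
    for col in rows[0]:
        unique_ratio = len({r[col] for r in rows}) / len(rows)
        if unique_ratio >= uniqueness_threshold:
            suspects.append(col)
    return suspects

print(suspect_id_columns(rows))  # ['User_ID']
```

A model can "memorize" such a column perfectly on training data while learning nothing that transfers to new rows, which is why dropping it and retraining is the fix.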
4. Summary
- Use the Confusion Matrix to find class-specific errors.
- Adjust the Threshold based on business needs (Precision vs Recall).
- Check Feature Importance to catch leakage.
Knowledge Check
You trained an AutoML fraud detection model. The 'Feature Importance' chart shows that the column `transaction_timestamp` is the most important feature (99% importance). Why is this likely a problem?