
Capstone Project: End-to-End Predictive Maintenance
Design a full ML system for a manufacturing plant. Ingest sensor data, train a forecasting model, deploy via CI/CD, and monitor for drift.
The Final Challenge: "FactoryGuard"
You are the Lead ML Engineer for a car manufacturer. Goal: Predict machine failure 24 hours in advance so maintenance can fix it.
1. Architecture Design
We need to connect: Sensors -> Pub/Sub -> Dataflow -> Vertex AI.
```mermaid
graph TD
    Sensors[IoT Sensors] -->|MQTT| PubSub[Cloud Pub/Sub]
    PubSub -->|Stream| Dataflow[Cloud Dataflow]
    Dataflow -->|Raw Data| BQ[(BigQuery Historic Data)]
    Dataflow -->|Features| FS[Vertex Feature Store Online]
    subgraph "Training Pipeline (Weekly)"
        BQ -->|Export| Train[Vertex AI Training XGBoost]
        Train -->|Model| Registry[Model Registry]
    end
    subgraph "Serving (Real-time)"
        PubSub -->|Realtime Event| Endpoint[Vertex Endpoints]
        Endpoint -.->|Fetch History| FS
        Endpoint -->|Prediction| Alert[Maintenance App]
    end
    style Sensors fill:#FFD700,stroke:#333,stroke-width:2px,color:#000
    style Endpoint fill:#34A853,stroke:#fff,stroke-width:2px,color:#fff
```
2. Implementation Steps
Step 1: Data Ingestion (Streaming)
- Tool: Dataflow.
- Logic: Calculate "Avg Temp Last 1 Hour" (Windowing). Write features to Feature Store.
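The windowing logic above can be sketched in plain Python. In production this would be an Apache Beam pipeline on Dataflow with event-time windowing over Pub/Sub messages; the `(timestamp, temperature)` pair format below is an assumption for illustration:

```python
from collections import deque

def hourly_avg_temp(readings, window_secs=3600):
    """Compute 'Avg Temp Last 1 Hour' for each incoming reading.

    `readings` is an iterable of (unix_timestamp, temperature) pairs,
    assumed to arrive in timestamp order (a hypothetical format; a real
    Dataflow job would use sliding windows over the Pub/Sub stream).
    Yields (timestamp, rolling_average) pairs.
    """
    window = deque()   # readings inside the last hour
    total = 0.0        # running sum of temperatures in the window
    for ts, temp in readings:
        window.append((ts, temp))
        total += temp
        # Evict readings older than one hour.
        while window and window[0][0] <= ts - window_secs:
            _, old_temp = window.popleft()
            total -= old_temp
        yield ts, total / len(window)

# Example: three readings, one minute apart
stream = [(0, 20.0), (60, 22.0), (120, 24.0)]
print(list(hourly_avg_temp(stream)))
# -> [(0, 20.0), (60, 21.0), (120, 22.0)]
```

Each rolling average would then be written to the Feature Store so the serving endpoint can fetch it at prediction time.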
Step 2: Modeling (Tabular)
- Choice: XGBoost (gradient-boosted trees).
- Hardware: Standard CPU (n1-standard-16). No GPU is needed for tabular XGBoost unless the dataset is massive.
- Hyperparameter Tuning: Use Vertex AI Vizier to tune `max_depth` and `learning_rate`.
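The tuning loop Vizier runs can be approximated locally with simple random search; Vizier itself uses smarter (Bayesian) strategies, but the interface is the same idea: sample parameters, evaluate, keep the best. The objective below is a stand-in for a real XGBoost validation score:

```python
import random

def tune(objective, n_trials=20, seed=42):
    """Random search over the two hyperparameters from Step 2."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "max_depth": rng.randint(3, 10),             # tree depth
            "learning_rate": 10 ** rng.uniform(-3, -1),  # log-uniform sampling
        }
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in objective: pretend validation AUC peaks at depth 6, lr 0.05.
# (In practice this function would train XGBoost and return its AUC.)
def fake_auc(p):
    return 1.0 - abs(p["max_depth"] - 6) * 0.01 - abs(p["learning_rate"] - 0.05)

params, score = tune(fake_auc)
print(params, round(score, 3))
```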
Step 3: Deployment (CI/CD)
- Trigger: Cloud Build on `git push` or a weekly schedule.
- Canary: Deploy to 10% of traffic. If the failure rate spikes, roll back.
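The canary logic above can be sketched as two small functions: a deterministic 10% traffic router and a promote/rollback decision. The thresholds (3x the baseline failure rate, 100-request minimum) are illustrative assumptions, not Vertex AI defaults:

```python
def route(request_id, canary_percent=10):
    """Send ~10% of traffic to the canary: ids ending 00-09."""
    return "canary" if request_id % 100 < canary_percent else "stable"

def canary_decision(canary_failures, canary_requests,
                    baseline_failure_rate=0.01, tolerance=3.0,
                    min_requests=100):
    """Promote the canary, roll it back, or keep waiting.

    Roll back if the canary's failure rate exceeds `tolerance` times
    the baseline rate (assumed thresholds for illustration).
    """
    if canary_requests < min_requests:
        return "wait"  # not enough traffic to judge yet
    rate = canary_failures / canary_requests
    return "rollback" if rate > tolerance * baseline_failure_rate else "promote"

print(canary_decision(10, 200))  # 5% failures vs 1% baseline -> "rollback"
print(canary_decision(2, 200))   # 1% failures -> "promote"
```

On Vertex AI the split itself is expressed via the endpoint's traffic-split settings rather than request-id hashing; the decision logic would live in the Cloud Build pipeline or a monitoring job.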
Step 4: Monitoring (Drift)
- Metric: Feature drift on `Temperature`.
- Scenario: Winter arrives and the sensor baseline drops by 10 degrees.
- Action: Drift detection triggers the automatic retraining pipeline.
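A minimal sketch of the drift check, assuming summer training statistics are known. Vertex AI Model Monitoring uses distribution-distance metrics for this; the mean-shift z-score below is a simplified stand-in for the same idea:

```python
def detect_drift(baseline_mean, baseline_std, recent_values, z_threshold=3.0):
    """Flag drift when a feature's recent mean moves more than
    `z_threshold` baseline standard deviations away.

    The threshold and the summary-statistics approach are illustrative
    assumptions, not the exact metric Vertex AI computes.
    """
    recent_mean = sum(recent_values) / len(recent_values)
    z = abs(recent_mean - baseline_mean) / baseline_std
    return z > z_threshold

# Summer baseline: mean 70 F, std 2. Winter readings sit ~10 degrees lower.
winter = [59.5, 60.2, 61.0, 59.8, 60.5]
print(detect_drift(70.0, 2.0, winter))  # -> True: trigger retraining
```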
3. Success Criteria
- Latency: Prediction < 100ms (Using Feature Store for fast lookups).
- Reliability: Automated Retraining handles seasonality.
- Governance: Lineage tracking shows exactly which training run produced the current model.
Conclusion
You have now seen the full picture: BigQuery ML for quick prototypes, TPUs for massive training, and Pipelines for automation. You are ready for the Google Cloud Professional ML Engineer exam.
Good luck!
Knowledge Check
In the FactoryGuard architecture, why do we write streaming features to the Vertex AI Feature Store instead of directly to BigQuery?