Capstone Project: End-to-End Predictive Maintenance

Design a full ML system for a manufacturing plant. Ingest sensor data, train a forecasting model, deploy via CI/CD, and monitor for drift.

The Final Challenge: "FactoryGuard"

You are the Lead ML Engineer for a car manufacturer. Goal: predict machine failure 24 hours in advance so the maintenance team can intervene before the machine breaks down.


1. Architecture Design

We need to connect: Sensors -> Pub/Sub -> Dataflow -> Vertex AI.

graph TD
    Sensors[IoT Sensors] -->|MQTT| PubSub[Cloud Pub/Sub]
    PubSub -->|Stream| Dataflow[Cloud Dataflow]

    Dataflow -->|Raw Data| BQ[(BigQuery Historic Data)]
    Dataflow -->|Features| FS[Vertex Feature Store Online]

    subgraph "Training Pipeline (Weekly)"
        BQ -->|Export| Train[Vertex AI Training XGBoost]
        Train -->|Model| Registry[Model Registry]
    end
    subgraph "Serving (Real-time)"
        PubSub -->|Realtime Event| Endpoint[Vertex Endpoints]
        Endpoint -.->|Fetch History| FS
        Endpoint -->|Prediction| Alert[Maintenance App]
    end

    style Sensors fill:#FFD700,stroke:#333,stroke-width:2px,color:#000
    style Endpoint fill:#34A853,stroke:#fff,stroke-width:2px,color:#fff

2. Implementation Steps

Step 1: Data Ingestion (Streaming)

  • Tool: Dataflow.
  • Logic: Calculate "Avg Temp Last 1 Hour" (windowing) and write the resulting features to the Feature Store. A minimal Beam sketch follows this list.
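
A minimal Apache Beam sketch of that windowing logic, assuming a JSON message schema and a placeholder Pub/Sub topic; the real pipeline would end in the Feature Store ingestion API rather than print:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

def parse_event(msg: bytes):
    # Expects JSON like {"machine_id": "m-0042", "temperature": 71.3} (assumed schema).
    event = json.loads(msg.decode("utf-8"))
    return event["machine_id"], float(event["temperature"])

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadSensors" >> beam.io.ReadFromPubSub(topic="projects/PROJECT/topics/sensor-events")
        | "Parse" >> beam.Map(parse_event)
        # 1-hour sliding window, refreshed every 5 minutes.
        | "Window1h" >> beam.WindowInto(window.SlidingWindows(size=3600, period=300))
        | "AvgTempPerMachine" >> beam.combiners.Mean.PerKey()
        # Placeholder sink: the real pipeline writes these (machine_id, avg_temp) pairs
        # to the Vertex AI Feature Store so the endpoint can look them up at serving time.
        | "Emit" >> beam.Map(print)
    )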

Step 2: Modeling (Tabular)

  • Choice: XGBoost (Gradient Boosted Trees).
  • Hardware: Standard CPU (n1-standard-16). No GPU is needed for tabular XGBoost unless the dataset is massive.
  • Hyperparameter Tuning: Use Vertex AI Vizier to tune max_depth and learning_rate (a tuning-job sketch follows this list).
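
A hedged sketch of that Vizier tuning job with the google-cloud-aiplatform SDK; the project, staging bucket, container image, metric name, and parameter ranges are assumptions, and the training container is assumed to report the metric via the cloudml-hypertune library:

from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

aiplatform.init(project="PROJECT", location="us-central1", staging_bucket="gs://PROJECT-staging")

# CPU-only worker, matching the n1-standard-16 choice above.
worker_pool_specs = [{
    "machine_spec": {"machine_type": "n1-standard-16"},
    "replica_count": 1,
    "container_spec": {"image_uri": "us-docker.pkg.dev/PROJECT/factoryguard/train:latest"},
}]

custom_job = aiplatform.CustomJob(
    display_name="factoryguard-xgb-train",
    worker_pool_specs=worker_pool_specs,
)

tuning_job = aiplatform.HyperparameterTuningJob(
    display_name="factoryguard-xgb-tuning",
    custom_job=custom_job,
    metric_spec={"auc_pr": "maximize"},  # metric name is an assumption
    parameter_spec={
        "max_depth": hpt.IntegerParameterSpec(min=3, max=10, scale="linear"),
        "learning_rate": hpt.DoubleParameterSpec(min=0.01, max=0.3, scale="log"),
    },
    max_trial_count=20,
    parallel_trial_count=4,
)
tuning_job.run()

Vizier handles the search strategy; the training code only needs to accept the two hyperparameters as flags and report the metric back.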

Step 3: Deployment (CI/CD)

  • Trigger: Cloud Build on git push or a weekly schedule.
  • Canary: Deploy the new model to 10% of traffic. If the failure rate spikes, roll back (see the sketch after this list).
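
With the Vertex AI SDK, the canary split might look like the sketch below (resource IDs, display names, and the machine type are placeholders; in practice this would run as a step inside the Cloud Build pipeline):

from google.cloud import aiplatform

aiplatform.init(project="PROJECT", location="us-central1")

endpoint = aiplatform.Endpoint("projects/PROJECT/locations/us-central1/endpoints/ENDPOINT_ID")
candidate = aiplatform.Model("projects/PROJECT/locations/us-central1/models/MODEL_ID")

# Deploy the new model next to the current one and send it 10% of traffic;
# the existing deployed model keeps the remaining 90%.
endpoint.deploy(
    model=candidate,
    deployed_model_display_name="factoryguard-canary",
    machine_type="n1-standard-4",
    min_replica_count=1,
    traffic_percentage=10,
)

# After watching error rates, either promote the canary to 100% of traffic
# via endpoint.update(traffic_split=...) or undeploy it to roll back.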

Step 4: Monitoring (Drift)

  • Metric: Feature Drift on Temperature.
  • Scenario: Winter arrives. Sensor baseline drops by 10 degrees.
  • Action: Drift detection triggers the automated retraining pipeline (a monitoring-job sketch follows this list).
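
A sketch of that drift monitor using the SDK's model_monitoring module (endpoint ID, threshold, sampling rate, and alert email are assumptions). Note that the monitoring job only raises an alert; wiring the alert to a retraining run, for example via Cloud Logging, Pub/Sub, and a Cloud Function that launches the pipeline, is a separate step:

from google.cloud import aiplatform
from google.cloud.aiplatform import model_monitoring

aiplatform.init(project="PROJECT", location="us-central1")

endpoint = aiplatform.Endpoint("projects/PROJECT/locations/us-central1/endpoints/ENDPOINT_ID")

# Alert when the live distribution of "temperature" drifts past the threshold.
objective_config = model_monitoring.ObjectiveConfig(
    drift_detection_config=model_monitoring.DriftDetectionConfig(
        drift_thresholds={"temperature": 0.003},
    ),
)

aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="factoryguard-drift-monitor",
    endpoint=endpoint,
    objective_configs=objective_config,
    logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.8),
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=1),  # hours between checks
    alert_config=model_monitoring.EmailAlertConfig(user_emails=["oncall@example.com"]),
)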

3. Success Criteria

  1. Latency: Predictions return in < 100 ms (the Feature Store provides fast online lookups; a read sketch follows this list).
  2. Reliability: Automated retraining handles seasonality.
  3. Governance: Lineage tracking shows exactly which training run produced the current model.
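
To make the latency budget concrete, the online Feature Store lookup at serving time might look like this; the featurestore, entity type, and feature IDs are hypothetical names for this project:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT", location="us-central1")

# Fetch the latest engineered features for one machine just before calling the model.
featurestore = aiplatform.Featurestore(featurestore_name="factoryguard_fs")
machine = featurestore.get_entity_type(entity_type_id="machine")

features = machine.read(
    entity_ids="machine_0042",
    feature_ids=["avg_temp_1h", "vibration_rms"],
)
print(features)  # small pandas DataFrame; online reads are built for low-latency serving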

Conclusion

You have now seen the full picture: BigQuery ML for quick prototypes, TPUs for massive training runs, and Vertex AI Pipelines for automation. You are ready for the Google Cloud Professional Machine Learning Engineer exam.

Good luck!


Knowledge Check

In the FactoryGuard architecture, why do we write streaming features to the Vertex AI Feature Store instead of directly to BigQuery?
