Detecting Training-Serving Skew

How to detect and prevent training-serving skew. A guide to using TensorFlow Data Validation (TFDV) to compare your training and serving data.

The Sneakiest Bug in ML

Training-serving skew is a subtle but common problem in ML systems. It occurs when the data a model receives at serving time differs from the data it was trained on, whether in structure, preprocessing, or distribution. Often nothing crashes: the model keeps returning predictions while its performance drops significantly.


1. Causes of Training-Serving Skew

There are two main causes of training-serving skew:

  • Schema Skew: This occurs when your training and serving data do not follow the same schema. For example, a feature that is present in the training data might be missing, renamed, or stored as a different type in the serving data.
  • Distribution Skew: This occurs when the values of your features are distributed differently in training and serving data. For example, you might train your model on data from one country but then serve it to users in a different country.
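The schema-skew case can be illustrated without any library. The sketch below is plain Python, not TFDV: `infer_schema`, `schema_skew`, and the sample rows are hypothetical names and data chosen for illustration, and the type inference is deliberately crude (it records the type name of the first non-null value it sees).

```python
from typing import Any

Record = dict[str, Any]


def infer_schema(records: list[Record]) -> dict[str, str]:
    """Map each feature name to the type name of its first non-null value."""
    schema: dict[str, str] = {}
    for record in records:
        for name, value in record.items():
            if value is not None and name not in schema:
                schema[name] = type(value).__name__
    return schema


def schema_skew(train: list[Record], serving: list[Record]) -> dict[str, list[str]]:
    """Report features that are missing, unexpected, or differently typed at serving time."""
    train_schema = infer_schema(train)
    serving_schema = infer_schema(serving)
    shared = train_schema.keys() & serving_schema.keys()
    return {
        "missing_in_serving": sorted(train_schema.keys() - serving_schema.keys()),
        "unexpected_in_serving": sorted(serving_schema.keys() - train_schema.keys()),
        "type_mismatch": sorted(n for n in shared if train_schema[n] != serving_schema[n]),
    }


# Hypothetical sample rows: "city" is dropped, "garage" is new, "sqft" arrives as a string.
train_rows = [{"sqft": 1200.0, "bedrooms": 3, "city": "Austin"}]
serving_rows = [{"sqft": "1200", "bedrooms": 3, "garage": True}]

report = schema_skew(train_rows, serving_rows)
print(report)
```

A check like this only catches structural differences. Two datasets can share an identical schema and still be distribution-skewed, which is why the two cases are listed separately.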

2. Detecting Training-Serving Skew

A well-established way to detect training-serving skew is TensorFlow Data Validation (TFDV), an open-source library that computes and compares dataset statistics. TFDV can be used to:

  • Generate descriptive statistics for your training and serving data.
  • Compare the statistics of your training and serving data to identify any differences.
  • Infer a schema from your training data and then use it to validate your serving data.
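To make the "compare the statistics" step concrete, here is a minimal sketch in plain Python rather than TFDV itself. It computes the L-infinity distance between the value frequencies of a categorical feature, which is the kind of measure TFDV's skew comparator applies to categorical features. The function names, the `country` sample data, and the threshold of 0.1 are assumptions made for illustration.

```python
from collections import Counter


def value_frequencies(values: list[str]) -> dict[str, float]:
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def linf_distance(train: list[str], serving: list[str]) -> float:
    """Largest absolute gap in relative frequency across all observed values."""
    p = value_frequencies(train)
    q = value_frequencies(serving)
    return max(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in p.keys() | q.keys())


# Hypothetical "country" feature: training is mostly US, serving is mostly DE.
train_country = ["US"] * 8 + ["DE"] * 2
serving_country = ["US"] * 3 + ["DE"] * 7

SKEW_THRESHOLD = 0.1  # illustrative value; tune per feature
distance = linf_distance(train_country, serving_country)
skewed = distance > SKEW_THRESHOLD
print(f"L-infinity distance: {distance:.2f}, skewed: {skewed}")
```

The threshold is a per-feature judgment call: set it too low and normal day-to-day variation raises alerts, set it too high and real shifts go unnoticed.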

3. Preventing Training-Serving Skew

The best way to prevent training-serving skew is to use a single, unified data pipeline for both training and serving. Because the same preprocessing and feature engineering code then runs in both places, the two paths cannot silently drift apart.
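One way to picture a unified pipeline is a single preprocessing function that both the offline training job and the online serving path import. The sketch below assumes a hypothetical house-price model; the feature names (`size_sqft`, `bedrooms`, `bathrooms`) and the transformations are illustrative only.

```python
import math


def preprocess(raw: dict[str, float]) -> list[float]:
    """Single source of truth for feature engineering (hypothetical features)."""
    return [
        math.log1p(raw["size_sqft"]),                  # compress the long tail of house sizes
        raw["bedrooms"] / max(raw["bathrooms"], 1.0),  # bedroom-to-bathroom ratio
    ]


def build_training_matrix(rows: list[dict[str, float]]) -> list[list[float]]:
    """Offline path: applies preprocess() to every historical row."""
    return [preprocess(row) for row in rows]


def build_serving_features(request: dict[str, float]) -> list[float]:
    """Online path: applies the same preprocess() to a single request."""
    return preprocess(request)


house = {"size_sqft": 1500.0, "bedrooms": 3.0, "bathrooms": 2.0}
same = build_training_matrix([house])[0] == build_serving_features(house)
print(f"training and serving features match: {same}")
```

The point of the design is that there is no second copy of the transformation logic to keep in sync. A change to `preprocess` reaches both paths at once.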

If you cannot use a single data pipeline, use a tool like TFDV to validate your serving data against your training data, so that differences are flagged before they affect the model's predictions.


Knowledge Check

You are training a model to predict the price of a house. You train the model on a dataset that includes a feature for the size of the house in square feet. You then serve the model to users who provide the size of their house in square meters. What type of skew is this?
