How to Streamline ML Model Deployment? Automated Sanity Checks.

Mehdi Mohammadi
Intuit Engineering
Apr 28, 2023 · 7 min read

It’s likely you’ve run into this. You’ve created a machine learning (ML) model and trained it with plenty of data. Validation results look very promising, so you ship the model to pre-production and suddenly something’s not right. The model hasn’t failed outright, but your metrics don’t align with what you observed during training and evaluation. Now you have to block out time for a sanity check to figure out what went wrong and how to fix it.

But if you could automate that sanity check, you could just…fix it. That could save you (and everyone on your team) a lot of energy and stress.

At Intuit, we deploy ML models across a range of mission-critical systems on our financial technology platform to deliver personalized AI experiences (conversational AI, recommendations, and security, risk and fraud detection, to name just a few) to more than 100 million consumer and small business customers with TurboTax, Credit Karma, QuickBooks, and Mailchimp. To streamline deployment, reduce opportunities for human error, and ensure models behave as expected from development through production, we employ an automated sanity check.

Model sanity check: catching sources of systematic errors or bias in pre-production

It’s critical to ensure ML models are reliable and accurate before deploying them to production. Offline validation tests help catch design issues such as overfitting or underfitting. However, pre-production environments can still introduce glitches or systematic errors that make a model unfit for production. These issues are referred to as ML bias: an ML model produces systematically prejudiced results in favor of or against one category of the model input. For example:

  • Sampling bias, introduced by differences between datasets used for offline training and validation compared to online datasets your model encounters in pre-production.
  • Prediction bias, where the gap between the average predicted outcome and the average observed real-world outcome differs between offline and online datasets.
  • Prediction distribution bias, where the distribution of online predictions does not conform to the distribution of the labels in the training and validation set. For example, if a model produces normally distributed forecasts in the training and validation phase, we expect the distribution of future values to follow a normal distribution as well.
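To make these signals concrete, here is a minimal sketch (not production code) of how prediction bias and prediction distribution bias might be quantified. The score arrays are synthetic stand-ins, and the KS statistic is just one simple choice of distribution-shift measure.

```python
# Rough sketch of two of the bias signals above; the arrays are synthetic stand-ins.
import numpy as np
from scipy.stats import ks_2samp

def prediction_bias(predicted: np.ndarray, observed: np.ndarray) -> float:
    """Difference between the average predicted outcome and the average observed outcome."""
    return float(predicted.mean() - observed.mean())

def prediction_distribution_shift(offline_scores: np.ndarray, online_scores: np.ndarray) -> float:
    """Two-sample KS statistic as one simple measure of prediction distribution bias."""
    result = ks_2samp(offline_scores, online_scores)
    return float(result.statistic)

rng = np.random.default_rng(42)
offline_scores = rng.normal(0.40, 0.10, 10_000)   # stand-in for offline predictions
online_scores = rng.normal(0.45, 0.12, 10_000)    # stand-in for online predictions
online_observed = rng.binomial(1, 0.42, 10_000)   # stand-in for observed outcomes

print("prediction bias (online):", prediction_bias(online_scores, online_observed))
print("distribution shift vs. offline:", prediction_distribution_shift(offline_scores, online_scores))
```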

A machine learning model sanity check is a set of tests performed in a pre-production environment to detect these sorts of systematic errors and biases, so you can ensure models work as expected before deploying them to production. At Intuit, we treat model bias detection and prevention as an integral part of every model evaluation and deployment.

Figure 1. ML model sanity check steps

The four basic steps of a sanity check

A typical ML sanity check should include the four steps depicted in Figure 1:

1. Ensure online model scores sink to the output store. This step is the most straightforward. In a properly configured system, we can expect the model scores to sink to the output location. We can verify this by successfully fetching model scores from the expected output. For example, if the model scores are supposed to be written to a Hive table, we can execute a simple SQL query to fetch the predictions.

If not, one of the following may be responsible:

  • Lack of access or inadequate permission to access the output store.
  • Inability to read from input data due to access permissions.
  • Bugs in the ML pipeline that prevent the model from producing results.
  • Failure of infrastructure.
  • Improper integration between the data pipeline and the ML service.
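As a concrete illustration of the step-1 fetch, here is a minimal PySpark sketch, assuming the scores land in a Hive table. The database, table, and column names are hypothetical placeholders.

```python
# Minimal sketch of the step-1 check; the table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

scores = spark.sql("""
    SELECT record_id, model_score, score_timestamp
    FROM ml_output.online_model_scores
    WHERE score_date = current_date()
""")

# If nothing comes back, one of the failure modes listed above is likely at play.
if scores.count() == 0:
    raise RuntimeError("No online scores were written to the output store today.")
```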

2. Rescore with the offline model. Once we confirm that our model is producing predictions, the next step is to feed a sample set of the online input data to the offline model and ensure we get the same result. At this point, it’s a good idea to double-check your input data to be sure your test sample represents all the categories your model intends to predict and that those samples cover the entire score range. For example, if your sample consists of 100 emails to check for spam and all of them belong to the non-spam class, then your sample data is questionable and the prediction results are misleading. If the offline and online scores do not match for a representative sample of input data, then we dig more deeply into the predictions being made.

Note that although we would generally expect the same prediction for a given input dataset from the online and offline models, a certain level of difference may be tolerable in some situations. For example, a minor difference between offline and online predictions would not be a red flag for algorithms such as Bayesian neural networks, which include a stochastic component that introduces randomness into the prediction process. Likewise, differences between machines’ libraries and dependencies might produce acceptable variations in model results.
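One way this comparison might be automated is sketched below. The tolerance value is an illustrative choice, and the arrays are toy stand-ins for the offline model’s predictions on the sampled online inputs and the corresponding online scores.

```python
# Sketch of step 2: compare offline rescoring against online scores on the same sample.
import numpy as np

def compare_scores(offline_scores: np.ndarray, online_scores: np.ndarray,
                   tolerance: float = 1e-6) -> int:
    """Count sampled predictions where offline and online scores disagree beyond tolerance."""
    diffs = np.abs(offline_scores - online_scores)
    mismatched = int((diffs > tolerance).sum())
    print(f"{mismatched} of {len(diffs)} sampled predictions differ beyond tolerance "
          f"(max diff = {diffs.max():.3g})")
    return mismatched

# Toy stand-ins: in practice these come from rescoring a representative sample
# of the online inputs with the offline model and fetching the matching online scores.
rng = np.random.default_rng(0)
offline = rng.uniform(size=100)
online = offline + rng.normal(0.0, 1e-8, size=100)  # near-identical, as expected
compare_scores(offline, online)
```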

3. Compare the online model score to the offline model score. In this stage, we check and validate the statistical characteristics of the online model scores against the scores on the offline dataset, as well as the test and validation datasets used before deploying to pre-production. If the model is working properly, we would expect the distribution of the online predictions to closely follow the pattern of the offline predictions and the training/test predictions. A large mismatch (for example, a 20% difference between the proportions of predictions falling into corresponding score bins) is a red flag that the model is not performing well.

If this step uncovers significant variance between online and offline models, we need to inspect our input data for potential issues.
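Below is one way such a comparison might look in code, using equal-width score bins and the 20% rule of thumb mentioned above, interpreted here as a relative difference in bin proportions. The sample arrays are synthetic, and the threshold and bin count are illustrative choices.

```python
# Sketch of step 3: compare the proportion of predictions per score bin,
# online vs. offline, and flag bins that differ by more than ~20%.
import numpy as np

def bin_proportions(scores: np.ndarray, bins: int = 10) -> np.ndarray:
    counts, _ = np.histogram(scores, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()

def flag_shifted_bins(offline_scores, online_scores, threshold: float = 0.20):
    offline_p = bin_proportions(np.asarray(offline_scores))
    online_p = bin_proportions(np.asarray(online_scores))
    # Relative difference per bin; the epsilon guards against empty offline bins.
    rel_diff = np.abs(online_p - offline_p) / np.maximum(offline_p, 1e-9)
    return [i for i, d in enumerate(rel_diff) if d > threshold]

# Synthetic example: a genuine shift in the online score distribution gets flagged.
rng = np.random.default_rng(1)
offline = rng.beta(2, 5, 50_000)
online = rng.beta(3, 4, 50_000)
print("score bins exceeding the 20% mismatch threshold:", flag_shifted_bins(offline, online))
```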

4. Check input data. This involves examining the characteristics of the input features fed to the online model and comparing them with those of the train/test/validation data. The checks here can include the following (two of them are sketched in code after the list):

  • Verifying that all features are available in the input. All the features associated with weights contribute to generating prediction scores. A missing feature could corrupt the actual score.
  • Checking online storage to make sure no issues exist that could have a negative effect on input data. That may include access permissions, timeout settings, checking data locations, etc.
  • Validating the proportion of null values for features. For example, if we observed 10% null values in our training/validation set, then seeing 20% null values online could point to a problem with the model’s input data or an upstream service.
  • Validating the distribution of feature values and comparing it with the counterparts in the train/test/validation set. The distributions should fall into very similar shapes for each feature. A large deviation suggests the online input data has been corrupted somewhere in the data pipeline and needs troubleshooting.
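Two of these checks, null rates and categorical feature drift, might look like the following pandas sketch. The thresholds are illustrative choices, not fixed rules; features missing entirely from the online data are treated as 100% null, which also covers the “all features are available” check above.

```python
# Sketch of two step-4 input checks; thresholds are illustrative choices.
import pandas as pd

def features_with_elevated_nulls(online: pd.DataFrame, train: pd.DataFrame,
                                 factor: float = 2.0) -> list:
    """Flag features whose online null rate is well above the training null rate."""
    online_nulls = online.isna().mean()
    train_nulls = train.isna().mean()
    return [col for col in train.columns
            if online_nulls.get(col, 1.0) > factor * max(train_nulls[col], 0.01)]

def drifting_categories(online: pd.Series, train: pd.Series,
                        threshold: float = 0.10) -> list:
    """Flag categories whose share of the online data drifts from the training share."""
    online_p = online.value_counts(normalize=True)
    train_p = train.value_counts(normalize=True)
    # Categories seen in only one of the two datasets count as fully drifted.
    diffs = (online_p - train_p).abs().fillna(1.0)
    return diffs[diffs > threshold].index.tolist()
```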

Figure 2 shows two diagrams comparing the distributions of a categorical feature and a scalar feature. Looking at Categorical_Feature_1 on the left, we observe that the distributions of the four categories in the online data are close to those in the test data, so we can trust that our test data represents the online data. Likewise, for Continuous_Feature_1 on the right, we see that the online and test data follow a similar distribution pattern.

Figure 2: Normalized distribution of feature values in online input (production) vs. test set.

Automate the process to make it even saner

To save time and reduce human error, we can embed this sanity check process in an automation workflow. The tools available for this depend on your organization’s tech stack. For example, each step can be invoked through AWS Step Functions, or the process could be implemented as a Kubeflow pipeline or an Apache Airflow job with all the steps chained together. Each step can also be implemented as a Python or PySpark script. However you deploy the checks, you’ll want to generate a comprehensive report containing the analysis results of each step. This process can run in the background, saving the considerable time it would take to perform these checks manually and greatly reducing the potential for human error.
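An Apache Airflow version of that chain might look roughly like the sketch below. The `sanity_checks` module and its functions are hypothetical wrappers around the four steps described above, not part of an actual library.

```python
# Minimal Airflow sketch chaining the sanity-check steps; `sanity_checks` is hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

import sanity_checks  # hypothetical module wrapping the four steps above

with DAG(
    dag_id="model_sanity_check",
    start_date=datetime(2023, 4, 1),
    schedule_interval=None,  # triggered as part of the deployment workflow
    catchup=False,
) as dag:
    fetch_scores = PythonOperator(task_id="fetch_online_scores",
                                  python_callable=sanity_checks.fetch_online_scores)
    rescore = PythonOperator(task_id="rescore_with_offline_model",
                             python_callable=sanity_checks.rescore_offline)
    compare = PythonOperator(task_id="compare_score_distributions",
                             python_callable=sanity_checks.compare_distributions)
    check_inputs = PythonOperator(task_id="check_input_data",
                                  python_callable=sanity_checks.check_input_data)
    report = PythonOperator(task_id="generate_report",
                            python_callable=sanity_checks.generate_report)

    fetch_scores >> rescore >> compare >> check_inputs >> report
```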

Streamlining model development and deployment

ML models are a critical tool for improving the ease of use and security of a wide range of consumer- and enterprise-focused applications. Having an automated model sanity check in place can help teams deploy their models to production more confidently by ensuring models behave as designed, and by reducing the resources needed to diagnose and fix models when their behavior doesn’t meet expectations.

As a result, teams can focus their energies on developing new and better models that they can then deploy to the benefit of their end users.

Many thanks to the team!

I would like to thank Daniel Viegas, Andreas Mavrommatis, Richard Chang, Lin Tao, and April Liu for their invaluable contributions!
