If your machine learning model has a high correctness score on the holdout test data set, is it safe to deploy it in production?
All models are wrong, but some are useful.
— George E. P. Box (famous British statistician)
But the question I am asking is: Are more correct models more useful?
Recently, we trained a speech recognition model for a customer, and its accuracy exceeded the agreed target. On close examination of the errors, however, we found that the model did particularly poorly at transcribing numbers. Clearly, evaluating model accuracy alone is not sufficient for deciding whether a model is good enough to deploy.
Model Evaluation vs. Model Testing
In machine learning, we mostly focus on model evaluation: metrics and plots summarizing the correctness of a model on an unseen holdout test data set.
Model testing, on the other hand, checks that the model’s learned behavior matches “what we expect.” It is not as rigorously defined as model evaluation. Combing through model errors and characterizing them (as I did for the speech recognition model, which surfaced the problem with numbers) is just one kind of testing.
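To make the speech recognition anecdote concrete, here is a minimal sketch of slice-level error analysis. The transcript pairs are invented for illustration, the `has_number` slice is a hypothetical helper, and exact-match accuracy stands in for a real metric such as word error rate.

```python
import re
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (reference transcript, model transcript)

# Hypothetical data, invented for illustration only.
PAIRS: List[Pair] = [
    ("call me tomorrow", "call me tomorrow"),
    ("turn on the lights", "turn on the lights"),
    ("what is the weather", "what is the weather"),
    ("play some music", "play some music"),
    ("send the report", "send the report"),
    ("open the door", "open the door"),
    ("stop the timer", "stop the timer"),
    ("read my messages", "read my messages"),
    ("room 42 is ready", "room forty two is ready"),
    ("dial 911 now", "dial nine one one now"),
]


def exact_match_accuracy(pairs: List[Pair]) -> float:
    """Fraction of pairs whose transcript equals the reference exactly."""
    if not pairs:
        return float("nan")
    return sum(ref == hyp for ref, hyp in pairs) / len(pairs)


def has_number(pair: Pair) -> bool:
    """Slice membership: the reference transcript contains a digit."""
    return bool(re.search(r"\d", pair[0]))


def accuracy_by_slice(
    pairs: List[Pair], slices: Dict[str, Callable[[Pair], bool]]
) -> Dict[str, float]:
    """Report overall accuracy plus accuracy on each named slice."""
    report = {"overall": exact_match_accuracy(pairs)}
    for name, belongs in slices.items():
        report[name] = exact_match_accuracy([p for p in pairs if belongs(p)])
    return report


report = accuracy_by_slice(PAIRS, {"has_number": has_number})
print(report)
```

On this toy data the overall score looks acceptable while the `has_number` slice fails completely, which is the pattern a single aggregate metric hides.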
For a rundown of pre-train and post-train tests, see Effective testing for machine learning systems by Jeremy Jordan. For some example test cases, see How to Test Machine Learning Code and Systems by Eugene Yan.
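Two behavioral checks often used for post-train testing are invariance (a change that should not matter does not change the prediction) and directional expectation (a change that should matter moves the prediction the expected way). The sketch below assumes a toy keyword-based sentiment scorer, `toy_sentiment_score`, as a hypothetical stand-in for a trained model; it is not taken from either article.

```python
from typing import Callable

POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "terrible", "awful"}


def toy_sentiment_score(text: str) -> int:
    """Hypothetical stand-in for a model: positive minus negative word count."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def check_invariance(predict: Callable[[str], int], text: str, perturbed: str) -> bool:
    """The prediction should not change under an irrelevant perturbation."""
    return predict(text) == predict(perturbed)


def check_directional(predict: Callable[[str], int], text: str, more_negative: str) -> bool:
    """Adding negative content should strictly lower the score."""
    return predict(more_negative) < predict(text)


# Swapping a city name should not affect sentiment.
invariance_ok = check_invariance(
    toy_sentiment_score,
    "the food in Paris was great",
    "the food in Berlin was great",
)

# Appending a complaint should make the score more negative.
directional_ok = check_directional(
    toy_sentiment_score,
    "the food was great",
    "the food was great but the service was terrible",
)

print(invariance_ok, directional_ok)
```

With a real model, `predict` would wrap the inference call, and each check would typically run over many templated input pairs rather than a single one.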
Model Explainability or Model Interpretability
Explainability (or interpretability) is the degree to which a model’s outcome can be understood by humans and its decision-making “logic” can be explained. At least for non-DNN models, this is a very important part of testing.
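One simple, model-agnostic way to probe what a model relies on is permutation importance: shuffle one feature column and measure how much accuracy drops. The following is a minimal pure-Python sketch under stated assumptions; `toy_model` and the data are hypothetical, and the model deliberately uses only the first feature.

```python
import random
from typing import Callable, List


def accuracy(predict: Callable[[List[int]], int],
             rows: List[List[int]], labels: List[int]) -> float:
    """Fraction of rows for which the prediction equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict: Callable[[List[int]], int],
                           rows: List[List[int]], labels: List[int],
                           n_features: int, seed: int = 0) -> List[float]:
    """Accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values.
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(predict, shuffled, labels))
    return importances


def toy_model(row: List[int]) -> int:
    """Hypothetical model that looks only at feature 0."""
    return row[0]


# Hypothetical data: the label equals feature 0; feature 1 is irrelevant.
rows = [[i % 2, (i * 7) % 5] for i in range(40)]
labels = [r[0] for r in rows]

importances = permutation_importance(toy_model, rows, labels, n_features=2)
print(importances)
```

Shuffling the irrelevant feature leaves accuracy untouched, while shuffling the feature the model depends on lowers it. If a feature you expect to be irrelevant turns out to have high importance, that is worth investigating before deployment.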
10 Types of ML Tests
Dr. Srinivas Padmanabhuni lists 10 types of tests that cover model evaluation, model testing, inference latency, etc., in 10 Tests for your AI/ML/DL model:
Randomized Testing with Train-Validation-Test Split: Typical test to measure model accuracy on unseen data.
Cross-Validation Techniques: Measure performance over several iterations of the splits of the data, e.g., K-Fold, LOOCV, Bootstrap.
Explainability Tests: Useful when models (like DNNs) are not interpretable. Mainly of two types: model-agnostic and model-specific tests.
Security Tests: To guard against adversarial attacks that use poisoned data to fool the model. Again, two varieties: white-box (with knowledge of model parameters) and black-box.
Coverage Tests: A systematic approach to ensure that unseen data is diverse enough to cover broad varieties of input scenarios.
Bias / Fairness Tests: To ensure a model does not discriminate against any demographic group.
Privacy Tests: To prevent privacy attacks/breaches. Model inference should not make it possible to figure out the training data, and the inferred data should not have PII embedded in it.
Performance Tests: Whether the model inference happens within the latency SLAs of the use case.
Drift Tests: To guard against concept/data drift.
Tests for Agency: The closeness of model outcome to human behavior.
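As one illustration from the list above, a data drift test can compare the distribution of a feature in production against the training data. The sketch below uses the Population Stability Index (PSI), which is my choice of technique and not something the referenced article prescribes. The bin edges, the sample data, and the 0.2 alert threshold (a commonly quoted rule of thumb) are all assumptions.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin [edges[i], edges[i+1]).

    Values outside the edge range are counted in the first or last bin.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(expected: Sequence[float], actual: Sequence[float],
                               edges: Sequence[float], eps: float = 1e-6) -> float:
    """PSI = sum over bins of (a - e) * ln(a / e), with fractions floored at eps."""
    e_frac = [max(f, eps) for f in bin_fractions(expected, edges)]
    a_frac = [max(f, eps) for f in bin_fractions(actual, edges)]
    return sum((a - e) * math.log(a / e) for e, a in zip(e_frac, a_frac))


EDGES = [0.0, 0.25, 0.5, 0.75, 1.0]
DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb alert level

# Hypothetical feature values: training data spread evenly over [0, 1).
training = [i / 100 for i in range(100)]
# Hypothetical production data concentrated in the upper half of the range.
production = [0.5 + i / 200 for i in range(100)]

psi_same = population_stability_index(training, training, EDGES)
psi_shifted = population_stability_index(training, production, EDGES)
drift_detected = psi_shifted > DRIFT_THRESHOLD
print(psi_same, psi_shifted, drift_detected)
```

Identical distributions give a PSI of zero, while the shifted sample lands far above the threshold. In practice such a check would run on a schedule against recent production inputs.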
Detailed Examples with Code Samples
Similar to Eugene Yan’s article, but longer and with a different emphasis and more detailed code examples:
Testing ML Systems: Code, Data and Models (in Made With ML) by Goku Mohandas
Snorkel Intro Tutorial: Data Slicing by Snorkel AI
Machine learning testing is quite different from software testing, and it is not yet as mature or as well understood as traditional testing.
Before deploying to production, do not focus solely on model evaluation; also test the model on slices, runtime performance, bias, security, etc.
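As a small example of a runtime performance check, the sketch below times repeated calls to a prediction function and compares the 95th-percentile latency with a budget. `toy_predict` is a hypothetical stand-in for real inference, and the 50 ms budget is an assumed SLA, not a recommendation.

```python
import math
import time
from typing import Callable, List, Sequence


def measure_latencies_ms(predict: Callable[[Sequence[float]], float],
                         inputs: List[Sequence[float]],
                         warmup: int = 5) -> List[float]:
    """Time each call in milliseconds after a few untimed warm-up calls."""
    for x in inputs[:warmup]:
        predict(x)
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def toy_predict(features: Sequence[float]) -> float:
    """Hypothetical stand-in for model inference."""
    return sum(features)


LATENCY_BUDGET_MS = 50.0  # assumed SLA for illustration

inputs = [[float(i), 1.0, 2.0] for i in range(200)]
latencies = measure_latencies_ms(toy_predict, inputs)
p95_ms = percentile(latencies, 95)
within_budget = p95_ms < LATENCY_BUDGET_MS
print(f"p95 latency: {p95_ms:.4f} ms, within budget: {within_budget}")
```

A tail percentile is used instead of the mean because latency SLAs are usually about the slowest requests. Measurements like this are only meaningful on hardware and batch sizes that resemble production.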