Stage 5.4. Model Testing
Testing in machine learning differs from traditional software testing, and this section only gives an intuition for how it differs. For more detail, see Eugene Yan's series of articles on testing in machine learning.
Eugene points out that traditional software takes some input data and applies handcrafted logic to produce output data, which a test then compares against the expected output. This is a deterministic process. In machine learning, by contrast, we start with both input data and output data, and a suitable learning algorithm \(\mathcal{A}\) learns a model \(\mathcal{G}\) that predicts the output from the input. The logic is learned rather than handwritten, so to test it we have to load the trained model, run it on some input data, and compare its output against the expected output. It is also common to compare the loss at each epoch against a threshold to check that the model is learning.
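To make the contrast concrete, here is a minimal sketch of testing learned logic, using only the Python standard library. The toy data (generated from y = 2x + 1), the helper names `train_linear` and `predict`, and the threshold values are all assumptions made for illustration. They do not come from Eugene's articles or from any particular framework.

```python
import random


def train_linear(xs, ys, epochs=200, lr=0.05, seed=0):
    """Fit y ~ w*x + b with full-batch gradient descent; return (w, b, losses)."""
    rng = random.Random(seed)  # fixed seed keeps the test reproducible
    w, b = rng.uniform(-0.1, 0.1), 0.0
    losses = []
    n = len(xs)
    for _ in range(epochs):
        # Mean squared error over the whole dataset, recorded before the update.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errors) / n)
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses


def predict(w, b, x):
    """Run the learned model on a single input."""
    return w * x + b


# Toy training data generated from y = 2x + 1 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w, b, losses = train_linear(xs, ys)

# Check 1: the per-epoch loss ends below a hypothetical threshold,
# and is lower than where it started, so the model is learning.
LOSS_THRESHOLD = 1e-2
assert losses[-1] < LOSS_THRESHOLD
assert losses[-1] < losses[0]

# Check 2: run the learned model on an unseen input and compare against
# the expected output, within a tolerance rather than exact equality.
assert abs(predict(w, b, 5.0) - 11.0) < 0.05
```

Unlike a test of handcrafted logic, the comparison against the expected output uses a tolerance, because the parameters are learned and only approximate the underlying relationship.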
In my own experience, I always prepare a debug dataset, usually sampled from the training dataset (stratified, and grouped if needed). This dataset can serve as a fixture in your tests. More importantly, you should also use it to sanity-check your training pipeline. For example, as Eugene and Karpathy suggest, run the model for a certain number of steps and check that the loss is decreasing. You can even deliberately overfit the debug dataset to see whether your model \(\mathcal{G}\) has the capacity to learn the data. Furthermore, running the model on the full dataset during hyperparameter tuning is very expensive, so you can use the debug dataset to check that the model responds sensibly to the hyperparameters (e.g. with a learning rate finder).
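The debug-dataset workflow could look roughly like the sketch below: draw a stratified sample from the training set, train on it for a fixed number of steps, then check that the loss decreased and that the model fits the sample. Everything here is hypothetical and chosen for illustration: the helpers `stratified_sample`, `train_logistic` and `accuracy`, the synthetic one-dimensional data, and the hyperparameters. A real pipeline would plug in its own model and data loader.

```python
import math
import random
from collections import Counter, defaultdict


def stratified_sample(rows, fraction, seed=0):
    """Sample about `fraction` of the rows from each label, at least one per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in rows:
        by_label[y].append((x, y))
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


def train_logistic(rows, steps=200, lr=0.3):
    """Full-batch gradient descent on 1-D logistic regression; return (w, b, losses)."""
    w, b, losses = 0.0, 0.0, []
    n = len(rows)
    for _ in range(steps):
        loss = grad_w = grad_b = 0.0
        for x, y in rows:
            z = w * x + b
            p = 1.0 / (1.0 + math.exp(-z))
            # Log loss written via the signed margin (2y - 1) * z.
            loss += math.log1p(math.exp(-(2 * y - 1) * z))
            grad_w += (p - y) * x
            grad_b += p - y
        losses.append(loss / n)  # loss recorded before this step's update
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b, losses


def accuracy(w, b, rows):
    """Fraction of rows whose predicted class matches the label."""
    return sum((w * x + b > 0) == (y == 1) for x, y in rows) / len(rows)


# Hypothetical imbalanced training set: 80 negatives, 20 positives,
# separable by the sign of the single feature.
rng = random.Random(42)
train_rows = [(-rng.uniform(1.0, 3.0), 0) for _ in range(80)]
train_rows += [(rng.uniform(1.0, 3.0), 1) for _ in range(20)]

debug_rows = stratified_sample(train_rows, fraction=0.25)
w, b, losses = train_logistic(debug_rows)

# The debug set keeps the 4:1 class ratio of the full training set.
assert Counter(y for _, y in debug_rows) == Counter({0: 20, 1: 5})
# Sanity checks: the loss went down, and the model fits the debug set.
assert losses[-1] < losses[0]
assert accuracy(w, b, debug_rows) == 1.0
```

Because the sample is small, checks like these run in well under a second, which is what makes the debug dataset practical as a test fixture and as a cheap first pass before a full hyperparameter search.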