Street fighting machine learning part 2

Another story from a while back: I was on a small team developing a classifier for an NLP problem. The output would show up in a client-facing interface.

The evaluation procedure was plain CV/stratified CV. It's the standard bible, what could go wrong, right? The CV result was good, even after adjusting for a highly imbalanced dataset, so I could feel confident in the solution.
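
To make that concrete, here is a minimal sketch of that kind of evaluation, assuming a scikit-learn setup. The toy texts and the TF-IDF + logistic regression pipeline are placeholders, not the actual production pipeline.

```python
# A rough sketch of stratified CV with an imbalance-aware score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy, mildly imbalanced stand-in data.
texts = [
    "refund my order please", "i want my money back", "never got a refund",
    "still waiting on my refund", "refund request for last order",
    "please refund me", "my refund has not arrived", "issue a refund now",
    "where is my refund",
    "great service", "love this product", "works perfectly",
]
labels = ["refund"] * 9 + ["praise"] * 3

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Macro F1 weighs the minority class equally, one common adjustment for imbalance.
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1_macro")
print(scores.mean(), scores.std())
```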

But I kept running into this problem: from time to time, a client complained that same-same-but-different texts kept getting misclassified, missing their supposed label, even though we had explicitly added similar examples to the training set.

Missing a totally new case that doesn't look like anything we've seen is at least an acceptable error, epistemologically speaking (how can we know what we don't know, right?). But in this case, we were missing something similar to what we had already seen. Now that is serious, because over time it erodes the client's trust in the classification results.

It became crucial to find a way to QA the classifier's results. It was then that we realized a single test set, even with CV, was not going to give us enough insight into how the model was performing in production. We needed to know exactly how the model would behave in different situations, and then decide, based on those results, whether or not it could be released.

Does that ring a bell?

Machine learning models as software

In vanilla software development, we write code to perform certain tasks in certain ways. With ML, we kind of write a piece of code that writes code. This saves us a large amount of time figuring out the logic of such software, by telling the machine to learn from examples.

After seeing the data, the model logic is now fully specified. It becomes a deterministic piece of software.

However, not everything the machine learns is what we intended. The model will learn a bunch of random bullshit if we let it. This brings the need to control the behavior of such models. We want the piece of code (that the machine writes) to do things in certain ways before we release it. In other words, we want tests.

Testing ML models like we test software

In traditional software testing, we want our piece of code to perform in a certain, known way. To do that, we think about the cases where it may break and test the code against them. We gain confidence that it will not break when operated in the real world, because we have already seen how it performs in those scenarios.

This is what I lacked in the example above, where all I used to control quality was the CV scores. Now, there is nothing wrong with testing models like this. It is the standard bible of testing, as long as it is confined to academic purposes, where we want to compare algorithms head on.

Testing and QA as part of the software dev process have a different purpose from algorithm testing. When we ship a piece of code, we want to know exactly what it does and how it behaves.
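
To make the contrast concrete, here is a sketch of a scenario-based check written as an ordinary unit test, assuming pytest and scikit-learn. The toy fixture and the paraphrases are illustrative stand-ins; in practice the fixture would load your trained model artifact.

```python
# A sketch of testing a classifier the way we test software:
# pin down a scenario and assert the behaviour we expect.
import pytest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


@pytest.fixture(scope="module")
def model():
    # Stand-in for loading the trained artifact; a toy pipeline keeps
    # the example self-contained.
    texts = [
        "i want my money back", "please refund my order", "refund me now",
        "great product", "love it", "excellent service",
    ]
    labels = ["refund_request"] * 3 + ["praise"] * 3
    return make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)


def test_refund_paraphrases_keep_their_label(model):
    # Same-same-but-different texts should still land on the intended label.
    paraphrases = [
        "i want my money back",
        "please give me my money back",
        "can you refund my last order",
    ]
    assert list(model.predict(paraphrases)) == ["refund_request"] * 3
```

A suite of checks like this, one per scenario, gives the kind of insight the CV score alone never gave us.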

Constructing test cases for ML models

Given that there are so many approaches to modelling, there is no one-size-fits-all when it comes to testing different models. In many domains, we can construct synthetic samples that simulate certain scenarios; this is quite popular in the finance sector, for example.
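
For instance, here is a minimal sketch of templated synthetic samples for a text classifier; the templates, slot values, and expected label are invented for illustration.

```python
# A sketch of generating synthetic scenario samples from templates.
from itertools import product

TEMPLATES = [
    "I never received my {item}, I want a {action}",
    "my {item} arrived broken, please send a {action}",
]
ITEMS = ["order", "package", "laptop"]
ACTIONS = ["refund", "replacement"]

# Every combination becomes one labelled test case for the "delivery_issue" scenario.
synthetic_cases = [
    {"text": t.format(item=i, action=a), "expected_label": "delivery_issue"}
    for t, i, a in product(TEMPLATES, ITEMS, ACTIONS)
]
```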

In other domains, synthetic datasets are harder to come by. That forces us to look more carefully at our data in order to come up with scenario-based test sets. Some algorithms, for instance, work by finding similar samples and grouping them into the same regions.

Those groups can be used to construct thoughtful, real-world test sets. A sample from such a group can serve as a representative of that group of cases. We can later test other models against these test sets and see how they perform in different scenarios. This kind of compact test set also helps us find leads when a model consistently fails a test case.
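
A minimal sketch of that idea, assuming scikit-learn: cluster the vectorized texts, then keep the sample closest to each cluster centre as the representative test case. The toy corpus, TF-IDF features, and cluster count are illustrative choices, not a recommendation.

```python
# A sketch of building a compact test set from groups of similar samples.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min

texts = [
    "refund my order", "i want my money back", "please refund me",
    "where is my package", "my order has not arrived", "tracking says delayed",
    "great product", "love it", "works perfectly",
]

X = TfidfVectorizer().fit_transform(texts)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Index of the sample nearest each centroid: one representative per group.
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
representatives = [texts[i] for i in closest]
print(representatives)
```

Each representative then becomes one named case in the test set, which makes a recurring failure easier to trace back to its group.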

Kernel methods and similarity-based methods are good examples. This also extends to neural networks, which we can think of as linear models built on top of a learned kernel, trained jointly. If we take the output of the layer before the final linear layer, we will most often find that similar samples are located close to each other.
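
Here is a small sketch of that view, assuming PyTorch: take the activations just before the final linear layer and inspect nearest neighbours in that space. The tiny untrained MLP and random inputs are placeholders; in practice you would use your trained model and real features.

```python
# A sketch of using penultimate-layer activations as a similarity space.
import torch
import torch.nn as nn
from sklearn.neighbors import NearestNeighbors


class TinyClassifier(nn.Module):
    def __init__(self, in_dim=300, hidden=64, n_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_classes)  # final linear layer

    def forward(self, x):
        return self.head(self.encoder(x))


model = TinyClassifier()
x = torch.randn(100, 300)  # stand-in for real input features

with torch.no_grad():
    features = model.encoder(x)  # penultimate-layer representation

# Samples the model treats as similar sit close together in this space.
index = NearestNeighbors(n_neighbors=5).fit(features.numpy())
distances, neighbours = index.kneighbors(features.numpy())
```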

Just like we have different testing paradigms for different kinds of software, constructing test sets will also differ when we choose different ML algorithms. Testing classification models will be different from testing regression models.

We’ll explore the technical details in the next post, where we’ll go through an example of constructing such test cases.

Written on July 7, 2022