I recently read the article “What I learned getting acquired by Google” by Shreyans Bhansali. Shreyans wrote:

On the other hand there was the discovery that most Search improvements are manually reviewed by engineers through ‘side-by-side’ comparisons between old and new results…on spreadsheets!

The above quote reminded me of how hard, and how often understated, quality assurance (QA) in AI/ML systems is. Each change to a model needs to be validated, and validation is hard and cumbersome. The fact that models have a freshness - that they go stale over time - does not help either; it means that quality assurance must be done continuously and treated as a service level.

To make my case, I thought I would share a tale about a classification system I used to work on.

A tale

At a former employer we had a system that categorized a stream of financial transactions using ML. For example, “McDonald’s” was categorized as “Restaurant”, “H&M” as “Clothing”, and so on. If the model was uncertain, we set the category to “Uncategorized”. Users could adjust transactions that had been categorized incorrectly. Our goal was to measure how accurately the ML model applied these categories.

Was this category (in)correct?

Initially, we considered asking the user for explicit feedback (“Was this category correct?”) in the UI. However, we did not want to bloat our UX. So we asked ourselves: could we somehow figure out whether our classification was accurate without changing the UX?

Our first iteration was a service level based on the ratio between the “number of manual corrections” and the “total number of categorizations”. This did not work very well, for two big reasons. Partly because it varied immensely between users, depending on how eager they were to adjust incorrectly categorized data. But mostly because many users only adjusted a category when they had bought something different from what the merchant usually sells; i.e. buying “makeup” from “H&M” instead of “Clothing”. This made our numbers look much worse than they actually were! And we never got any positive signal for the classifications that were correct.
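For illustration, here is a minimal sketch of that first metric in Python. The transaction structure and field names are made up for the example; the point is that every user change counts against us, including the “makeup from H&M” kind:

```python
# Minimal sketch of our first service level (field names are hypothetical).
# Note that *every* user change counts as a "correction", including the
# "bought makeup at H&M" case, which is what skewed the numbers.

def correction_rate(transactions) -> float:
    """Share of auto-categorized transactions that users later changed."""
    categorized = [t for t in transactions
                   if t["predicted_category"] != "Uncategorized"]
    if not categorized:
        return 0.0
    corrected = sum(1 for t in categorized
                    if t.get("user_category")
                    and t["user_category"] != t["predicted_category"])
    return corrected / len(categorized)
```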

Our next take was to deliberately leave 1% of all financial transactions uncategorized, forcing our users to set the correct category themselves. When they did set it, we compared what they chose against what our model would have guessed. Our service level was then defined as the “number of adjustments that matched the ML model’s guess” divided by the “total number of adjustments”.
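In sketch form (names invented, and heavily simplified compared to the real service), the mechanism looked roughly like this:

```python
import random

HOLDOUT_RATE = 0.01  # the ~1% of transactions we deliberately left uncategorized

def hold_out() -> bool:
    """Decide whether to withhold the model's category and let the user choose."""
    return random.random() < HOLDOUT_RATE

def holdout_agreement(held_out_transactions) -> float:
    """Service level: share of user-chosen categories matching the hidden model guess."""
    adjusted = [t for t in held_out_transactions if t.get("user_category")]
    if not adjusted:
        return 0.0
    matches = sum(1 for t in adjusted
                  if t["user_category"] == t["model_guess"])
    return matches / len(adjusted)
```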

Randomly not categorizing 1% of all financial transactions was a good idea! But it turned out to cause a surprising backlash from users; they perceived our classification accuracy as having become significantly worse:

“Why are you unable to categorize ‘McDonalds’?? C’mon, I expect better from your product!”

It turned out we were skipping classification even for things we were certain about. Could we do better?

We were lucky that our ML model could output a certainty measure in [0,1] for each classification. We started basing the probability of skipping a transaction’s categorization on the inverse of that certainty. That meant “McDonald’s”, which had a high certainty, was rarely skipped anymore. Good!
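As a sketch of the idea, reading the “inverse” as 1 - certainty (the weighting and constants here are illustrative, not the ones we actually shipped):

```python
import random

BASE_HOLDOUT_RATE = 0.01  # illustrative budget for held-out transactions

def should_hold_out(certainty: float) -> bool:
    """Skip auto-categorization with a probability weighted by the model's uncertainty.

    A confident prediction such as "McDonald's" (certainty close to 1.0) is
    almost never held out; uncertain predictions are held out far more often.
    """
    skip_probability = BASE_HOLDOUT_RATE * (1.0 - certainty)
    return random.random() < skip_probability
```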

Instead of certainty, we could also have weighted the skipping by the inverse frequency of the description, to avoid common descriptions being randomly skipped. As far as I know, we never pursued that approach.

Through a series of events, our business pivoted: customers started using our classification API and presenting the results in UIs of their own. Since those customers’ users could not always adjust incorrect categories, we had to resort to manual quality assurance, where people would sit and verify that categories were correctly identified. This also made it hard for us to treat ML accuracy as a service level; instead, we had to do more quality assurance before each release. Unfortunately, that led to longer iteration cycles for new models, but at least we knew we could trust the data fairly well.

On the topic of unit tests

We did have unit tests in place for the descriptions of common financial transactions (sketched below), and over time we were building up a suite of common descriptions and the categories they were supposed to map to. However, there were a few problems with this:

  • The freshness I mentioned at the beginning of this post was a problem; unit tests grew old. “Grand Daddy’s house” was once a pizza place in one part of Sweden, and a couple of years later a clothing store in another part of the country. We had to keep updating our unit tests.
  • We ended up having different models for different countries/regions. Having unit tests for all of them didn’t scale. Plus, we did not know the common merchants in all the regions - our company was mostly based out of Sweden.
  • The whole idea of the ML model was that it should scale without us having to do all the classification manually!
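For a flavor of what those tests looked like, here is a rough sketch. The classify function is just a lookup table here so the example runs on its own; in reality the categories came from the ML model:

```python
import unittest

# Sketch of the kind of test suite we built up for common merchant
# descriptions. `classify` stands in for the real model's prediction
# function; here it is a simple lookup so the example is self-contained.

KNOWN_DESCRIPTIONS = {
    "McDonald's": "Restaurant",
    "H&M": "Clothing",
}

def classify(description: str) -> str:
    return KNOWN_DESCRIPTIONS.get(description, "Uncategorized")

class TestCommonDescriptions(unittest.TestCase):
    def test_mcdonalds_is_a_restaurant(self):
        self.assertEqual(classify("McDonald's"), "Restaurant")

    def test_hm_is_clothing(self):
        self.assertEqual(classify("H&M"), "Clothing")

if __name__ == "__main__":
    unittest.main()
```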

In conclusion

Measuring AI/ML model quality takes a lot of creativity, bordering on SRE practices (such as service levels and release strategies) and UX concerns - and there can be some fun surprises along the way. :) Sometimes, manual QA through something like Amazon’s Mechanical Turk is the easiest way to go about it, but if you can somehow build feedback mechanisms into your UX, that is usually a much better way to measure service quality continuously.

The customer is always right. Manual quality assurance will never be as accurate as feedback from actual customers, but it might be good enough.