This post kicks off a new series on performance metrics, with a brief discussion of the most widely-known approach: accuracy = # correct/total. Key takeaway: accuracy is most useful when the data set is balanced.

Consider a data set of cat and dog photos. You want to build a model to determine if a photo is of a cat or a dog. Let’s say you have a naive model that always guesses “cat.” Here’s how the accuracy changes based on the skew of the data:

| # of cat photos | # of dog photos | Accuracy of a naïve model that always guesses “cat” |
|---|---|---|
| 10 | 90 | 10% |
| 20 | 80 | 20% |
| 50 | 50 | 50% (balanced data set) |
| 80 | 20 | 80% |
| 90 | 10 | 90% |
| 99 | 1 | 99% |
| 5,000,000 | 150 | 5,000,000 / 5,000,150 ≈ 99.997% (very imbalanced data set) |

As you can see, the accuracy of a silly model that always guesses “cat” can still be very high if the data set contains vastly more cat photos than dog photos. This model knows nothing about what cats or dogs look like, but if we judged it by accuracy alone, we might think it was performing well.

One risk when training a machine learning model on an imbalanced data set is that the model may learn to always output the majority class as its prediction. This is not a useful model, but it will achieve high accuracy.
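The effect above is easy to reproduce. Here is a minimal sketch in plain Python (no libraries assumed; the function names are illustrative) that scores an always-“cat” model against data sets with different skews:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def naive_cat_model(photos):
    """A 'model' that ignores its input and always guesses 'cat'."""
    return ["cat"] * len(photos)

for n_cats, n_dogs in [(10, 90), (50, 50), (99, 1)]:
    labels = ["cat"] * n_cats + ["dog"] * n_dogs
    preds = naive_cat_model(labels)  # labels stand in for the photos here
    print(f"{n_cats} cats, {n_dogs} dogs -> accuracy {accuracy(preds, labels):.0%}")
```

The model's accuracy tracks the class skew exactly: it equals the fraction of cat photos, whatever that happens to be.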

Here are tips about how to use accuracy for judging machine learning models:

• If your classes are balanced (equal numbers of cats and dogs), accuracy can be a useful performance metric.
• If your classes are not balanced, calculate the “naive model accuracy” as (# of examples in majority class) / (total examples). Then, when you look at the accuracy of your machine learning model, compare it to the accuracy of the naive model.
• For example, with 80 cat photos and 20 dog photos, the naive accuracy is 80%; a machine learning model achieving 80% on this data set is doing no better than a naive model.
• On the other hand, a machine learning model achieving 95% has learned something!
• If your classes are not balanced, you should also calculate performance metrics that are more informative than accuracy, such as the area under the receiver operating characteristic curve (aka AUROC, AUC, or c-statistic) or the area under the precision-recall curve (AUPRC).
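The baseline comparison described above can be sketched as a small helper. This is illustrative code, not from any particular library; the function names are my own:

```python
from collections import Counter

def naive_baseline(labels):
    """Accuracy of a naive model that always predicts the majority class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

def beats_baseline(model_accuracy, labels):
    """True if the model outperforms always guessing the majority class."""
    return model_accuracy > naive_baseline(labels)

labels = ["cat"] * 80 + ["dog"] * 20
print(naive_baseline(labels))        # baseline accuracy for this skew
print(beats_baseline(0.80, labels))  # no better than the naive model
print(beats_baseline(0.95, labels))  # the model has learned something
```

A model's accuracy only means something relative to this baseline, which is why the baseline should be computed and reported alongside it for any imbalanced data set.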

That’s it for accuracy! Future posts in this series on performance metrics will discuss AUROC and AUPRC.

Photo credits: kitten, puppy