Interpreting Accuracy, Precision, Recall, and F1 Score Metrics

Saif Rahman
4 min read · May 29, 2021



When it comes to evaluating ML classification models, the different metrics can be confusing. In this article, I’ll explain the differences in the most popular classification metrics.

I will be using the following terms:

True Positive (TP): number of positive data points correctly predicted as positive

True Negative (TN): number of negative data points correctly predicted as negative

False Positive (FP): number of negative data points incorrectly predicted as positive

False Negative (FN): number of positive data points incorrectly predicted as negative
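
Here’s a minimal sketch of how you might pull these four counts out of a set of predictions using scikit-learn’s confusion_matrix (the labels below are made up purely for illustration):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels (1 = positive, 0 = negative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

# For binary labels [0, 1], ravel() returns the counts in this fixed order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3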

Accuracy

The accuracy metric is the easiest to interpret. Essentially, it measures how well your model did as a whole. It takes the total number of correctly predicted data points and divides it by the total number of data points.

Let’s say you have 50 total data points, and your model correctly predicts 20 true positives and 15 true negatives. The accuracy of your model is then 35/50 = 70%. That’s a decent model. However, it would be nice to have more detail about which types of data points the model was better at predicting, and that’s where the other metrics come in.
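
In code, accuracy is just one division. Here’s a quick sketch using the hypothetical numbers from the example above:

# Counts from the example above
tp, tn = 20, 15    # correctly predicted positives and negatives
total = 50         # total number of data points

accuracy = (tp + tn) / total
print(accuracy)  # 0.7, i.e. 70%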

Precision

The precision score measures how well your model predicts positive observations. It takes the total number of correctly predicted positive observations and divides it by the total number of predicted positive observations.

Let’s say you have a model that predicts whether an email is spam or not. Assume there are 100 total data points. Your model has 10 true positives and 60 true negatives, so the accuracy of your model is 70/100, or 70%. You might assume your model is pretty good. However, let’s say your model also had 20 false positives, meaning that of the 30 emails it labeled as spam, 20 were labeled incorrectly. The precision score works out to 10/(10 + 20) = 10/30, approximately 33%. That means your model isn’t very precise at identifying spam, and could cause a user to miss an important email.
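
Using the spam example’s numbers, a small sketch of the precision calculation:

tp = 10  # emails correctly flagged as spam
fp = 20  # legitimate emails incorrectly flagged as spam

# Precision = correctly predicted positives / all predicted positives
precision = tp / (tp + fp)
print(round(precision, 2))  # 0.33, i.e. ~33%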

Recall

The recall metric measures how many of all the actual positive observations your model correctly predicted. It takes the total number of correctly predicted positive data points and divides it by the total number of actual positive data points.

Let’s say you have a model that labels a tumor as malignant (cancerous) or benign (noncancerous). You run your model on 2,000 data points and see that it has 200 true positives and 1,200 true negatives, meaning an accuracy score of 70%. Again, pretty good. However, let’s say there were actually 600 instances of malignant tumors, and your model only detected 200 of them, a recall of just 33%. That would be extremely dangerous to deploy into production, as the model missed most of the tumors, potentially delaying treatment.

On the other hand, let’s say you run your model on 2,000 data points and see that it has 200 true positives and 500 true negatives, meaning an accuracy score of 35%, which isn’t so good. But if there were only 250 actual cases of malignant tumors, your model detected 200/250 = 80% of the cancerous tumors, a much better recall.
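
Both scenarios boil down to the same calculation. Here’s a small sketch plugging in the numbers from the two tumor examples above:

def recall(tp, fn):
    # Recall = correctly predicted positives / all actual positives
    return tp / (tp + fn)

# First scenario: 600 actual malignant tumors, 200 detected
print(round(recall(tp=200, fn=400), 2))  # 0.33

# Second scenario: 250 actual malignant tumors, 200 detected
print(recall(tp=200, fn=50))  # 0.8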

F1 Score

The F1 score takes the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). Because it combines the two, it accounts for both false positives and false negatives, which accuracy alone can hide.

Let’s say your malignant tumor prediction model has a precision score of 10% (0.1) and a recall of 90% (0.9). The F1 score would be 2 × (0.1 × 0.9) / (0.1 + 0.9) = 0.18, or 18%. The low score reflects the poor precision: even though recall is high, the model produces far too many false positives.
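
The same calculation in code, using those precision and recall values:

precision, recall = 0.1, 0.9

# F1 is the harmonic mean of precision and recall
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 2))  # 0.18, i.e. 18%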

Overall, choosing a scoring metric depends on your use case; there is no single right metric. If your business can afford lots of false positives and false negatives, accuracy may not be a bad metric. But if you’re creating a model to predict cancerous tumors, false negatives are much more serious, so recall or the F1 score may be better. It’s always about context!
