Precision or Recall: When to Choose What? Can F1-Score Help?

Vishal Kumar
6 min read · Jul 7, 2022


Is Accuracy a good measure?

When we talk about measuring the performance of a machine learning classification model, the first term that comes to mind is Accuracy, which is simple yet powerful. But can accuracy really help us understand the model in every scenario? The answer is no. Let's see when accuracy can fail to tell us whether a model is good or not.

Test data used for evaluating the model

Now let’s assume the model we’re testing is a “dumb model”, meaning it classifies every input into one class only. For our example, let’s assume that class is the negative class. We will denote the negative class as ‘0’ and the positive class as ‘1’.

Failure case of Accuracy Measure

In the above image, there are 100 data points used to test the model, out of which 90 belong to the negative class and 10 belong to the positive class. Since our model is dumb and predicts the negative class for every input, there are no True Positive (TP) predictions, yet the accuracy comes out to 0.9, i.e. 90%.

If you look at the number alone, 90% seems very good, but if you look closely at the confusion matrix, every prediction has been made in favour of the negative class. Hence we cannot look at accuracy alone and decide whether the model is good or not.
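To see this concretely, here is a minimal sketch (not from the original article) that reproduces the example: 90 negative points, 10 positive points, and a dumb model that predicts negative for everything. It uses scikit-learn's accuracy_score and confusion_matrix purely for illustration.

```python
# A minimal sketch of the "dumb model" example above (data invented to match it).
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0] * 90 + [1] * 10   # 90 actual negatives, 10 actual positives
y_pred = [0] * 100             # the dumb model predicts negative every time

print(accuracy_score(y_true, y_pred))    # 0.9 -> looks great
print(confusion_matrix(y_true, y_pred))  # [[90  0]
                                         #  [10  0]] -> TP = 0, the model is useless
```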

Other performance measures, such as Precision and Recall, can be used in such cases. But before that, let's understand the confusion matrix in detail.

Understanding Confusion Matrix in detail:

Confusion Matrix

Terminologies:

TN: True Negative (All the points which are actually negative and predicted negative)

FN: False Negative (All the points which are actually positive but predicted negative)

FP: False Positive (All the points which are actually negative but predicted positive)

TP: True Positive (All the points which are actually positive and predicted positive)

Predicted Negative: All the points which are predicted negative by the model.

Predicted Positive: All the points which are predicted positive by the model.

Actual Negative: All the points which are actually negative.

Actual Positive: All the points which are actually positive.

There are several insights we can get from the confusion matrix, such as:

Accuracy = (TP+TN)/(TP+TN+FP+FN)

True Positive Rate (TPR) = TP/(TP+FN)

True Negative Rate (TNR) = TN/(TN+FP)

False Positive Rate (FPR) = FP/(TN+FP)

False Negative Rate (FNR) = FN/(TP+FN)

TPR: Tells us the proportion of correctly predicted positive points out of all the actual positive points.

TNR: Tells us the proportion of correctly predicted negative points out of all the actual negative points.

FNR: Tells us the proportion of actual positive points that were wrongly predicted as negative.

FPR: Tells us the proportion of actual negative points that were wrongly predicted as positive.
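To make these formulas concrete, here is a short sketch that plugs in the counts from the dumb-model example above (TN = 90, FN = 10, FP = 0, TP = 0); the numbers are assumptions carried over from that example, not from a real dataset.

```python
# Computing the rates directly from the four confusion-matrix cells.
TN, FN, FP, TP = 90, 10, 0, 0   # the dumb-model example from earlier

accuracy = (TP + TN) / (TP + TN + FP + FN)
TPR = TP / (TP + FN)   # true positive rate (recall)
TNR = TN / (TN + FP)   # true negative rate
FPR = FP / (TN + FP)   # false positive rate
FNR = FN / (TP + FN)   # false negative rate

print(accuracy, TPR, TNR, FPR, FNR)   # 0.9 0.0 1.0 0.0 1.0
```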

The debate over choosing between precision and recall is old and often confusing, so let's understand it with some simple examples.

Precision:

Precision tells us, out of all the points the model has predicted as positive, what proportion is actually positive. So it focuses on False Positives: the lower the false positives, the higher the precision.

precision = TP/(TP+FP)

Recall:

Recall tells us, out of all the actual positive points, what proportion the model has correctly predicted as positive. It is nothing but the true positive rate. So it focuses on False Negatives: the lower the false negatives, the higher the recall.

recall = TP/(TP+FN)
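As a quick sanity check, the following sketch computes precision and recall both by hand and with scikit-learn on a small set of invented labels:

```python
# Invented labels for illustration: TP=2, FN=2, FP=1, TN=5.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

TP, FN, FP = 2, 2, 1
print(TP / (TP + FP), precision_score(y_true, y_pred))   # 0.666...  0.666...
print(TP / (TP + FN), recall_score(y_true, y_pred))      # 0.5       0.5
```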

Now the question is when to choose what?

Let’s see with some examples:

Use Case 1: We have a binary classification model which takes a website URL as input and tells whether the website is safe or unsafe for children. So the output is (safe, unsafe).

Labels: safe: 1, unsafe: 0

Now let's view this case as a confusion matrix and understand what we should focus on.

Use Case 1 with confusion matrix

In this case, our main focus is that any URL which is unsafe for kids should not be classified as safe. On the other hand, if a safe URL is classified as unsafe, that is not a big problem.

So we understood that we need to reduce the number of false "safe" predictions (URLs wrongly classified as safe). In the above image, we can see that a false "safe" is nothing but a False Positive (FP), so we just need to check which of precision and recall focuses more on reducing False Positives.

From the definitions above, we know that precision focuses on False Positives, hence for this use case we can consider precision a good performance measure.
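As a small illustration (the predictions below are invented, using the article's encoding safe = 1, unsafe = 0), precision directly reflects how many "safe" predictions were actually unsafe sites:

```python
# Sketch with invented data: safe = 1, unsafe = 0.
# A false positive here means an unsafe URL was shown to a child as safe.
from sklearn.metrics import precision_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # 3 safe sites, 5 unsafe sites
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]   # two unsafe sites predicted safe (FP = 2)

print(precision_score(y_true, y_pred))   # 0.5 -> half the "safe" predictions are wrong
```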

Use Case 2: We have a binary classification model which takes a patient's diagnostic report as input and tells whether the patient has cancer or not. So the output is (has_cancer, no_cancer).

Labels: has_cancer: 1, no_cancer: 0

Now let's view this case as a confusion matrix as well and understand what we should focus on.

Use Case 2 with confusion matrix

In this case, our main focus is that no one who has cancer should be classified as no_cancer. This is a critical case: a patient who has cancer but is classified as no_cancer is in danger. On the other hand, a person who does not have cancer but is classified as has_cancer can go for further tests and be sure about it.

So we understood that we need to reduce the number of false "no_cancer" predictions (patients wrongly classified as no_cancer). In the above image, we can see that a false "no_cancer" is nothing but a False Negative (FN), so we just need to check which of precision and recall focuses more on reducing False Negatives.

From the definitions above, we know that recall focuses on False Negatives, hence for this use case we can consider recall a good performance measure.
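Similarly, a small sketch with invented labels (has_cancer = 1, no_cancer = 0) shows how recall captures missed cancer cases:

```python
# Sketch with invented data: has_cancer = 1, no_cancer = 0.
# A false negative here means a cancer patient was told they are healthy.
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # 4 patients with cancer, 4 without
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]   # one cancer patient missed (FN = 1)

print(recall_score(y_true, y_pred))   # 0.75 -> 25% of cancer patients are missed
```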

In both of the above use cases, either precision alone or recall alone was helpful. But what if we need both to be high? In such cases, we can use the F1-Score.

F1-Score:

The F1-Score is the harmonic mean of precision and recall. Since it combines both precision and recall, it is considered a good single measure.

F1-score = 2 * (precision * recall)/(precision + recall)
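Here is a minimal sketch (the precision, recall, and label values below are assumptions chosen for illustration) showing the formula and scikit-learn's f1_score agreeing:

```python
# F1 from the formula vs. scikit-learn's f1_score.
# The labels below are invented so that precision = 0.5 and recall = 0.75.
from sklearn.metrics import f1_score

precision, recall = 0.5, 0.75
print(2 * precision * recall / (precision + recall))   # 0.6

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 1, 0]   # TP=3, FN=1, FP=3 -> precision=0.5, recall=0.75
print(f1_score(y_true, y_pred))     # 0.6
```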

Now one question might be on your mind: why is the F1-Score defined as a harmonic mean? Let's find out.

There are different kinds of means we could use, so let's see what each of them produces for precision = 0.1 and recall = 0.9. We have deliberately taken a low precision and a high recall, so the final score should not be good enough to accept the model, since both precision and recall should be high for the model to be selected.

Arithmetic Mean of Precision and Recall:

AM = (precision + recall)/2

AM = (0.1+0.9)/2 = 0.5

Geometric Mean of Precision and Recall:

GM = sqrt(precision*recall)

GM = sqrt(0.1*0.9) = 0.3

Harmonic mean of Precision and Recall:

HM = 2 * (precision * recall)/(precision + recall)

HM = 2 * (0.1 * 0.9)/(0.1 + 0.9) = 0.18

Of the three means, the harmonic mean gives the lowest (most penalized) result. That is exactly why we use it: if either precision or recall is very low, the harmonic mean drags the overall score down.
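Here is a tiny sketch that reproduces the three calculations above for precision = 0.1 and recall = 0.9:

```python
# Comparing the three means for precision = 0.1, recall = 0.9.
from math import sqrt

p, r = 0.1, 0.9
am = (p + r) / 2            # arithmetic mean
gm = sqrt(p * r)            # geometric mean
hm = 2 * p * r / (p + r)    # harmonic mean (= F1-score)

print(am, gm, hm)           # 0.5  0.3  0.18 -> the harmonic mean penalizes the most
```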

Hope this article cleared your doubts related to precision, recall, and F1-Score.

