Confusion Matrix & Classification Metrics Cheat Sheet

This cheat sheet covers how to summarize and evaluate classification models using a confusion matrix. Students need it to understand how predictions can be correct or incorrect in different ways. It is especially useful in statistics, data science, machine learning, and real-world decision problems such as medical testing or spam detection.

The core idea is to compare predicted classes with actual classes, then count true positives, false positives, true negatives, and false negatives. From these counts, you can calculate metrics such as accuracy, precision, recall, specificity, and $F_1$ score. These metrics answer different questions, so the best metric depends on the cost of each kind of error.

Key Facts

A confusion matrix organizes classification results into $TP$, $FP$, $TN$, and $FN$ by comparing predicted labels with actual labels.
Accuracy measures the overall fraction of correct predictions and is calculated by $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$.
Precision measures how many predicted positives were actually positive and is calculated by $\text{Precision} = \frac{TP}{TP + FP}$.
Recall, also called sensitivity, measures how many actual positives were correctly found and is calculated by $\text{Recall} = \frac{TP}{TP + FN}$.
Specificity measures how many actual negatives were correctly identified and is calculated by $\text{Specificity} = \frac{TN}{TN + FP}$.
The false positive rate is calculated by $\text{FPR} = \frac{FP}{FP + TN}$, and it equals $1 - \text{Specificity}$.
The false negative rate is calculated by $\text{FNR} = \frac{FN}{FN + TP}$, and it equals $1 - \text{Recall}$.
The $F_1$ score balances precision and recall using $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$.
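
The formulas above can be sketched as one small Python function. This is a minimal illustration, not a reference implementation: the function name `classification_metrics`, the choice to return `None` for an undefined ratio, and the example counts are all assumptions made for this sketch.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return the Key Facts metrics from the four confusion-matrix counts."""

    def ratio(numerator, denominator):
        # A ratio is undefined when its group is empty, so return None
        # (an assumption of this sketch, not a universal convention).
        return numerator / denominator if denominator else None

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
        "f1": f1,
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=30, fp=10, tn=50, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, the false positive rate and specificity add up to one, and so do the false negative rate and recall, which matches the two identities listed above.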

Vocabulary

Confusion matrix: A table that compares predicted classes with actual classes to show correct and incorrect classification results.
True positive: A true positive, written $TP$, is a case where the model predicts positive and the actual class is positive.
False positive: A false positive, written $FP$, is a case where the model predicts positive but the actual class is negative.
True negative: A true negative, written $TN$, is a case where the model predicts negative and the actual class is negative.
False negative: A false negative, written $FN$, is a case where the model predicts negative but the actual class is positive.
Precision: Precision is the proportion of positive predictions that are correct, calculated by $\frac{TP}{TP + FP}$.
Recall: Recall is the proportion of actual positives that are correctly identified, calculated by $\frac{TP}{TP + FN}$.
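
The four outcome definitions can be turned into a counting routine. The sketch below assumes labels are stored as two equal-length lists and that `1` marks the positive class; the function name `confusion_counts` and the sample labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN, FN by comparing each prediction with the truth."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    tp = fp = tn = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical labels: 1 = positive class, 0 = negative class.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
```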

Common Mistakes to Avoid

Confusing false positives with false negatives is wrong because they describe opposite error types. A false positive predicts positive when the truth is negative, while a false negative predicts negative when the truth is positive.
Using accuracy alone with imbalanced data is misleading because a model can look accurate by mostly predicting the majority class. For rare positives, precision, recall, and $F_1$ score often give better information.
Putting $FP$ and $FN$ in the wrong cells changes every metric that uses them. Always check whether rows represent actual labels or predicted labels before calculating.
Treating precision and recall as the same metric is wrong because precision focuses on positive predictions, while recall focuses on actual positives. A model can have high precision but low recall, or high recall but low precision.
Raising or lowering the classification threshold without checking error tradeoffs is risky because it usually changes both $FP$ and $FN$. A lower threshold often increases recall but may decrease precision.
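
The cell-placement mistake can be made concrete with a short sketch. It assumes one particular layout (rows are actual classes, columns are predicted classes, negative class first) and uses hypothetical counts. Reading a transposed matrix with the same rule silently swaps $FP$ and $FN$, so precision and recall trade places.

```python
# Hypothetical counts for illustration.
tp, fp, tn, fn = 30, 20, 45, 5

# Layout assumed here: rows = actual class, columns = predicted class,
# with the negative class listed first.
matrix = [
    [tn, fp],  # actual negative: predicted negative, predicted positive
    [fn, tp],  # actual positive: predicted negative, predicted positive
]


def precision_recall(m):
    """Read precision and recall assuming rows = actual, columns = predicted."""
    tp_cell, fp_cell, fn_cell = m[1][1], m[0][1], m[1][0]
    return tp_cell / (tp_cell + fp_cell), tp_cell / (tp_cell + fn_cell)


# The same data stored with rows = predicted, columns = actual.
transposed = [list(row) for row in zip(*matrix)]

print(precision_recall(matrix))      # correct reading
print(precision_recall(transposed))  # FP and FN swapped, so the values swap
```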

Practice Questions

1. A classifier has $TP = 45$, $TN = 40$, $FP = 10$, and $FN = 5$. Calculate the accuracy.
2. A medical test gives $TP = 80$, $FP = 20$, and $FN = 10$. Calculate the precision and recall.
3. A model has precision $0.75$ and recall $0.60$. Calculate the $F_1$ score using $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$.
4. For a disease screening test, explain whether false positives or false negatives are usually more dangerous, and justify which metric should be emphasized.

Understanding Confusion Matrix & Classification Metrics

A classifier often begins by giving each case a score from zero to one. A higher score means the model sees stronger evidence for the positive class. A threshold turns that score into a final label.

For example, a school email filter might mark messages above a chosen score as spam. Moving the threshold lower catches more spam, but it can wrongly block more ordinary messages. Moving it higher protects ordinary messages, but some spam gets through.

The counts in the matrix change whenever the threshold changes. This is why a model should not be judged only by the labels it produces at one default threshold. Its decision rule must match the situation where it will be used.
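
A short sketch shows how the counts move with the threshold. The scores and labels are invented for the spam-filter example, and the helper name `counts_at` is hypothetical. A message is flagged when its score is above the threshold.

```python
# Hypothetical spam scores (0 to 1) and true labels (1 = spam, 0 = ordinary).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]


def counts_at(threshold):
    """Flag a message as spam when its score is above the threshold."""
    tp = fp = tn = fn = 0
    for score, truth in zip(scores, labels):
        predicted = 1 if score > threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1:
            fp += 1
        elif truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


for threshold in (0.25, 0.50, 0.80):
    tp, fp, tn, fn = counts_at(threshold)
    print(f"threshold={threshold:.2f}  TP={tp} FP={fp} TN={tn} FN={fn}")
```

On this toy data the lowest threshold catches all four spam messages but blocks two ordinary ones, while the highest threshold blocks no ordinary messages but lets two spam messages through.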

The seriousness of an error depends on the context. In a screening test for a dangerous disease, a missed positive result can delay care. High recall is often important at this first stage, since a positive screening result may then be checked with a more accurate test.

In an automated system that removes posts or rejects loan applications, a false positive can unfairly affect someone. Precision and specificity become especially important there.

No metric can decide what is fair or safe by itself. People must state which mistakes cause the greatest harm before choosing a threshold or comparing models.

Accuracy can give a misleading picture when one class is rare. Imagine one thousand transactions where only ten are fraudulent. A model that labels every transaction as normal gets nine hundred ninety predictions right.

Its accuracy is 99 percent, yet it finds no fraud at all. Looking at recall, which is zero here, reveals that failure. Looking at precision shows whether fraud alerts are usually worth investigating.
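
The fraud example can be reproduced directly. The sketch below builds the one thousand labels described above and scores the always-normal model. Reporting precision as `None` when no alerts are raised is a choice made for this sketch.

```python
# 1,000 transactions, 10 of them fraudulent (1 = fraud, 0 = normal).
actual = [1] * 10 + [0] * 990
# A model that labels every transaction as normal.
predicted = [0] * 1000

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn)
# Precision is TP / (TP + FP); with no alerts raised the denominator is zero,
# so it is reported as undefined (None) here.
precision = tp / (tp + fp) if (tp + fp) else None

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision}")
```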

The $F_1$ score is useful when both precision and recall matter, since it becomes low if either one is low. Still, it does not include true negatives, so it may not fit every problem. Always inspect the actual counts, not only rounded metric values.
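
Both claims in the paragraph above can be checked with a short worked calculation. Substituting $\text{Precision} = \frac{TP}{TP + FP}$ and $\text{Recall} = \frac{TP}{TP + FN}$ into the $F_1$ formula gives a form with no $TN$ term:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$

As an illustration with assumed values (not taken from any dataset), a precision of $0.90$ and a recall of $0.10$ give

$$F_1 = \frac{2 \cdot 0.90 \cdot 0.10}{0.90 + 0.10} = \frac{0.18}{1.00} = 0.18$$

which is far below the simple average of $0.50$.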

Students should first identify which class has been called positive. Positive does not mean good. It simply names the outcome the model is trying to detect, such as a fault, a disease, or a spam message.

Then read each error as a sentence about a real case. A false negative is a case that was truly positive but was missed. A false positive is a case that was truly negative but was flagged.

Check the denominator in every calculation because it tells you what group is being measured. Precision starts with predicted positives. Recall starts with actual positives.

Finally, use data not used for training when reporting performance. Testing on training data can make a model seem better than it will be with new cases.
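
A toy sketch can show why training data flatters a model. It assumes a 1-nearest-neighbor classifier, chosen here only because it memorizes its training points. The data values and the names `train`, `held_out`, `predict`, and `accuracy` are all hypothetical.

```python
# Hypothetical one-feature data: (feature value, label). Not from the source.
train = [(0.10, 0), (0.20, 0), (0.35, 1), (0.40, 0),
         (0.60, 1), (0.70, 1), (0.80, 0), (0.90, 1)]
held_out = [(0.12, 0), (0.33, 0), (0.45, 1),
            (0.66, 1), (0.79, 1), (0.92, 1)]


def predict(x):
    """1-nearest-neighbor: copy the label of the closest training point."""
    _, label = min(train, key=lambda point: abs(point[0] - x))
    return label


def accuracy(data):
    """Fraction of cases in data whose predicted label matches the truth."""
    correct = sum(predict(x) == label for x, label in data)
    return correct / len(data)


print(f"accuracy on training data: {accuracy(train):.2f}")
print(f"accuracy on held-out data: {accuracy(held_out):.2f}")
```

Every training point is its own nearest neighbor, so the training accuracy is perfect by construction, while the held-out cases expose the errors.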
