Confusion Matrix, ROC & Threshold Explorer

Pick a classifier quality, then drag the decision threshold across a dataset of scored predictions. The confusion matrix, precision, recall, specificity, F1, and the ROC marker all update instantly, so you can see exactly how moving the threshold trades false positives against false negatives. Useful for AP Statistics, intro data science, and machine learning courses.

Score distribution

56 examples · 28 positive · 28 negative
[Plot: each example drawn as a dot along the model score axis, 0.00 to 1.00. Predict positive when score ≥ threshold. Threshold shown: 0.50.]
Legend: actual positive (class 1) · actual negative (class 0) · misclassified (dark outline)
Preset dataset

Positives score high and negatives score low with little overlap. The threshold can separate the classes cleanly.

Examples with a score at or above the threshold are predicted positive. Lower it to catch more positives, raise it to cut false alarms.

Confusion matrix

Rows are the actual class; columns are the predicted class.

                   Predicted positive         Predicted negative
Actual positive    TP = 26 (true positive)    FN = 2 (false negative)
Actual negative    FP = 4 (false positive)    TN = 24 (true negative)

Green cells are correct predictions. Red cells are errors. Move the threshold to trade false positives against false negatives.
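The four counts can be reproduced directly from the scores. The following is a minimal sketch, not the tool's own code: the two lists are transcribed from the tool's score distribution plot (rounded to two decimals, as displayed), and the helper name `confusion_counts` is an assumption made for illustration.

```python
# Scores transcribed from the score distribution plot (rounded to 2 decimals).
POS_SCORES = [
    0.69, 0.78, 0.59, 0.74, 0.71, 0.90, 0.78, 0.70, 0.63, 0.78, 0.73, 0.63, 0.89, 0.77,
    0.80, 0.95, 0.85, 0.76, 0.45, 0.81, 0.65, 0.76, 0.85, 0.80, 0.69, 0.42, 0.65, 0.93,
]
NEG_SCORES = [
    0.47, 0.63, 0.31, 0.27, 0.49, 0.54, 0.30, 0.56, 0.40, 0.40, 0.39, 0.33, 0.32, 0.33,
    0.39, 0.49, 0.25, 0.18, 0.32, 0.33, 0.13, 0.30, 0.28, 0.62, 0.25, 0.29, 0.24, 0.32,
]


def confusion_counts(pos_scores, neg_scores, threshold):
    """Return (TP, FN, FP, TN), predicting positive when score >= threshold."""
    tp = sum(s >= threshold for s in pos_scores)  # positives at or above the line
    fn = len(pos_scores) - tp                     # positives that fell below it
    fp = sum(s >= threshold for s in neg_scores)  # negatives at or above the line
    tn = len(neg_scores) - fp                     # negatives correctly below it
    return tp, fn, fp, tn


tp, fn, fp, tn = confusion_counts(POS_SCORES, NEG_SCORES, 0.50)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")  # TP=26 FN=2 FP=4 TN=24
```

Calling `confusion_counts` with a different threshold mimics dragging the slider: a lower value moves examples from the predicted-negative column to the predicted-positive column.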

Metrics at this threshold

  • Accuracy: 0.893 (89.3%). Share of all predictions that are correct.
  • Precision: 0.867 (86.7%). Of predicted positives, the share that are truly positive.
  • Recall (Sensitivity, TPR): 0.929 (92.9%). Of actual positives, the share the model catches.
  • Specificity (TNR): 0.857 (85.7%). Of actual negatives, the share correctly ruled out.
  • F1 score: 0.897 (89.7%). Harmonic mean of precision and recall.
  • False positive rate (FPR): 0.143 (14.3%). Of actual negatives, the share wrongly flagged positive.
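Each of these values follows from the four confusion-matrix counts at threshold 0.50 (TP 26, FN 2, FP 4, TN 24). As a rough check, here is a short sketch; the function name `metrics` is hypothetical, and the code assumes at least one predicted positive and one actual example of each class so that no denominator is zero.

```python
def metrics(tp, fn, fp, tn):
    """Compute the six displayed metrics from the four confusion-matrix counts.

    Assumes every denominator is non-zero (some predicted positives, and at
    least one actual positive and one actual negative).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
    }


m = metrics(tp=26, fn=2, fp=4, tn=24)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.893
# precision: 0.867
# recall: 0.929
# specificity: 0.857
# f1: 0.897
# fpr: 0.143
```

Note that FPR and specificity always sum to 1, since both are shares of the same actual-negative group.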

ROC curve

AUC 0.978
Current threshold · FPR 0.143 · TPR 0.929
[Plot: ROC curve. x-axis: false positive rate (1 − specificity). y-axis: true positive rate (recall). Both axes run from 0.00 to 1.00.]

The dashed diagonal is a random classifier (AUC 0.5). The orange dot marks the current threshold. The closer a curve hugs the top-left corner, the higher its AUC.

Reading the tradeoff

Lowering the threshold predicts positive more often. That raises recall (you catch more true positives) but usually lowers precision (more false positives slip in). Raising the threshold does the reverse. There is rarely a single threshold that maximizes both at once, so the right choice depends on the cost of each error type.
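The tradeoff is easy to see on a toy example. The sketch below uses eight hypothetical scores (not the tool's dataset) and a hypothetical helper `precision_recall`, and evaluates three thresholds from high to low.

```python
# Hypothetical scores, chosen only to illustrate the tradeoff.
pos_scores = [0.9, 0.8, 0.6, 0.4]  # actual positives
neg_scores = [0.7, 0.5, 0.3, 0.1]  # actual negatives


def precision_recall(threshold):
    """Return (precision, recall), predicting positive when score >= threshold."""
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    # Precision is undefined when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / len(pos_scores)
    return precision, recall


results = {t: precision_recall(t) for t in (0.75, 0.55, 0.35)}
for t, (p, r) in results.items():
    print(f"threshold {t:.2f}: precision {p:.3f}, recall {r:.3f}")
# threshold 0.75: precision 1.000, recall 0.500
# threshold 0.55: precision 0.750, recall 0.750
# threshold 0.35: precision 0.667, recall 1.000
```

In this toy data each step down in threshold gains recall and loses precision, which is the usual pattern, though on real data precision does not have to fall at every step.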

The ROC curve traces every possible threshold as a point of false positive rate versus true positive rate. The area under it (AUC) is a single number that summarizes the classifier across all thresholds, before you commit to any one of them. It ranges from 0 to 1: 0.5 is no better than chance, 1.0 is perfect ranking, and values below 0.5 mean the scores rank negatives above positives more often than not.
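One way to build the curve is to sweep the threshold over every distinct score and record (FPR, TPR) at each step, then integrate with the trapezoidal rule. The sketch below does that for the scores transcribed from the score distribution plot. It is an assumption about how such a curve can be computed, not the tool's actual implementation, and because the displayed scores are rounded to two decimals the result could differ slightly from the tool's internal value. The names `roc_points` and `auc_trapezoid` are hypothetical.

```python
# Scores transcribed from the score distribution plot (rounded to 2 decimals).
POS_SCORES = [
    0.69, 0.78, 0.59, 0.74, 0.71, 0.90, 0.78, 0.70, 0.63, 0.78, 0.73, 0.63, 0.89, 0.77,
    0.80, 0.95, 0.85, 0.76, 0.45, 0.81, 0.65, 0.76, 0.85, 0.80, 0.69, 0.42, 0.65, 0.93,
]
NEG_SCORES = [
    0.47, 0.63, 0.31, 0.27, 0.49, 0.54, 0.30, 0.56, 0.40, 0.40, 0.39, 0.33, 0.32, 0.33,
    0.39, 0.49, 0.25, 0.18, 0.32, 0.33, 0.13, 0.30, 0.28, 0.62, 0.25, 0.29, 0.24, 0.32,
]


def roc_points(pos_scores, neg_scores):
    """Return (FPR, TPR) points, one per distinct score used as threshold."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points  # the last point is (1.0, 1.0): everything predicted positive


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


points = roc_points(POS_SCORES, NEG_SCORES)
auc = auc_trapezoid(points)
print(f"AUC = {auc:.3f}")  # AUC = 0.978
```

The point with FPR 4/28 and TPR 26/28 appears in `points`, matching the marker shown for threshold 0.50.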

Precision (P) = TP / (TP + FP)
Recall (R) = TP / (TP + FN)
F1 = 2 · P · R / (P + R)

Reference Guide

The confusion matrix

A binary classifier sorts every example into one of four buckets, defined by the true label and the predicted label.

  • True positive (TP). Actually positive, predicted positive.
  • False positive (FP). Actually negative, predicted positive. A false alarm.
  • False negative (FN). Actually positive, predicted negative. A miss.
  • True negative (TN). Actually negative, predicted negative.

Every metric in this tool is built from these four counts. The threshold decides which side of the line each example lands on.
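The four definitions above map onto a small lookup. This is an illustrative sketch with a hypothetical function name `bucket` and made-up label pairs, using 1 for positive and 0 for negative.

```python
from collections import Counter


def bucket(actual, predicted):
    """Name the confusion-matrix cell for one example (1 = positive, 0 = negative)."""
    if predicted == 1:
        return "TP" if actual == 1 else "FP"  # flagged: hit or false alarm
    return "FN" if actual == 1 else "TN"      # not flagged: miss or correct rejection


# Hypothetical (actual, predicted) label pairs.
pairs = [(1, 1), (0, 1), (1, 0), (0, 0), (1, 1), (0, 0)]
counts = Counter(bucket(a, p) for a, p in pairs)
print(dict(counts))  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```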

Precision vs recall

Precision is TP / (TP + FP), the share of positive predictions that are correct. Recall is TP / (TP + FN), the share of real positives the model catches.

Lowering the threshold raises recall but tends to lower precision, because more examples get flagged positive. Raising it does the reverse. The F1 score is the harmonic mean of the two, which rewards a balance rather than one extreme.
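To see why the harmonic mean rewards balance, compare two hypothetical operating points with the same arithmetic mean of precision and recall (0.55). The function name `f1_score` here is a local helper written for this sketch, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Both pairs average to 0.55 arithmetically; the values are hypothetical.
lopsided = f1_score(1.00, 0.10)  # perfect precision, very poor recall
balanced = f1_score(0.55, 0.55)  # moderate on both

print(f"lopsided F1: {lopsided:.3f}")  # lopsided F1: 0.182
print(f"balanced F1: {balanced:.3f}")  # balanced F1: 0.550
```

The harmonic mean is pulled toward the smaller of the two values, so a model cannot earn a high F1 by excelling at one metric alone.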

On the imbalanced preset, precision drops fast even with good recall, because a small number of true positives is swamped by false positives from the large negative class.
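The effect can be checked with arithmetic. The sketch below holds recall and false positive rate fixed and changes only the share of positives in the data. The numbers are hypothetical and are not taken from the tool's imbalanced preset; `precision_at` is a helper name invented for this example.

```python
def precision_at(recall, fpr, prevalence):
    """Expected precision given recall, FPR, and the share of actual positives."""
    true_pos = recall * prevalence       # share of all examples that are TP
    false_pos = fpr * (1 - prevalence)   # share of all examples that are FP
    return true_pos / (true_pos + false_pos)


# Same classifier behavior (recall 0.9, FPR 0.1), different class balance.
balanced = precision_at(recall=0.9, fpr=0.1, prevalence=0.50)
imbalanced = precision_at(recall=0.9, fpr=0.1, prevalence=0.01)

print(f"50% positives: precision {balanced:.3f}")   # 50% positives: precision 0.900
print(f" 1% positives: precision {imbalanced:.3f}")  #  1% positives: precision 0.083
```

With 1% positives, a 10% false positive rate applied to the other 99% of examples produces about eleven false alarms for every true positive.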

The ROC curve and AUC

The ROC curve plots true positive rate against false positive rate at every threshold. Each point is one possible decision line. A curve that bends toward the top-left corner separates the classes well.

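The area under the curve (AUC) collapses the whole curve into one number between 0 and 1. AUC 0.5 is the dashed diagonal of a coin flip, and any useful classifier scores above it; a value below 0.5 means the ranking is systematically inverted. AUC near 1.0 means almost every positive scores above almost every negative.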

Because AUC averages over all thresholds, it compares classifiers before you pick a threshold, which makes it handy for model selection.
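AUC can also be read as a ranking statistic: the share of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below compares two hypothetical models this way; the scores and the name `auc_pairwise` are made up for illustration.

```python
def auc_pairwise(pos_scores, neg_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores from two models on the same three positives and three negatives.
model_a = auc_pairwise([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
model_b = auc_pairwise([0.9, 0.5, 0.4], [0.7, 0.6, 0.2])

print(f"model A AUC: {model_a:.3f}")  # model A AUC: 0.889
print(f"model B AUC: {model_b:.3f}")  # model B AUC: 0.556
```

Model A ranks 8 of 9 pairs correctly and model B only 5 of 9, so A would be preferred on this measure without any threshold having been chosen.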

Choosing a threshold for a real application

AUC measures ranking quality, but a deployed system still needs one threshold. The right value depends on which error costs more.

  • Disease screening. A missed case (FN) is costly, so favor recall with a lower threshold.
  • Spam filtering. A blocked real email (FP) annoys users, so favor precision with a higher threshold.
  • Fraud alerts. Analysts have limited time, so tune the threshold to the review capacity.

Move the slider and watch the confusion matrix to find the operating point that fits the costs of your problem.
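If the two error costs can be put on a common scale, the search for an operating point can be automated. The following is a sketch under stated assumptions: the scores and the cost ratios are hypothetical, the helper `best_threshold` is invented for this example, and only observed scores are tried as candidate thresholds.

```python
# Hypothetical scores and costs, for illustration only.
pos_scores = [0.9, 0.8, 0.6, 0.4]  # actual positives
neg_scores = [0.7, 0.5, 0.3, 0.1]  # actual negatives


def best_threshold(cost_fn, cost_fp):
    """Return (threshold, total_cost) with the lowest total error cost."""
    candidates = sorted(set(pos_scores) | set(neg_scores))

    def total_cost(t):
        fn = sum(s < t for s in pos_scores)   # positives missed at this threshold
        fp = sum(s >= t for s in neg_scores)  # negatives flagged at this threshold
        return cost_fn * fn + cost_fp * fp

    return min(((t, total_cost(t)) for t in candidates), key=lambda tc: tc[1])


# Screening-style costs: a miss is 10x worse than a false alarm.
print(best_threshold(cost_fn=10, cost_fp=1))  # (0.4, 2)
# Spam-style costs: a false alarm is 10x worse than a miss.
print(best_threshold(cost_fn=1, cost_fp=10))  # (0.8, 2)
```

The same data yields a low threshold when misses are expensive and a high one when false alarms are expensive, which mirrors the screening and spam examples above.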
