Evaluation metrics for classification
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
A classifier that gets 95% accuracy sounds great until you realize it could be doing nothing useful at all. Evaluation metrics exist because a single number rarely tells the full story. You need to know what kinds of mistakes your model is making, not just how many.
This article covers the standard toolkit: the confusion matrix, accuracy, precision, recall, F1, ROC-AUC, and how these extend to multi-class problems. By the end, you will know which metric to pick for any given problem and, more importantly, why.
The confusion matrix
Every classification metric starts here. The confusion matrix is a 2x2 table (for binary classification) that breaks down your model’s predictions into four buckets:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
True Positive (TP): The model said “positive” and it was right.
True Negative (TN): The model said “negative” and it was right.
False Positive (FP): The model said “positive” but it was wrong. Sometimes called a Type I error, or a false alarm.
False Negative (FN): The model said “negative” but it was wrong. Sometimes called a Type II error, or a miss.
The names make sense if you read them as adjectives on the prediction. “False Positive” means the model predicted positive, and that prediction was false.
Why does this matter? Because different applications care about different quadrants. A spam filter that lets spam through (FN) is annoying. A spam filter that blocks your important emails (FP) is dangerous. The confusion matrix forces you to look at both failure modes separately.
from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[4 1]
#  [1 4]]
# Row 0 = actual negative, Row 1 = actual positive
# TN=4, FP=1, FN=1, TP=4
Accuracy
Accuracy is the simplest metric. It measures the fraction of predictions your model got right:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
If you have 100 samples and your model classifies 90 correctly, accuracy is 0.90. Straightforward.
When accuracy works: balanced datasets where positive and negative classes are roughly equal in size, and where both types of errors are equally costly.
When accuracy fails: imbalanced datasets. This is the big one. If 95% of your data belongs to one class, a model that always predicts that class gets 95% accuracy while being completely useless. We will work through a concrete example of this below.
Precision
Precision answers: “Of all the items the model flagged as positive, how many actually were positive?”

$$\text{Precision} = \frac{TP}{TP + FP}$$
High precision means few false alarms. When your model says “positive,” you can trust it.
When to optimize for precision: when false positives are expensive. Examples include a system recommending surgery (you do not want unnecessary operations), or a search engine marking pages as malicious (blocking a legitimate site is bad for business).
Recall (Sensitivity)
Recall answers: “Of all the items that were actually positive, how many did the model find?”

$$\text{Recall} = \frac{TP}{TP + FN}$$
High recall means the model rarely misses a positive case. It catches most of what it should.
When to optimize for recall: when false negatives are expensive. Cancer screening is the classic example. Missing a cancer diagnosis (FN) is far worse than sending a healthy patient for additional tests (FP). Fraud detection is another. You would rather flag some legitimate transactions than let fraudulent ones slip through.
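Both metrics are one-liners in scikit-learn. As a quick sketch, here they are applied to the same toy labels used in the confusion-matrix code above:

```python
from sklearn.metrics import precision_score, recall_score

# Same toy labels as the confusion-matrix example above
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# TP=4, FP=1, FN=1, so precision = 4/5 and recall = 4/5
print(precision_score(y_true, y_pred))  # 0.8
print(recall_score(y_true, y_pred))     # 0.8
```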
Worked example 1: computing all metrics from a confusion matrix
A spam classifier produces the following results on a test set of 100 emails:
- TP = 40 (spam emails correctly caught)
- FP = 10 (legitimate emails wrongly flagged as spam)
- FN = 5 (spam emails that got through)
- TN = 45 (legitimate emails correctly delivered)
Let’s compute every metric step by step.
Accuracy:

$$\text{Accuracy} = \frac{TP + TN}{\text{Total}} = \frac{40 + 45}{100} = 0.85$$
The model gets 85% of predictions right overall.
Precision:

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{40}{40 + 10} = 0.80$$
When the model says “spam,” it is correct 80% of the time. One in five flagged emails is actually legitimate.
Recall:

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{40}{40 + 5} \approx 0.889$$
The model catches about 89% of actual spam. Roughly 1 in 9 spam emails slips through.
F1 Score (we will define this properly in the next section):

$$F_1 = 2 \cdot \frac{0.80 \times 0.889}{0.80 + 0.889} \approx 0.842$$
So our spam classifier has decent recall (catches most spam) but lower precision (sometimes blocks good emails). Whether this tradeoff is acceptable depends on your users. Most people would rather see occasional spam than lose important emails, so you might want to push precision higher even if recall drops a bit.
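To make the arithmetic concrete, here is the same computation in plain Python, using counts (TP=40, FP=10, FN=5, TN=45) consistent with the metrics above:

```python
# Confusion-matrix counts for the spam example
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.850 precision=0.800 recall=0.889 f1=0.842
```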
F1 score
The F1 score combines precision and recall into one number using the harmonic mean:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Why the harmonic mean and not the arithmetic mean? The arithmetic mean of 0.0 and 1.0 is 0.5, which sounds decent. But a classifier with 0% precision (or 0% recall) is useless. The harmonic mean of 0.0 and 1.0 is 0.0, which correctly reflects that. The harmonic mean punishes extreme imbalances between precision and recall. Both need to be reasonably high for F1 to be high.
More generally, the $F_\beta$ score lets you weight precision and recall differently:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

When $\beta = 1$, you get the standard F1. When $\beta = 2$, recall is weighted twice as much as precision (useful in medical screening). When $\beta = 0.5$, precision gets more weight.
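scikit-learn exposes this via `fbeta_score`. A small sketch on made-up labels where recall (0.6) is lower than precision (0.75), so the recall-weighted F2 comes out below F1:

```python
from sklearn.metrics import f1_score, fbeta_score

# Toy labels: TP=3, FP=1, FN=2 -> precision = 0.75, recall = 0.6
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]

print(f"{f1_score(y_true, y_pred):.3f}")                   # 0.667
print(f"{fbeta_score(y_true, y_pred, beta=1.0):.3f}")      # 0.667 (same as F1)
print(f"{fbeta_score(y_true, y_pred, beta=2.0):.3f}")      # 0.625 (recall drags it down)
```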
Worked example 2: when accuracy misleads
Consider a disease screening test. Out of 1000 patients, 50 actually have the disease and 950 are healthy. Your model just predicts “healthy” for everyone. No machine learning, no features, just the word “healthy” every single time.
The confusion matrix looks like this:
- TP = 0 (caught zero sick patients)
- FP = 0 (never said anyone was sick)
- FN = 50 (missed all 50 sick patients)
- TN = 950 (correctly labeled all healthy patients)
Accuracy:

$$\text{Accuracy} = \frac{0 + 950}{1000} = 0.95$$
95% accuracy. Looks impressive on a dashboard.
Precision:

$$\text{Precision} = \frac{0}{0 + 0}$$
Undefined (the model never predicted positive). By convention, we set this to 0.
Recall:

$$\text{Recall} = \frac{0}{0 + 50} = 0$$
Zero. The model missed every single sick patient.
F1:

$$F_1 = 0$$
This is why accuracy alone is dangerous with imbalanced classes. A 95% accuracy score hides the fact that the model is completely failing at its actual job. Precision, recall, and F1 all correctly read as zero, revealing the problem immediately.
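You can reproduce this degenerate case in a few lines. A sketch with sklearn, where `zero_division=0` implements the set-undefined-to-zero convention:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 50 sick patients (1), 950 healthy (0); the "model" says healthy every time
y_true = [1] * 50 + [0] * 950
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))                    # 0.95
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))                      # 0.0
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```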
In practice, most interesting classification problems have some degree of class imbalance: fraud detection (rare fraud events), medical diagnosis (rare diseases), anomaly detection (rare anomalies). This is also related to the bias-variance tradeoff: a model that always predicts the majority class has extreme bias.
The precision-recall tradeoff
Most classifiers do not output a hard “yes” or “no.” They output a probability or score, and you choose a threshold to convert that into a decision. For instance, a logistic regression might output 0.73 for a given email, and you classify it as spam if the score exceeds 0.5.
Moving the threshold changes the balance between precision and recall:
- Raise the threshold (e.g., 0.5 to 0.8): the model becomes pickier about what it calls positive. Fewer false positives, so precision goes up. But it will also miss more true positives, so recall goes down.
- Lower the threshold (e.g., 0.5 to 0.2): the model flags more items as positive. It catches more true positives, so recall goes up. But it also catches more false positives, so precision drops.
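A quick way to see the tradeoff is to sweep the threshold yourself. The scores below are toy numbers, assumed purely for illustration:

```python
import numpy as np

# Toy data: higher score means "more likely positive"
y_true   = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1])
y_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9])

for threshold in (0.2, 0.5, 0.8):
    y_pred = (y_scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.2: precision=0.44, recall=1.00
# threshold=0.5: precision=0.60, recall=0.75
# threshold=0.8: precision=1.00, recall=0.50
```

As the threshold rises, precision climbs while recall falls, exactly the pattern described above.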
You can visualize this by plotting precision vs. recall at every possible threshold. This is the precision-recall curve. A perfect classifier hugs the top-right corner (high precision and high recall). A useless one sits near the bottom.
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
# y_scores are the model's predicted probabilities
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()
There is no free lunch here. You cannot have perfect precision and perfect recall unless your model is perfect. The right threshold depends on the cost of each error type in your specific application.
ROC curve and AUC
The ROC (Receiver Operating Characteristic) curve is another way to evaluate threshold-based classifiers. Instead of precision vs. recall, it plots:
- True Positive Rate (TPR) on the y-axis, which is the same as recall: $TPR = \frac{TP}{TP + FN}$
- False Positive Rate (FPR) on the x-axis: $FPR = \frac{FP}{FP + TN}$
You sweep across all possible thresholds, compute TPR and FPR at each one, and plot the curve.
How to read it:
- A perfect classifier goes straight up to (0, 1) and then across to (1, 1). It achieves 100% TPR with 0% FPR.
- A random classifier (coin flip) follows the diagonal from (0, 0) to (1, 1).
- A model below the diagonal is worse than random. (Just flip its predictions.)
AUC (Area Under the Curve) summarizes the ROC curve as a single number between 0 and 1.
- $AUC = 1.0$: perfect classifier.
- $AUC = 0.5$: random guessing.
- $AUC < 0.5$: worse than random (flip the labels).
AUC has a nice probabilistic interpretation: it is the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example.
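You can verify that interpretation numerically. The sketch below, on synthetic Gaussian scores, compares `roc_auc_score` against the fraction of (positive, negative) pairs ranked correctly:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic scores: positives drawn from a higher-mean distribution
y_true = np.array([0] * 80 + [1] * 20)
y_scores = np.concatenate([rng.normal(0.0, 1.0, 80), rng.normal(1.0, 1.0, 20)])

auc = roc_auc_score(y_true, y_scores)

# Fraction of (positive, negative) pairs where the positive scores higher,
# counting ties as half a correct ranking
pos, neg = y_scores[y_true == 1], y_scores[y_true == 0]
pairs = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()

print(np.isclose(auc, pairs))  # True
```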
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)
plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], "k--", label="Random")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
ROC-AUC vs. precision-recall: when to use which
ROC-AUC can be misleading with heavily imbalanced datasets. When the negative class is very large, even a small FPR translates to a huge number of false positives. The ROC curve can still look good because FPR stays low in relative terms.
The precision-recall curve is more informative in imbalanced settings because precision directly measures the quality of positive predictions, which is usually what you care about.
Rule of thumb: use precision-recall when the positive class is rare. Use ROC-AUC when classes are roughly balanced or when you need a threshold-independent comparison across models.
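To see the gap in practice, compare the two scores on a synthetic 1%-positive dataset. Here `average_precision_score` serves as a summary of the PR curve, analogous to ROC-AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)
# Heavily imbalanced synthetic data: 1% positive class
y_true = np.array([0] * 990 + [1] * 10)
y_scores = np.concatenate([rng.normal(0.0, 1.0, 990), rng.normal(2.0, 1.0, 10)])

roc = roc_auc_score(y_true, y_scores)
pr = average_precision_score(y_true, y_scores)  # summarizes the PR curve
print(f"ROC-AUC: {roc:.3f}, PR-AUC: {pr:.3f}")
# ROC-AUC looks strong here while PR-AUC is far lower on the same predictions
```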
Log loss
While not a confusion-matrix metric, cross-entropy loss (also called log loss) is the standard loss function for training classifiers and deserves a mention here. Instead of evaluating hard predictions, it evaluates the predicted probabilities directly:

$$\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]$$
Where $y_i$ is the true label (0 or 1) and $\hat{p}_i$ is the predicted probability for class 1.
Log loss penalizes confident wrong predictions heavily. A model that says 0.99 for a sample that is actually 0 gets hammered. This makes it useful during training and model selection, even if you report precision/recall/F1 as your final evaluation metrics.
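A small sketch makes the penalty concrete. Two models make the same three correct predictions and one wrong one; the only difference is how confident the wrong prediction is:

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]
# Both models are wrong on the last sample (true label 0)
p_modest    = [0.2, 0.8, 0.8, 0.60]  # wrong but unsure
p_confident = [0.2, 0.8, 0.8, 0.99]  # wrong and nearly certain

print(f"{log_loss(y_true, p_modest):.3f}")     # 0.396
print(f"{log_loss(y_true, p_confident):.3f}")  # 1.319
```

One overconfident mistake more than triples the average loss.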
Multi-class extensions
Everything above assumes binary classification. When you have more than two classes, you need a strategy for aggregating per-class metrics into a single number.
Per-class metrics
First, compute precision, recall, and F1 for each class independently using a one-vs-rest approach. For class $k$, treat every sample of class $k$ as positive and everything else as negative.
Macro averaging
Compute the metric for each class, then take the unweighted mean:

$$\text{Macro-F1} = \frac{1}{K} \sum_{k=1}^{K} F_{1,k}$$
This treats every class equally, regardless of size. Good when all classes matter equally.
Micro averaging
Pool all the TP, FP, and FN counts across classes, then compute the metric from the pooled counts:

$$\text{Micro-Precision} = \frac{\sum_k TP_k}{\sum_k TP_k + \sum_k FP_k}$$
This is dominated by the most common classes. In fact, for a single-label multi-class problem, micro-averaged precision, recall, and F1 all equal accuracy.
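That equivalence is easy to check on toy multi-class labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Single-label multi-class toy data (3 classes)
y_true = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 0, 1, 0, 2]

acc = accuracy_score(y_true, y_pred)
print(np.isclose(acc, precision_score(y_true, y_pred, average="micro")))  # True
print(np.isclose(acc, recall_score(y_true, y_pred, average="micro")))     # True
print(np.isclose(acc, f1_score(y_true, y_pred, average="micro")))         # True
```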
Weighted averaging
Like macro, but weight each class’s metric by its support (the number of true samples $n_k$ in that class):

$$\text{Weighted-F1} = \sum_{k=1}^{K} \frac{n_k}{N} \cdot F_{1,k}$$
This is a compromise. It accounts for class imbalance without being fully dominated by the majority class.
from sklearn.metrics import classification_report
# Multi-class predictions
y_true = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 0, 1, 0, 2]
print(classification_report(y_true, y_pred, target_names=["cat", "dog", "bird"]))
This prints per-class precision, recall, F1, and all three averaging strategies in one table. Use it.
Figure: three classifiers compared across Precision, Recall, and F1. Model C dominates, while A trades recall for precision and B does the opposite.
Choosing the right metric
There is no universal “best” metric. The right choice depends on your problem:
| Situation | Recommended metric |
|---|---|
| Balanced classes, equal error costs | Accuracy or F1 |
| Imbalanced classes | Precision, Recall, F1, or PR-AUC |
| False positives are costly | Precision |
| False negatives are costly | Recall |
| Need a single threshold-free score | ROC-AUC (balanced) or PR-AUC (imbalanced) |
| Evaluating predicted probabilities | Log loss |
| Multi-class with equal class importance | Macro F1 |
| Multi-class with unequal class sizes | Weighted F1 |
Before you pick a metric, ask: “What is the cost of each type of mistake in this application?” That question should drive everything.
A related consideration is Bayes’ theorem. In medical testing, even a highly accurate test can have a low positive predictive value (which is just precision by another name) if the base rate of the disease is very low. The prior probability of the positive class matters.
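A two-line calculation makes the point. Assume a hypothetical test with 99% sensitivity, 99% specificity, and a 1% disease prevalence:

```python
# Hypothetical screening test (assumed numbers for illustration)
sensitivity, specificity, prevalence = 0.99, 0.99, 0.01

# Bayes' theorem: P(disease | positive test) = positive predictive value
ppv = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
)
print(f"PPV = {ppv:.2f}")  # PPV = 0.50
```

Despite the test being "99% accurate" in both directions, half of all positive results are false alarms, purely because the disease is rare.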
Common pitfalls
⚠ Reporting only accuracy on an imbalanced dataset. Always check the confusion matrix first.
⚠ Optimizing for F1 when your business problem is asymmetric. If false negatives cost 100x more than false positives, use $F_\beta$ with $\beta > 1$, or directly optimize recall with a precision constraint.
⚠ Comparing AUC across datasets. AUC depends on the difficulty of the dataset, so a model with AUC 0.90 on one dataset is not necessarily better than a model with AUC 0.85 on a harder dataset.
⚠ Ignoring calibration. A model can have great AUC but poorly calibrated probabilities. If you need the predicted probabilities to mean something (e.g., “there is a 70% chance this transaction is fraud”), evaluate calibration separately.
⚠ Tuning the threshold on the test set. Pick your threshold on a validation set, then evaluate on the test set exactly once. Otherwise you are just overfitting to the test data.
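The last pitfall deserves a sketch. On assumed synthetic data, the pattern looks like this: fit on the training set, choose the threshold on the validation set, then touch the test set exactly once:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data; split into train / validation / test
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Pick the threshold that maximizes F1 on the validation set only
val_scores = model.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
best = max(thresholds, key=lambda t: f1_score(y_val, (val_scores >= t).astype(int)))

# Evaluate on the test set exactly once, with the chosen threshold
test_scores = model.predict_proba(X_test)[:, 1]
test_f1 = f1_score(y_test, (test_scores >= best).astype(int))
print(f"threshold={best:.2f}, test F1={test_f1:.3f}")
```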
What comes next
We have covered how to measure a classifier’s performance. You now know what precision, recall, F1, and ROC-AUC mean, when each one is appropriate, and how accuracy can deceive you in imbalanced settings.
Next up, we put these metrics to work. In the next article on the Naive Bayes classifier, we build a probabilistic classifier from scratch and evaluate it using the tools from this article.