
Bias, variance, and the tradeoff

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

Prerequisites: Linear regression.

Every prediction your model makes has error. That error has two reducible sources, bias and variance, plus irreducible noise. Understanding this tradeoff is the single most important concept for building models that actually work on new data.

Why does a good training score fail on new data?

Your model gets 99% accuracy on training data but 60% on test data. What went wrong? The model memorized the training examples instead of learning the underlying pattern. This is one of two failure modes, and telling them apart is the key to fixing your model.

Think of a dartboard. Each throw is a prediction your model makes.

  • High bias, low variance: every dart lands in the same wrong spot. Consistent but off-center. The model is too rigid to capture the true pattern.
  • Low bias, high variance: darts scatter all over the board, but their average is near the bullseye. The model is too sensitive to which data it trained on.
  • Low bias, low variance: darts cluster tightly around the bullseye. This is the goal.
  • High bias, high variance: darts scatter everywhere and their average is off-center. The worst case.

Bias-variance quadrants:

graph TD
  Center["Model Predictions"] --> LBLV["Low Bias + Low Variance<br/>IDEAL: accurate and consistent"]
  Center --> HBLV["High Bias + Low Variance<br/>UNDERFITTING: consistently wrong"]
  Center --> LBHV["Low Bias + High Variance<br/>OVERFITTING: scattered predictions"]
  Center --> HBHV["High Bias + High Variance<br/>WORST: wrong and scattered"]

Here is a concrete example. Fit three polynomial models to the same noisy data:

| Model | Degree | Train Error | Test Error | Problem |
|---|---|---|---|---|
| Line | 1 | 8.83 | 4.93 | Too simple, misses the curve (high bias) |
| Quadratic | 2 | 0.16 | 0.22 | Captures the pattern well |
| High polynomial | 10 | 0.00 | 15.4 | Memorizes noise, wild swings (high variance) |

The training error always drops as the model gets more complex. The test error drops at first, then shoots back up. That gap between training and test error tells you which problem you have.
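This pattern is easy to see end to end. The sketch below fits three polynomial degrees to its own synthetic noisy quadratic; the data, seed, and degrees are illustrative assumptions, so the exact numbers will differ from the table above, but the shape of the result is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the experiment: a noisy quadratic.
x_train = np.linspace(-3, 4, 15)
y_train = 2 + 0.5 * x_train**2 + rng.normal(0, 0.5, x_train.size)
x_test = np.linspace(-2.8, 3.8, 10)
y_test = 2 + 0.5 * x_test**2 + rng.normal(0, 0.5, x_test.size)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

results = {}
for degree in (1, 2, 10):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    results[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),  # train error
        mse(y_test, np.polyval(coeffs, x_test)),    # test error
    )
    print(f"degree {degree:2d}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```

Training error shrinks monotonically as degree grows, while test error bottoms out near the true degree and then climbs.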

Simple models have high bias (they cannot represent the true pattern) but low variance (they give stable predictions). Complex models have low bias (they can fit anything) but high variance (they change drastically with different training data). Your job is to find the sweet spot in between.

Now let’s formalize what we just described.

The setup

Suppose the true relationship between input $x$ and output $y$ is:

$$y = f(x) + \epsilon$$

where $f(x)$ is some unknown function we're trying to learn, and $\epsilon$ is irreducible noise with mean 0 and variance $\sigma^2$. No model can eliminate $\epsilon$. It's randomness baked into the data.

We train a model $\hat{f}(x)$ on a training set. If we could train on many different training sets (all drawn from the same distribution), we'd get a different $\hat{f}$ each time. Some would be close to $f$, others further away.
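This data-generating assumption is easy to simulate. In the sketch below, the choice of $f$, the noise level, and the sample size are all arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    # The unknown "true" function; an arbitrary quadratic for illustration.
    return 2 + 0.5 * x**2

sigma = 0.5                                    # std dev of the irreducible noise
x = rng.uniform(-3, 4, size=200)
y = f(x) + rng.normal(0, sigma, size=x.size)   # y = f(x) + eps

# The residual noise has mean ~0 and variance ~sigma^2:
print(np.mean(y - f(x)), np.var(y - f(x)))
```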

Defining bias and variance

Bias measures how far off the average prediction is from the true value:

$$\text{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)$$

High bias means the model consistently misses the target, no matter which training set you use. It’s systematically wrong. Think of a straight line trying to fit a curve.

Variance measures how much the predictions bounce around across different training sets:

$$\text{Var}[\hat{f}(x)] = \mathbb{E}\left[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\right]$$

High variance means the model is very sensitive to which specific data points are in the training set. Small changes in data cause big changes in predictions.

The decomposition

The expected prediction error at a point $x$ decomposes cleanly:

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$$

Three terms, each with a clear meaning:

| Term | What it means | Can you reduce it? |
|---|---|---|
| $\text{Bias}^2$ | Systematic error from wrong assumptions | Yes, use a more flexible model |
| $\text{Var}$ | Sensitivity to training data | Yes, use a simpler model or more data |
| $\sigma^2$ | Irreducible noise | No |

The first two are in tension. Making the model more complex reduces bias but increases variance. Making it simpler reduces variance but increases bias. That’s the tradeoff.
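The decomposition can be checked empirically by simulating many training sets. The sketch below uses an assumed quadratic truth and a deliberately too-simple linear model; all constants (noise level, grid, evaluation point) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return 2 + 0.5 * x**2        # assumed true function

sigma = 0.5                      # noise std dev
x_grid = np.linspace(-3, 4, 20)  # fixed design points for every training set
x0 = 1.5                         # point where we evaluate the decomposition
trials = 5000

preds = np.empty(trials)
sq_err = np.empty(trials)
for i in range(trials):
    y_train = f(x_grid) + rng.normal(0, sigma, x_grid.size)
    w = np.polyfit(x_grid, y_train, 1)       # deliberately too-simple line
    preds[i] = np.polyval(w, x0)
    y_new = f(x0) + rng.normal(0, sigma)     # fresh observation at x0
    sq_err[i] = (y_new - preds[i]) ** 2

bias_sq = (preds.mean() - f(x0)) ** 2
variance = preds.var()
total = sq_err.mean()
print(f"bias^2={bias_sq:.3f}  var={variance:.3f}  noise={sigma**2:.3f}")
print(f"bias^2 + var + noise = {bias_sq + variance + sigma**2:.3f}  "
      f"vs measured MSE = {total:.3f}")
```

For this underfitting line, the squared bias term dominates, and the sum of the three terms matches the measured error up to simulation noise.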

Bias-variance decomposition flow:

graph LR
  Total["Total prediction error"] --> Bias["Bias squared:<br/>systematic error from<br/>wrong model assumptions"]
  Total --> Var["Variance:<br/>sensitivity to<br/>training data choice"]
  Total --> Noise["Irreducible noise:<br/>randomness baked<br/>into the data"]

Underfitting and overfitting

Underfitting (high bias, low variance): the model is too simple to capture the true pattern. A straight line fit to a quadratic curve. Training error is high. Test error is high. Both are bad.

Overfitting (low bias, high variance): the model is too complex and memorizes training noise. A degree-15 polynomial fit to 10 data points. Training error is near zero. Test error is huge. The model performs great on data it’s seen and terribly on data it hasn’t.

The sweet spot is somewhere in between: complex enough to capture the real pattern, simple enough to ignore the noise.

graph LR
  A[Simple Model] -->|High bias, Low variance| B[Underfitting]
  C[Complex Model] -->|Low bias, High variance| D[Overfitting]
  E[Right Complexity] -->|Balanced| F[Good Generalization]

Model complexity vs prediction error (U-shaped curve):

graph LR
  Low["Low complexity<br/>HIGH bias, LOW variance"] ---|"Error decreases as<br/>complexity grows"| Mid["Sweet spot:<br/>minimum total error"]
  Mid ---|"Error increases as<br/>complexity keeps growing"| High["High complexity<br/>LOW bias, HIGH variance"]


Example 1: polynomial fits to noisy data

Let’s make this concrete. The true function is $f(x) = 2 + 0.5x^2$, and we observe it with noise.

Generate 8 training points and 4 test points:

Training data (with noise):

| $x$ | $y_{\text{true}} = 2 + 0.5x^2$ | $y$ (observed) |
|---|---|---|
| -3 | 6.5 | 7.1 |
| -2 | 4.0 | 3.5 |
| -1 | 2.5 | 2.8 |
| 0 | 2.0 | 1.6 |
| 1 | 2.5 | 2.9 |
| 2 | 4.0 | 4.4 |
| 3 | 6.5 | 6.0 |
| 4 | 10.0 | 10.8 |

Test data:

| $x$ | $y_{\text{true}}$ | $y$ (observed) |
|---|---|---|
| -2.5 | 5.125 | 4.8 |
| -0.5 | 2.125 | 2.3 |
| 1.5 | 3.125 | 3.4 |
| 3.5 | 8.125 | 7.9 |

Three polynomial fits: underfitting, good fit, and overfitting

We’ll fit three models: degree 1 (line), degree 2 (quadratic), and degree 7 (high polynomial).

Degree 1 (underfitting)

A straight line: $\hat{y} = w_0 + w_1 x$.

Using the training data, the best fit line is approximately:

$$\hat{y} \approx 3.14 + 0.94x$$

Training predictions and errors:

| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|---|---|---|---|
| -3 | 7.1 | 0.32 | 45.93 |
| -2 | 3.5 | 1.26 | 5.02 |
| -1 | 2.8 | 2.20 | 0.36 |
| 0 | 1.6 | 3.14 | 2.37 |
| 1 | 2.9 | 4.08 | 1.39 |
| 2 | 4.4 | 5.02 | 0.38 |
| 3 | 6.0 | 5.96 | 0.00 |
| 4 | 10.8 | 6.90 | 15.21 |

Train MSE 8.83\approx 8.83

Test predictions:

| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|---|---|---|---|
| -2.5 | 4.8 | 0.79 | 16.08 |
| -0.5 | 2.3 | 2.67 | 0.14 |
| 1.5 | 3.4 | 4.55 | 1.32 |
| 3.5 | 7.9 | 6.43 | 2.16 |

Test MSE 4.93\approx 4.93

Both errors are high. The line can’t capture the U-shape. This is underfitting.

Degree 2 (good fit)

A quadratic: $\hat{y} = w_0 + w_1 x + w_2 x^2$.

Since the true function is $f(x) = 2 + 0.5x^2$, the quadratic should nail it. The best fit on training data gives approximately:

$$\hat{y} \approx 1.89 + 0.08x + 0.53x^2$$

This is very close to the true function $2 + 0.5x^2$.

Train MSE $\approx 0.16$

Test predictions:

| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|---|---|---|---|
| -2.5 | 4.8 | 5.20 | 0.16 |
| -0.5 | 2.3 | 1.98 | 0.10 |
| 1.5 | 3.4 | 3.21 | 0.04 |
| 3.5 | 7.9 | 8.67 | 0.59 |

Test MSE 0.22\approx 0.22

Both train and test errors are low and close to each other. Good generalization.

Degree 7 (overfitting)

A degree 7 polynomial has 8 parameters for 8 data points. It can pass through every training point exactly.

Train MSE $= 0.00$ (perfect fit)

But on test data, the polynomial swings wildly between the training points:

Test MSE 15.4\approx 15.4

The model memorized the training noise and created huge oscillations elsewhere.

Summary of the three fits

| Model | Train MSE | Test MSE | Diagnosis |
|---|---|---|---|
| Degree 1 | 8.83 | 4.93 | Underfitting (high bias) |
| Degree 2 | 0.16 | 0.22 | Good fit |
| Degree 7 | 0.00 | 15.4 | Overfitting (high variance) |

The pattern: as complexity increases, training error always decreases. Test error decreases at first, then increases. The gap between training and test error is the key diagnostic.

Example 2: computing bias and variance numerically

Let’s compute bias and variance explicitly. We’ll evaluate at $x = 1.5$, where $f(1.5) = 2 + 0.5(1.5)^2 = 3.125$.

Imagine training the degree-1 model on 5 different training sets (each with slightly different noise). We get 5 different lines, and each gives a prediction at $x = 1.5$:

| Training set | $\hat{f}(1.5)$ |
|---|---|
| Set 1 | 4.55 |
| Set 2 | 4.30 |
| Set 3 | 4.68 |
| Set 4 | 4.42 |
| Set 5 | 4.50 |

Step 1: compute $\mathbb{E}[\hat{f}(1.5)]$.

$$\mathbb{E}[\hat{f}(1.5)] = \frac{4.55 + 4.30 + 4.68 + 4.42 + 4.50}{5} = \frac{22.45}{5} = 4.49$$

Step 2: compute bias.

$$\text{Bias} = \mathbb{E}[\hat{f}(1.5)] - f(1.5) = 4.49 - 3.125 = 1.365$$

$$\text{Bias}^2 = 1.365^2 = 1.863$$

The bias is large because a line systematically overshoots the true quadratic at $x = 1.5$.

Step 3: compute variance.

$$\text{Var} = \frac{1}{5}\sum_{i=1}^{5}(\hat{f}_i - 4.49)^2$$

$$= \frac{(4.55-4.49)^2 + (4.30-4.49)^2 + (4.68-4.49)^2 + (4.42-4.49)^2 + (4.50-4.49)^2}{5}$$

$$= \frac{0.0036 + 0.0361 + 0.0361 + 0.0049 + 0.0001}{5} = \frac{0.0808}{5} = 0.0162$$

Step 4: total error decomposition.

Assuming noise variance $\sigma^2 = 0.15$:

$$\text{Expected Error} = \text{Bias}^2 + \text{Var} + \sigma^2 = 1.863 + 0.016 + 0.15 = 2.029$$

For the degree-1 model: bias dominates. The model is too rigid.

Now repeat for the degree-7 model at the same point. Imagine 5 training sets give:

| Training set | $\hat{f}(1.5)$ |
|---|---|
| Set 1 | 2.80 |
| Set 2 | 4.10 |
| Set 3 | 1.90 |
| Set 4 | 3.60 |
| Set 5 | 3.15 |

$$\mathbb{E}[\hat{f}(1.5)] = \frac{2.80 + 4.10 + 1.90 + 3.60 + 3.15}{5} = \frac{15.55}{5} = 3.11$$

Bias: $3.11 - 3.125 = -0.015$, so $\text{Bias}^2 = 0.0002$ (very small).

Variance:

$$\text{Var} = \frac{(2.80-3.11)^2 + (4.10-3.11)^2 + (1.90-3.11)^2 + (3.60-3.11)^2 + (3.15-3.11)^2}{5}$$

$$= \frac{0.0961 + 0.9801 + 1.4641 + 0.2401 + 0.0016}{5} = \frac{2.782}{5} = 0.556$$

Total: $0.0002 + 0.556 + 0.15 = 0.706$

For the degree-7 model: variance dominates. The model is too flexible and changes a lot with different training data.

The degree-2 model would have both bias and variance small, giving the lowest total error.
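The arithmetic above is easy to reproduce. This sketch recomputes both tables' bias and variance with numpy; the prediction lists and $\sigma^2 = 0.15$ come straight from the example.

```python
import numpy as np

f_true = 3.125          # f(1.5) for the true function 2 + 0.5x^2
sigma_sq = 0.15         # assumed irreducible noise variance

def decompose(preds, f_x, noise_var):
    """Bias^2, variance, and total expected error from a list of predictions."""
    preds = np.asarray(preds)
    bias_sq = (preds.mean() - f_x) ** 2
    var = preds.var()   # population variance, matching the worked example
    return bias_sq, var, bias_sq + var + noise_var

# Degree-1 predictions at x = 1.5 across five training sets
b1, v1, t1 = decompose([4.55, 4.30, 4.68, 4.42, 4.50], f_true, sigma_sq)
print(f"degree 1: bias^2={b1:.3f}  var={v1:.4f}  total={t1:.3f}")

# Degree-7 predictions at the same point
b7, v7, t7 = decompose([2.80, 4.10, 1.90, 3.60, 3.15], f_true, sigma_sq)
print(f"degree 7: bias^2={b7:.4f}  var={v7:.3f}  total={t7:.3f}")
```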

How to diagnose bias vs variance in practice

You won’t have access to multiple training sets. Instead, compare training error and test error:

| Pattern | Diagnosis | Action |
|---|---|---|
| High train error, high test error | Underfitting (high bias) | Use a more complex model, add features |
| Low train error, high test error | Overfitting (high variance) | Regularize, get more data, simplify model |
| Low train error, low test error | Good fit | Ship it |
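One way to mechanize this table is a small heuristic like the sketch below. The thresholds are invented for illustration; in practice they depend entirely on the error scale of your problem.

```python
def diagnose(train_err, test_err, good=0.5, gap=0.5):
    """Rough bias/variance diagnosis from train and test error.

    `good` and `gap` are illustrative thresholds, not universal constants.
    """
    if train_err > good:
        return "underfitting (high bias)"
    if test_err - train_err > gap:
        return "overfitting (high variance)"
    return "good fit"

# The three fits from Example 1:
print(diagnose(8.83, 4.93))   # high train error -> underfitting
print(diagnose(0.00, 15.4))   # huge train/test gap -> overfitting
print(diagnose(0.16, 0.22))   # both low and close -> good fit
```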

Learning curves are your best diagnostic tool. Plot training error and validation error as a function of training set size:

  • High bias: both curves plateau at a high error. More data doesn’t help much.
  • High variance: training error is low, validation error is high, but the gap shrinks with more data.
from sklearn.model_selection import learning_curve
import numpy as np

# model is any scikit-learn estimator; X, y are your data
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='neg_mean_squared_error'
)

# Scores are negative MSE, so negate them to get error curves
train_err = -train_scores.mean(axis=1)
val_err = -val_scores.mean(axis=1)
# Plot train_err and val_err against train_sizes to diagnose bias/variance

What to do about it

If bias is high (underfitting):

  • Use a more complex model (higher degree polynomial, more layers in a neural network)
  • Add more features or engineer better features
  • Reduce regularization strength

If variance is high (overfitting):

  • Get more training data
  • Use regularization (Ridge, Lasso)
  • Use a simpler model
  • Use dropout (in neural networks)
  • Use ensemble methods (bagging reduces variance)

More data almost always helps with variance. It rarely helps with bias. This is why diagnosis matters: the fixes are opposite.

Summary

All prediction error comes from bias, variance, and irreducible noise. Simple models have high bias and low variance. Complex models have low bias and high variance. Your job is to find the complexity level where total error is minimized. Compare training and test error to diagnose which problem you have, then apply the right fix.

What comes next

The most common way to control variance without sacrificing too much bias is regularization. The next article covers Ridge, Lasso, and ElasticNet, three techniques that add a penalty to the loss function to keep weights small and models well-behaved.
