
Bias, variance, and the tradeoff

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

Prerequisites: Linear regression.

Every prediction your model makes has error. That error has two reducible sources, bias and variance, plus irreducible noise. Understanding this tradeoff is the single most important concept for building models that actually work on new data.

Why does a good training score fail on new data?

Your model gets 99% accuracy on training data but 60% on test data. What went wrong? The model memorized the training examples instead of learning the underlying pattern. This is one of two failure modes, and telling them apart is the key to fixing your model.

Think of a dartboard. Each throw is a prediction your model makes.

  • High bias, low variance: every dart lands in the same wrong spot. Consistent but off-center. The model is too rigid to capture the true pattern.
  • Low bias, high variance: darts scatter all over the board, but their average is near the bullseye. The model is too sensitive to which data it trained on.
  • Low bias, low variance: darts cluster tightly around the bullseye. This is the goal.
  • High bias, high variance: darts scatter everywhere and their average is off-center. The worst case.

Bias-variance quadrants:

graph TD
  Center["Model Predictions"] --> LBLV["Low Bias + Low Variance<br/>IDEAL: accurate and consistent"]
  Center --> HBLV["High Bias + Low Variance<br/>UNDERFITTING: consistently wrong"]
  Center --> LBHV["Low Bias + High Variance<br/>OVERFITTING: scattered predictions"]
  Center --> HBHV["High Bias + High Variance<br/>WORST: wrong and scattered"]

Here is a concrete example. Fit three polynomial models to the same noisy data:

| Model | Degree | Train Error | Test Error | Problem |
|---|---|---|---|---|
| Line | 1 | 8.83 | 4.93 | Too simple, misses the curve (high bias) |
| Quadratic | 2 | 0.16 | 0.22 | Captures the pattern well |
| High polynomial | 10 | 0.00 | 15.4 | Memorizes noise, wild swings (high variance) |

The training error always drops as the model gets more complex. The test error drops at first, then shoots back up. That gap between training and test error tells you which problem you have.
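This pattern is easy to see end to end. The sketch below fits three polynomial degrees to its own synthetic noisy quadratic; the data, seed, and degrees are illustrative assumptions, so the exact numbers will differ from the table above, but the shape of the result is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the experiment: a noisy quadratic.
x_train = np.linspace(-3, 4, 15)
y_train = 2 + 0.5 * x_train**2 + rng.normal(0, 0.5, x_train.size)
x_test = np.linspace(-2.8, 3.8, 10)
y_test = 2 + 0.5 * x_test**2 + rng.normal(0, 0.5, x_test.size)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

results = {}
for degree in (1, 2, 10):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    results[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),  # train error
        mse(y_test, np.polyval(coeffs, x_test)),    # test error
    )
    print(f"degree {degree:2d}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```

Training error shrinks monotonically as degree grows, while test error bottoms out near the true degree and then climbs.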

Simple models have high bias (they cannot represent the true pattern) but low variance (they give stable predictions). Complex models have low bias (they can fit anything) but high variance (they change drastically with different training data). Your job is to find the sweet spot in between.

Now let’s formalize what we just described.

The setup

Suppose the true relationship between input $x$ and output $y$ is:

$$y = f(x) + \epsilon$$

where $f(x)$ is some unknown function we're trying to learn, and $\epsilon$ is irreducible noise with mean 0 and variance $\sigma^2$. No model can eliminate $\epsilon$. It's randomness baked into the data.

We train a model $\hat{f}(x)$ on a training set. If we could train on many different training sets (all drawn from the same distribution), we'd get a different $\hat{f}$ each time. Some would be close to $f$, others further away.
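This data-generating assumption is easy to simulate. In the sketch below, the choice of $f$, the noise level, and the sample size are all arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    # The unknown "true" function; an arbitrary quadratic for illustration.
    return 2 + 0.5 * x**2

sigma = 0.5                                    # std dev of the irreducible noise
x = rng.uniform(-3, 4, size=200)
y = f(x) + rng.normal(0, sigma, size=x.size)   # y = f(x) + eps

# The residual noise has mean ~0 and variance ~sigma^2:
print(np.mean(y - f(x)), np.var(y - f(x)))
```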

Defining bias and variance

Bias measures how far off the average prediction is from the true value:

$$\text{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)$$

High bias means the model consistently misses the target, no matter which training set you use. It’s systematically wrong. Think of a straight line trying to fit a curve.

Variance measures how much the predictions bounce around across different training sets:

$$\text{Var}[\hat{f}(x)] = \mathbb{E}\left[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\right]$$

High variance means the model is very sensitive to which specific data points are in the training set. Small changes in data cause big changes in predictions.

The decomposition

The expected prediction error at a point $x$ decomposes cleanly:

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$$

Three terms, each with a clear meaning:

| Term | What it means | Can you reduce it? |
|---|---|---|
| $\text{Bias}^2$ | Systematic error from wrong assumptions | Yes, use a more flexible model |
| $\text{Var}$ | Sensitivity to training data | Yes, use a simpler model or more data |
| $\sigma^2$ | Irreducible noise | No |

The first two are in tension. Making the model more complex reduces bias but increases variance. Making it simpler reduces variance but increases bias. That’s the tradeoff.
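The decomposition can be checked empirically by simulating many training sets. The sketch below uses an assumed quadratic truth and a deliberately too-simple linear model; all constants (noise level, grid, evaluation point) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return 2 + 0.5 * x**2        # assumed true function

sigma = 0.5                      # noise std dev
x_grid = np.linspace(-3, 4, 20)  # fixed design points for every training set
x0 = 1.5                         # point where we evaluate the decomposition
trials = 5000

preds = np.empty(trials)
sq_err = np.empty(trials)
for i in range(trials):
    y_train = f(x_grid) + rng.normal(0, sigma, x_grid.size)
    w = np.polyfit(x_grid, y_train, 1)       # deliberately too-simple line
    preds[i] = np.polyval(w, x0)
    y_new = f(x0) + rng.normal(0, sigma)     # fresh observation at x0
    sq_err[i] = (y_new - preds[i]) ** 2

bias_sq = (preds.mean() - f(x0)) ** 2
variance = preds.var()
total = sq_err.mean()
print(f"bias^2={bias_sq:.3f}  var={variance:.3f}  noise={sigma**2:.3f}")
print(f"bias^2 + var + noise = {bias_sq + variance + sigma**2:.3f}  "
      f"vs measured MSE = {total:.3f}")
```

For this underfitting line, the squared bias term dominates, and the sum of the three terms matches the measured error up to simulation noise.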

Bias-variance decomposition flow:

graph LR
  Total["Total prediction error"] --> Bias["Bias squared:<br/>systematic error from<br/>wrong model assumptions"]
  Total --> Var["Variance:<br/>sensitivity to<br/>training data choice"]
  Total --> Noise["Irreducible noise:<br/>randomness baked<br/>into the data"]

Underfitting and overfitting

Underfitting (high bias, low variance): the model is too simple to capture the true pattern. A straight line fit to a quadratic curve. Training error is high. Test error is high. Both are bad.

Overfitting (low bias, high variance): the model is too complex and memorizes training noise. A degree-15 polynomial fit to 10 data points. Training error is near zero. Test error is huge. The model performs great on data it’s seen and terribly on data it hasn’t.

The sweet spot is somewhere in between: complex enough to capture the real pattern, simple enough to ignore the noise.

graph LR
  A[Simple Model] -->|High bias, Low variance| B[Underfitting]
  C[Complex Model] -->|Low bias, High variance| D[Overfitting]
  E[Right Complexity] -->|Balanced| F[Good Generalization]

Model complexity vs prediction error (U-shaped curve):

graph LR
  Low["Low complexity<br/>HIGH bias, LOW variance"] ---|"Error decreases as<br/>complexity grows"| Mid["Sweet spot:<br/>minimum total error"]
  Mid ---|"Error increases as<br/>complexity keeps growing"| High["High complexity<br/>LOW bias, HIGH variance"]


Example 1: polynomial fits to noisy data

Let’s make this concrete. The true function is $f(x) = 2 + 0.5x^2$, and we observe it with noise.

Generate 8 training points and 4 test points:

Training data (with noise):

| $x$ | $y_{\text{true}} = 2 + 0.5x^2$ | $y$ (observed) |
|---|---|---|
| -3 | 6.5 | 7.1 |
| -2 | 4.0 | 3.5 |
| -1 | 2.5 | 2.8 |
| 0 | 2.0 | 1.6 |
| 1 | 2.5 | 2.9 |
| 2 | 4.0 | 4.4 |
| 3 | 6.5 | 6.0 |
| 4 | 10.0 | 10.8 |

Test data:

| $x$ | $y_{\text{true}}$ | $y$ (observed) |
|---|---|---|
| -2.5 | 5.125 | 4.8 |
| -0.5 | 2.125 | 2.3 |
| 1.5 | 3.125 | 3.4 |
| 3.5 | 8.125 | 7.9 |

Three polynomial fits: underfitting, good fit, and overfitting

We’ll fit three models: degree 1 (line), degree 2 (quadratic), and degree 7 (high polynomial).

Degree 1 (underfitting)

A straight line: $\hat{y} = w_0 + w_1 x$.

Using the training data, the best fit line is approximately:

$$\hat{y} \approx 3.14 + 0.94x$$

Training predictions and errors:

| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|---|---|---|---|
| -3 | 7.1 | 0.32 | 45.93 |
| -2 | 3.5 | 1.26 | 5.02 |
| -1 | 2.8 | 2.20 | 0.36 |
| 0 | 1.6 | 3.14 | 2.37 |
| 1 | 2.9 | 4.08 | 1.39 |
| 2 | 4.4 | 5.02 | 0.38 |
| 3 | 6.0 | 5.96 | 0.00 |
| 4 | 10.8 | 6.90 | 15.21 |

Train MSE 8.83\approx 8.83

Test predictions:

| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|---|---|---|---|
| -2.5 | 4.8 | 0.79 | 16.08 |
| -0.5 | 2.3 | 2.67 | 0.14 |
| 1.5 | 3.4 | 4.55 | 1.32 |
| 3.5 | 7.9 | 6.43 | 2.16 |

Test MSE 4.93\approx 4.93

Both errors are high. The line can’t capture the U-shape. This is underfitting.

Degree 2 (good fit)

A quadratic: $\hat{y} = w_0 + w_1 x + w_2 x^2$.

Since the true function is $f(x) = 2 + 0.5x^2$, the quadratic should nail it. The best fit on training data gives approximately:

$$\hat{y} \approx 1.89 + 0.08x + 0.53x^2$$

This is very close to the true function $2 + 0.5x^2$.

Train MSE $\approx 0.16$

Test predictions:

| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|---|---|---|---|
| -2.5 | 4.8 | 5.20 | 0.16 |
| -0.5 | 2.3 | 1.98 | 0.10 |
| 1.5 | 3.4 | 3.21 | 0.04 |
| 3.5 | 7.9 | 8.67 | 0.59 |

Test MSE 0.22\approx 0.22

Both train and test errors are low and close to each other. Good generalization.

Degree 7 (overfitting)

A degree 7 polynomial has 8 parameters for 8 data points. It can pass through every training point exactly.

Train MSE $= 0.00$ (perfect fit)

But on test data, the polynomial swings wildly between the training points:

Test MSE 15.4\approx 15.4

The model memorized the training noise and created huge oscillations elsewhere.

Summary of the three fits

| Model | Train MSE | Test MSE | Diagnosis |
|---|---|---|---|
| Degree 1 | 8.83 | 4.93 | Underfitting (high bias) |
| Degree 2 | 0.16 | 0.22 | Good fit |
| Degree 7 | 0.00 | 15.4 | Overfitting (high variance) |

The pattern: as complexity increases, training error always decreases. Test error decreases at first, then increases. The gap between training and test error is the key diagnostic.

Example 2: computing bias and variance numerically

Let’s compute bias and variance explicitly. We’ll evaluate at $x = 1.5$, where $f(1.5) = 2 + 0.5(1.5)^2 = 3.125$.

Imagine training the degree-1 model on 5 different training sets (each with slightly different noise). We get 5 different lines, and each gives a prediction at $x = 1.5$:

| Training set | $\hat{f}(1.5)$ |
|---|---|
| Set 1 | 4.55 |
| Set 2 | 4.30 |
| Set 3 | 4.68 |
| Set 4 | 4.42 |
| Set 5 | 4.50 |

Step 1: compute $\mathbb{E}[\hat{f}(1.5)]$.

$$\mathbb{E}[\hat{f}(1.5)] = \frac{4.55 + 4.30 + 4.68 + 4.42 + 4.50}{5} = \frac{22.45}{5} = 4.49$$

Step 2: compute bias.

$$\text{Bias} = \mathbb{E}[\hat{f}(1.5)] - f(1.5) = 4.49 - 3.125 = 1.365$$

$$\text{Bias}^2 = 1.365^2 = 1.863$$

The bias is large because a line systematically overshoots the true quadratic at $x = 1.5$.

Step 3: compute variance.

$$\text{Var} = \frac{1}{5}\sum_{i=1}^{5}(\hat{f}_i - 4.49)^2$$

$$= \frac{(4.55-4.49)^2 + (4.30-4.49)^2 + (4.68-4.49)^2 + (4.42-4.49)^2 + (4.50-4.49)^2}{5}$$

$$= \frac{0.0036 + 0.0361 + 0.0361 + 0.0049 + 0.0001}{5} = \frac{0.0808}{5} = 0.0162$$

Step 4: total error decomposition.

Assuming noise variance $\sigma^2 = 0.15$:

$$\text{Expected Error} = \text{Bias}^2 + \text{Var} + \sigma^2 = 1.863 + 0.016 + 0.15 = 2.029$$

For the degree-1 model: bias dominates. The model is too rigid.

Now repeat for the degree-7 model at the same point. Imagine 5 training sets give:

| Training set | $\hat{f}(1.5)$ |
|---|---|
| Set 1 | 2.80 |
| Set 2 | 4.10 |
| Set 3 | 1.90 |
| Set 4 | 3.60 |
| Set 5 | 3.15 |

$$\mathbb{E}[\hat{f}(1.5)] = \frac{2.80 + 4.10 + 1.90 + 3.60 + 3.15}{5} = \frac{15.55}{5} = 3.11$$

Bias: $3.11 - 3.125 = -0.015$, so $\text{Bias}^2 = 0.0002$ (very small).

Variance:

$$\text{Var} = \frac{(2.80-3.11)^2 + (4.10-3.11)^2 + (1.90-3.11)^2 + (3.60-3.11)^2 + (3.15-3.11)^2}{5}$$

$$= \frac{0.0961 + 0.9801 + 1.4641 + 0.2401 + 0.0016}{5} = \frac{2.782}{5} = 0.556$$

Total: $0.0002 + 0.556 + 0.15 = 0.706$

For the degree-7 model: variance dominates. The model is too flexible and changes a lot with different training data.

The degree-2 model would have both bias and variance small, giving the lowest total error.
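The arithmetic above is easy to reproduce. This sketch recomputes both tables' bias and variance with numpy; the prediction lists and $\sigma^2 = 0.15$ come straight from the example.

```python
import numpy as np

f_true = 3.125          # f(1.5) for the true function 2 + 0.5x^2
sigma_sq = 0.15         # assumed irreducible noise variance

def decompose(preds, f_x, noise_var):
    """Bias^2, variance, and total expected error from a list of predictions."""
    preds = np.asarray(preds)
    bias_sq = (preds.mean() - f_x) ** 2
    var = preds.var()   # population variance, matching the worked example
    return bias_sq, var, bias_sq + var + noise_var

# Degree-1 predictions at x = 1.5 across five training sets
b1, v1, t1 = decompose([4.55, 4.30, 4.68, 4.42, 4.50], f_true, sigma_sq)
print(f"degree 1: bias^2={b1:.3f}  var={v1:.4f}  total={t1:.3f}")

# Degree-7 predictions at the same point
b7, v7, t7 = decompose([2.80, 4.10, 1.90, 3.60, 3.15], f_true, sigma_sq)
print(f"degree 7: bias^2={b7:.4f}  var={v7:.3f}  total={t7:.3f}")
```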

How to diagnose bias vs variance in practice

You won’t have access to multiple training sets. Instead, compare training error and test error:

| Pattern | Diagnosis | Action |
|---|---|---|
| High train error, high test error | Underfitting (high bias) | Use a more complex model, add features |
| Low train error, high test error | Overfitting (high variance) | Regularize, get more data, simplify model |
| Low train error, low test error | Good fit | Ship it |
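One way to mechanize this table is a small heuristic like the sketch below. The thresholds are invented for illustration; in practice they depend entirely on the error scale of your problem.

```python
def diagnose(train_err, test_err, good=0.5, gap=0.5):
    """Rough bias/variance diagnosis from train and test error.

    `good` and `gap` are illustrative thresholds, not universal constants.
    """
    if train_err > good:
        return "underfitting (high bias)"
    if test_err - train_err > gap:
        return "overfitting (high variance)"
    return "good fit"

# The three fits from Example 1:
print(diagnose(8.83, 4.93))   # high train error -> underfitting
print(diagnose(0.00, 15.4))   # huge train/test gap -> overfitting
print(diagnose(0.16, 0.22))   # both low and close -> good fit
```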

Learning curves are your best diagnostic tool. Plot training error and validation error as a function of training set size:

  • High bias: both curves plateau at a high error. More data doesn’t help much.
  • High variance: training error is low, validation error is high, but the gap shrinks with more data.
from sklearn.model_selection import learning_curve
import numpy as np

# model is any scikit-learn estimator; X, y are your data
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='neg_mean_squared_error'
)

# Scores are negative MSE, so negate them to get error curves
train_err = -train_scores.mean(axis=1)
val_err = -val_scores.mean(axis=1)
# Plot train_err and val_err against train_sizes to diagnose bias/variance

What to do about it

If bias is high (underfitting):

  • Use a more complex model (higher degree polynomial, more layers in a neural network)
  • Add more features or engineer better features
  • Reduce regularization strength

If variance is high (overfitting):

  • Get more training data
  • Use regularization (Ridge, Lasso)
  • Use a simpler model
  • Use dropout (in neural networks)
  • Use ensemble methods (bagging reduces variance)

More data almost always helps with variance. It rarely helps with bias. This is why diagnosis matters: the fixes are opposite.

Summary

All prediction error comes from bias, variance, and irreducible noise. Simple models have high bias and low variance. Complex models have low bias and high variance. Your job is to find the complexity level where total error is minimized. Compare training and test error to diagnose which problem you have, then apply the right fix.

What comes next

The most common way to control variance without sacrificing too much bias is regularization. The next article covers Ridge, Lasso, and ElasticNet, three techniques that add a penalty to the loss function to keep weights small and models well-behaved.
