Bias, variance, and the tradeoff
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
Prerequisites: Linear regression.
Every prediction your model makes has error. That error comes from three sources: bias, variance, and irreducible noise. Understanding the tradeoff between bias and variance is the single most important concept for building models that actually work on new data.
Why does a good training score fail on new data?
Your model gets 99% accuracy on training data but 60% on test data. What went wrong? The model memorized the training examples instead of learning the underlying pattern. This is one of two failure modes, and telling them apart is the key to fixing your model.
Think of a dartboard. Each throw is a prediction your model makes.
- High bias, low variance: every dart lands in the same wrong spot. Consistent but off-center. The model is too rigid to capture the true pattern.
- Low bias, high variance: darts scatter all over the board, but their average is near the bullseye. The model is too sensitive to which data it trained on.
- Low bias, low variance: darts cluster tightly around the bullseye. This is the goal.
- High bias, high variance: darts scatter everywhere and their average is off-center. The worst case.
Bias-variance quadrants:
```mermaid
graph TD
    Center["Model Predictions"] --> LBLV["Low Bias + Low Variance<br/>IDEAL: accurate and consistent"]
    Center --> HBLV["High Bias + Low Variance<br/>UNDERFITTING: consistently wrong"]
    Center --> LBHV["Low Bias + High Variance<br/>OVERFITTING: scattered predictions"]
    Center --> HBHV["High Bias + High Variance<br/>WORST: wrong and scattered"]
```
Here is a concrete example. Fit three polynomial models to the same noisy data:
| Model | Degree | Train Error | Test Error | Problem |
|---|---|---|---|---|
| Line | 1 | 8.83 | 4.93 | Too simple, misses the curve (high bias) |
| Quadratic | 2 | 0.16 | 0.22 | Captures the pattern well |
| High polynomial | 7 | 0.00 | 15.4 | Memorizes noise, wild swings (high variance) |
The training error always drops as the model gets more complex. The test error drops at first, then shoots back up. That gap between training and test error tells you which problem you have.
Simple models have high bias (they cannot represent the true pattern) but low variance (they give stable predictions). Complex models have low bias (they can fit anything) but high variance (they change drastically with different training data). Your job is to find the sweet spot in between.
Now let’s formalize what we just described.
The setup
Suppose the true relationship between input and output is:

$$y = f(x) + \varepsilon$$

where $f$ is some unknown function we’re trying to learn, and $\varepsilon$ is irreducible noise with mean 0 and variance $\sigma^2$. No model can eliminate $\varepsilon$. It’s randomness baked into the data.
We train a model $\hat{f}$ on a training set. If we could train on many different training sets (all drawn from the same distribution), we’d get a different $\hat{f}$ each time. Some would be close to $f$, others further away.
Defining bias and variance
Bias measures how far off the average prediction is from the true value:

$$\text{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)$$
High bias means the model consistently misses the target, no matter which training set you use. It’s systematically wrong. Think of a straight line trying to fit a curve.
Variance measures how much the predictions bounce around across different training sets:

$$\text{Var}[\hat{f}(x)] = \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]$$
High variance means the model is very sensitive to which specific data points are in the training set. Small changes in data cause big changes in predictions.
The decomposition
The expected prediction error at a point $x$ decomposes cleanly:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$$
Three terms, each with a clear meaning:
| Term | What it means | Can you reduce it? |
|---|---|---|
| $\text{Bias}^2$ | Systematic error from wrong assumptions | Yes, use a more flexible model |
| $\text{Variance}$ | Sensitivity to training data | Yes, use a simpler model or more data |
| $\sigma^2$ | Irreducible noise | No |
The first two are in tension. Making the model more complex reduces bias but increases variance. Making it simpler reduces variance but increases bias. That’s the tradeoff.
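The decomposition can be checked empirically with a small Monte Carlo sketch: train the same model class on many fresh training sets and measure the bias and variance of its predictions at one point. The setup here is an assumption chosen to match the worked example later in this article (a quadratic true function $f(x) = 0.5x^2 + 2$, 8 training points, noise variance 0.25).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.25  # assumed irreducible noise variance

def f(x):
    # Assumed true function: a quadratic.
    return 0.5 * x**2 + 2

def predict_at(degree, x0):
    """Fit a polynomial of the given degree to one fresh noisy
    training set of 8 points and return its prediction at x0."""
    x = np.linspace(-3, 4, 8)
    y = f(x) + rng.normal(0, np.sqrt(sigma2), x.size)
    return np.polyval(np.polyfit(x, y, degree), x0)

x0 = 1.5
results = {}
for degree in (1, 2, 7):
    preds = np.array([predict_at(degree, x0) for _ in range(2000)])
    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.3f}, "
          f"variance = {variance:.3f}, "
          f"expected error ~ {bias_sq + variance + sigma2:.3f}")
```

The line (degree 1) shows a large bias squared and small variance, the degree-7 polynomial shows the reverse, and degree 2 keeps both small.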
Bias-variance decomposition flow:
```mermaid
graph LR
    Total["Total prediction error"] --> Bias["Bias squared:<br/>systematic error from<br/>wrong model assumptions"]
    Total --> Var["Variance:<br/>sensitivity to<br/>training data choice"]
    Total --> Noise["Irreducible noise:<br/>randomness baked<br/>into the data"]
```
Underfitting and overfitting
Underfitting (high bias, low variance): the model is too simple to capture the true pattern. A straight line fit to a quadratic curve. Training error is high. Test error is high. Both are bad.
Overfitting (low bias, high variance): the model is too complex and memorizes training noise. A degree-15 polynomial fit to 10 data points. Training error is near zero. Test error is huge. The model performs great on data it’s seen and terribly on data it hasn’t.
The sweet spot is somewhere in between: complex enough to capture the real pattern, simple enough to ignore the noise.
```mermaid
graph LR
    A[Simple Model] -->|High bias, Low variance| B[Underfitting]
    C[Complex Model] -->|Low bias, High variance| D[Overfitting]
    E[Right Complexity] -->|Balanced| F[Good Generalization]
```
Model complexity vs prediction error (U-shaped curve):
```mermaid
graph LR
    Low["Low complexity<br/>HIGH bias, LOW variance"] ---|"Error decreases as<br/>complexity grows"| Mid["Sweet spot:<br/>minimum total error"]
    Mid ---|"Error increases as<br/>complexity keeps growing"| High["High complexity<br/>LOW bias, HIGH variance"]
```
Example 1: polynomial fits to noisy data
Let’s make this concrete. The true function is $f(x) = 0.5x^2 + 2$, and we observe it with noise.
Generate 8 training points and 4 test points:
Training data (with noise):
| (observed) | ||
|---|---|---|
| -3 | 6.5 | 7.1 |
| -2 | 4.0 | 3.5 |
| -1 | 2.5 | 2.8 |
| 0 | 2.0 | 1.6 |
| 1 | 2.5 | 2.9 |
| 2 | 4.0 | 4.4 |
| 3 | 6.5 | 6.0 |
| 4 | 10.0 | 10.8 |
Test data:
| (observed) | ||
|---|---|---|
| -2.5 | 5.125 | 4.8 |
| -0.5 | 2.125 | 2.3 |
| 1.5 | 3.125 | 3.4 |
| 3.5 | 8.125 | 7.9 |
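Data like the tables above can be generated in a few lines. This is a sketch: the noise standard deviation of 0.5 is an assumption consistent with the tables, and a fresh random draw won’t reproduce the exact observed values.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed

def f(x):
    # The true function behind the tables.
    return 0.5 * x**2 + 2

x_train = np.array([-3, -2, -1, 0, 1, 2, 3, 4], dtype=float)
x_test = np.array([-2.5, -0.5, 1.5, 3.5])

# Observed y = true value plus Gaussian noise (assumed std 0.5).
y_train = f(x_train) + rng.normal(0, 0.5, x_train.size)
y_test = f(x_test) + rng.normal(0, 0.5, x_test.size)

print(np.column_stack([x_train, f(x_train), y_train.round(1)]))
```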
Three polynomial fits: underfitting, good fit, and overfitting
We’ll fit three models: degree 1 (line), degree 2 (quadratic), and degree 7 (high polynomial).
Degree 1 (underfitting)
A straight line: $\hat{f}(x) = w_0 + w_1 x$.
Using the training data, the fitted line is approximately:

$$\hat{f}(x) \approx 0.94x + 3.14$$
Training predictions and errors:
| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|---|---|---|---|
| -3 | 7.1 | 0.32 | 45.93 |
| -2 | 3.5 | 1.26 | 5.02 |
| -1 | 2.8 | 2.20 | 0.36 |
| 0 | 1.6 | 3.14 | 2.37 |
| 1 | 2.9 | 4.08 | 1.39 |
| 2 | 4.4 | 5.02 | 0.38 |
| 3 | 6.0 | 5.96 | 0.00 |
| 4 | 10.8 | 6.90 | 15.21 |
Train MSE $= 70.66 / 8 \approx 8.83$
Test predictions:
| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|---|---|---|---|
| -2.5 | 4.8 | 0.79 | 16.08 |
| -0.5 | 2.3 | 2.67 | 0.14 |
| 1.5 | 3.4 | 4.55 | 1.32 |
| 3.5 | 7.9 | 6.43 | 2.16 |
Test MSE $= 19.70 / 4 \approx 4.93$
Both errors are high. The line can’t capture the U-shape. This is underfitting.
Degree 2 (good fit)
A quadratic: $\hat{f}(x) = w_0 + w_1 x + w_2 x^2$.
Since the true function is $f(x) = 0.5x^2 + 2$, the quadratic should nail it. The best fit on training data gives approximately:

$$\hat{f}(x) \approx 0.54x^2 + 0.04x + 1.90$$

This is very close to the true function $f(x) = 0.5x^2 + 2$.
Train MSE $\approx 0.16$
Test predictions:
| $x$ | $y$ | $\hat{y}$ | $(y - \hat{y})^2$ |
|---|---|---|---|
| -2.5 | 4.8 | 5.20 | 0.16 |
| -0.5 | 2.3 | 1.98 | 0.10 |
| 1.5 | 3.4 | 3.21 | 0.04 |
| 3.5 | 7.9 | 8.67 | 0.59 |
Test MSE $= 0.89 / 4 \approx 0.22$
Both train and test errors are low and close to each other. Good generalization.
Degree 7 (overfitting)
A degree 7 polynomial has 8 parameters for 8 data points. It can pass through every training point exactly.
Train MSE $= 0.00$ (perfect fit)
But on test data, the polynomial swings wildly between the training points:
Test MSE $\approx 15.4$
The model memorized the training noise and created huge oscillations elsewhere.
Summary of the three fits
| Model | Train MSE | Test MSE | Diagnosis |
|---|---|---|---|
| Degree 1 | 8.83 | 4.93 | Underfitting (high bias) |
| Degree 2 | 0.16 | 0.22 | Good fit |
| Degree 7 | 0.00 | 15.4 | Overfitting (high variance) |
The pattern: as complexity increases, training error always decreases. Test error decreases at first, then increases. The gap between training and test error is the key diagnostic.
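The whole experiment fits in a few lines with `np.polyfit`. One caveat: `np.polyfit` computes exact least-squares fits, so its numbers differ somewhat from the hand-rounded tables above, but the qualitative pattern is identical: training error falls monotonically with degree, while test error is lowest for the quadratic.

```python
import numpy as np

# The exact data points from the tables above.
x_train = np.array([-3, -2, -1, 0, 1, 2, 3, 4], dtype=float)
y_train = np.array([7.1, 3.5, 2.8, 1.6, 2.9, 4.4, 6.0, 10.8])
x_test = np.array([-2.5, -0.5, 1.5, 3.5])
y_test = np.array([4.8, 2.3, 3.4, 7.9])

results = {}
for degree in (1, 2, 7):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```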
Example 2: computing bias and variance numerically
Let’s compute bias and variance explicitly. We’ll evaluate at $x_0 = 1.5$, where $f(x_0) = 0.5(1.5)^2 + 2 = 3.125$.
Imagine training the degree-1 model on 5 different training sets (each with slightly different noise). We get 5 different lines, and each gives a prediction $\hat{f}(x_0)$ at $x_0 = 1.5$:
| Training set | $\hat{f}(x_0)$ |
|---|---|
| Set 1 | 4.55 |
| Set 2 | 4.30 |
| Set 3 | 4.68 |
| Set 4 | 4.42 |
| Set 5 | 4.50 |
Step 1: compute the average prediction.

$$\mathbb{E}[\hat{f}(x_0)] = \frac{4.55 + 4.30 + 4.68 + 4.42 + 4.50}{5} = 4.49$$
Step 2: compute bias.

$$\text{Bias} = 4.49 - 3.125 = 1.365, \qquad \text{Bias}^2 \approx 1.86$$
The bias is large because a line systematically overshoots the true quadratic at .
Step 3: compute variance.

$$\text{Var} = \frac{1}{5}\sum_{i=1}^{5}\big(\hat{f}_i(x_0) - 4.49\big)^2 \approx 0.016$$
Step 4: total error decomposition.
Assuming noise variance $\sigma^2 = 0.25$ (consistent with the residuals in the training data):

$$\text{Total} \approx 1.86 + 0.016 + 0.25 \approx 2.13$$
For the degree-1 model: bias dominates. The model is too rigid.
Now repeat for the degree-7 model at the same point. Imagine 5 training sets give:
| Training set | $\hat{f}(x_0)$ |
|---|---|
| Set 1 | 2.80 |
| Set 2 | 4.10 |
| Set 3 | 1.90 |
| Set 4 | 3.60 |
| Set 5 | 3.15 |
Bias: $\mathbb{E}[\hat{f}(x_0)] = 3.11$, so $\text{Bias} = 3.11 - 3.125 = -0.015$ and $\text{Bias}^2 \approx 0.0002$ (very small).
Variance: $\frac{1}{5}\sum_{i=1}^{5}\big(\hat{f}_i(x_0) - 3.11\big)^2 \approx 0.56$
Total: $0.0002 + 0.56 + 0.25 \approx 0.81$
For the degree-7 model: variance dominates. The model is too flexible and changes a lot with different training data.
The degree-2 model would have both bias and variance small, giving the lowest total error.
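The hand computation above is easy to script. Here is a sketch using the five hypothetical predictions from the tables, with the assumed noise variance $\sigma^2 = 0.25$:

```python
import numpy as np

f_x0 = 3.125      # true value f(1.5) = 0.5 * 1.5**2 + 2
noise_var = 0.25  # assumed irreducible noise variance

# Predictions at x0 = 1.5 from the five hypothetical training sets.
preds = {
    "degree 1": np.array([4.55, 4.30, 4.68, 4.42, 4.50]),
    "degree 7": np.array([2.80, 4.10, 1.90, 3.60, 3.15]),
}

results = {}
for name, p in preds.items():
    bias_sq = (p.mean() - f_x0) ** 2
    variance = p.var()  # population variance, as in the steps above
    results[name] = (bias_sq, variance)
    print(f"{name}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, "
          f"total = {bias_sq + variance + noise_var:.4f}")
```

The printout reproduces the pattern worked out by hand: bias dominates for the line, variance dominates for the degree-7 polynomial.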
How to diagnose bias vs variance in practice
You won’t have access to multiple training sets. Instead, compare training error and test error:
| Pattern | Diagnosis | Action |
|---|---|---|
| High train error, high test error | Underfitting (high bias) | Use a more complex model, add features |
| Low train error, high test error | Overfitting (high variance) | Regularize, get more data, simplify model |
| Low train error, low test error | Good fit | Ship it |
Learning curves are your best diagnostic tool. Plot training error and validation error as a function of training set size:
- High bias: both curves plateau at a high error. More data doesn’t help much.
- High variance: training error is low, validation error is high, but the gap shrinks with more data.
```python
import numpy as np
from sklearn.model_selection import learning_curve

# model, X, y: your estimator and dataset
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='neg_mean_squared_error'
)

# learning_curve returns negated MSE scores; plot
# -train_scores.mean(axis=1) and -val_scores.mean(axis=1)
# against train_sizes to diagnose bias vs variance.
```
What to do about it
If bias is high (underfitting):
- Use a more complex model (higher degree polynomial, more layers in a neural network)
- Add more features or engineer better features
- Reduce regularization strength
If variance is high (overfitting):
- Get more training data
- Use regularization (Ridge, Lasso)
- Use a simpler model
- Use dropout (in neural networks)
- Use ensemble methods (bagging reduces variance)
More data almost always helps with variance. It rarely helps with bias. This is why diagnosis matters: the fixes are opposite.
Summary
All prediction error comes from bias, variance, and irreducible noise. Simple models have high bias and low variance. Complex models have low bias and high variance. Your job is to find the complexity level where total error is minimized. Compare training and test error to diagnose which problem you have, then apply the right fix.
What comes next
The most common way to control variance without sacrificing too much bias is regularization. The next article covers Ridge, Lasso, and ElasticNet, three techniques that add a penalty to the loss function to keep weights small and models well-behaved.