
Linear regression

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

Prerequisites: "Data, features, and the ML pipeline" and "Partial derivatives and gradients".

Linear regression is the simplest ML model and one of the most useful. You’re fitting a straight line (or hyperplane) through data to predict a continuous target. It’s the starting point for nearly everything else in this series.

Why linear regression?

You want to predict the selling price of a house. You have data on past sales - size, number of bedrooms, and the final price. Can you use that history to price a new listing?

Here are six recent sales in your neighborhood:

| House | Size (sqft) | Bedrooms | Price ($k) |
|-------|-------------|----------|------------|
| A     | 1,200       | 2        | 250        |
| B     | 1,500       | 3        | 310        |
| C     | 1,800       | 3        | 370        |
| D     | 2,100       | 4        | 420        |
| E     | 2,400       | 4        | 480        |
| F     | 3,000       | 5        | 580        |

House prices: data points with best-fit regression line

Bigger houses cost more. More bedrooms cost more. The relationship looks roughly linear: each extra square foot adds a fairly consistent dollar amount to the price.

Linear regression draws the best-fit line through this data. Plug in the features of a new house and out comes a predicted price. The model assigns a weight to each feature (how much that feature matters) and adds a constant (the baseline price). The prediction is just a weighted sum.

Linear regression pipeline:

graph LR
  A["Input features:<br/>size, bedrooms"] --> B["Model:<br/>weighted sum + bias"]
  B --> C["Predicted price"]
  C --> D["Compare to<br/>actual price"]
  D --> E["Adjust weights<br/>to reduce error"]
  E --> B

Fitting a line through data:

graph TD
  subgraph data["Observed sales: size vs price"]
      P1["1200 sqft, $250k"]
      P2["1500 sqft, $310k"]
      P3["1800 sqft, $370k"]
      P4["2100 sqft, $420k"]
      P5["2400 sqft, $480k"]
      P6["3000 sqft, $580k"]
  end
  data --> L["Best-fit line passes<br/>through the middle of the points"]
  L --> Q["New house 2000 sqft<br/>Predicted: ~$390k"]

In plain terms, the process works like this:

  1. Start with a guess: “price = some_weight times size + some_weight times bedrooms + a constant”
  2. Measure how far off the predictions are from actual sale prices
  3. Adjust the weights to shrink those errors
  4. Repeat until the errors stop improving

Two methods exist for finding the best weights. The normal equations solve for them directly using linear algebra. Gradient descent finds them iteratively by taking small steps downhill on the error surface. Both arrive at the same answer. Small datasets favor normal equations. Large datasets favor gradient descent.

Now let’s write this down mathematically.

The model

Let’s turn the intuition from above into math. We said: “price = some_weight times size + some_weight times bedrooms + a constant.” That’s exactly what the equation says:

$$\hat{y} = w^T x + b$$

Here $x$ is a list of features for one house, $\hat{y}$ is the predicted price, and $b$ is the bias (a baseline price even if all features were zero). But what is $w$?

What is a weight vector?

The weight vector $w$ is just a list of numbers, one per feature, that tells the model how much each feature contributes to the prediction: larger weight, bigger influence. Each number says "for every one-unit increase in this feature, the prediction goes up by this much."

Using our house data, suppose the model learns these weights:

| Feature     | Weight        | Meaning                                |
|-------------|---------------|----------------------------------------|
| Size (sqft) | $w_1 = 0.183$ | Each extra sqft adds $183 to the price |
| Bedrooms    | $w_2 = 15.0$  | Each extra bedroom adds $15k           |
| (Bias)      | $b = 35.0$    | Base price is $35k                     |

So for House D (2,100 sqft, 4 bedrooms), plug the values into the formula:

$$\hat{y} = w_1 \times \text{size} + w_2 \times \text{bedrooms} + b$$

$$\hat{y} = 0.183 \times 2100 + 15.0 \times 4 + 35.0 = 384.3 + 60 + 35 = 479.3$$

That’s close to the actual price of $480k. The prediction is just a weighted sum of the features plus a baseline.

In compact notation, $w = [0.183,\ 15.0]$ and $x = [2100,\ 4]$. The dot product $w^T x$ multiplies matching entries and adds them up: $0.183 \times 2100 + 15.0 \times 4 = 444.3$. Then add the bias $b = 35$ to get $479.3$.
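If you want to see that as code, here is a minimal NumPy sketch of the same arithmetic (the arrays simply mirror the numbers above):

import numpy as np

w = np.array([0.183, 15.0])   # weights: size, bedrooms
x = np.array([2100, 4])       # House D: 2,100 sqft, 4 bedrooms
b = 35.0                      # bias (baseline price in $k)

print(w @ x + b)              # ~479.3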

The bias trick: absorbing bb into ww

Tracking the bias bb separately is annoying. There’s a simple trick to fold it into ww so we only have one thing to deal with.

The idea: pretend every house has a fake feature that’s always 1. Then the bias just becomes another weight.

Before the trick (3 separate pieces):

$$\hat{y} = \underbrace{w_1}_{\text{size weight}} \cdot \text{size} + \underbrace{w_2}_{\text{bedroom weight}} \cdot \text{bedrooms} + \underbrace{b}_{\text{bias}}$$

After the trick (everything in one vector):

$$\hat{y} = \underbrace{w_0}_{=\,b\,=\,35} \cdot \underbrace{1}_{\text{fake feature}} + \underbrace{w_1}_{=\,0.183} \cdot \text{size} + \underbrace{w_2}_{=\,15} \cdot \text{bedrooms}$$

We prepend 1 to the feature vector and prepend bb to the weight vector:

$$x' = \begin{bmatrix} 1 \\ 2100 \\ 4 \end{bmatrix}, \quad w' = \begin{bmatrix} 35 \\ 0.183 \\ 15 \end{bmatrix}, \quad \hat{y} = w'^T x' = 35 \times 1 + 0.183 \times 2100 + 15 \times 4 = 479.3$$

Now the prediction is just $\hat{y} = w'^T x'$, one dot product, no separate bias term. From here on, when we write $w$ and $x$, we mean these extended versions (with the 1 prepended).
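The bias trick in NumPy looks like this (a small sketch; the numbers carry over from the table above):

import numpy as np

x_ext = np.array([1, 2100, 4])        # 1 prepended as the always-on bias feature
w_ext = np.array([35, 0.183, 15.0])   # bias folded in as the first weight

print(w_ext @ x_ext)                  # ~479.3, same prediction, one dot product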

Matrix form: predicting all houses at once

Instead of plugging in one house at a time, we can stack all 6 houses into a single table (matrix) and predict all prices in one shot.

Each row is one house. The first column is our fake “always 1” feature for the bias:

$$X = \begin{bmatrix} 1 & 1200 & 2 \\ 1 & 1500 & 3 \\ 1 & 1800 & 3 \\ 1 & 2100 & 4 \\ 1 & 2400 & 4 \\ 1 & 3000 & 5 \end{bmatrix}, \quad w = \begin{bmatrix} 35 \\ 0.183 \\ 15 \end{bmatrix}$$

When you multiply $Xw$, each row of $X$ gets dot-producted with $w$, giving one prediction per house:

$$Xw = \begin{bmatrix} 1 \times 35 + 1200 \times 0.183 + 2 \times 15 \\ 1 \times 35 + 1500 \times 0.183 + 3 \times 15 \\ \vdots \\ 1 \times 35 + 3000 \times 0.183 + 5 \times 15 \end{bmatrix} = \begin{bmatrix} 284.6 \\ 354.5 \\ \vdots \\ 659.0 \end{bmatrix}$$

In short: $\hat{y} = Xw$ gives all predictions at once. $X$ is an $n \times (d+1)$ matrix ($n$ = number of houses, $d$ = number of real features, +1 for the bias column).
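In NumPy, predicting every house is a single matrix-vector product (a minimal sketch using the six houses above):

import numpy as np

X = np.array([
    [1, 1200, 2],
    [1, 1500, 3],
    [1, 1800, 3],
    [1, 2100, 4],
    [1, 2400, 4],
    [1, 3000, 5],
])                               # one row per house, ones column for the bias
w = np.array([35, 0.183, 15.0])

print(X @ w)                     # approx [284.6 354.5 409.4 479.3 534.2 659.0]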

The cost function

Now we have a model that makes predictions. But how do we know if the predictions are any good?

A cost function (also called a loss function) is a single number that answers: "how far off are we?" Think of it like a score in a game where lower is better. If predictions exactly match the real prices, the cost is zero. The further off they are, the higher the cost. The goal of training is to make this number as small as possible.

The most common cost function for linear regression is mean squared error (MSE).

Mean squared error

We want predictions close to the true targets. MSE measures the average squared gap between prediction and target:

$$L(w) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{1}{n} \|Xw - y\|^2$$

Why do we square the errors instead of just adding them up? Two reasons:

  • First, squaring makes all errors positive (a prediction that’s 10k too high and one that’s 10k too low both count equally).
  • Second, squaring punishes big misses much harder than small ones: being off by 100k costs 100 times more than being off by 10k, which pushes the model to avoid large blunders.

As a bonus, the squared error creates a smooth, bowl-shaped (convex) surface with a single lowest point, so we're guaranteed to find the best possible weights. Wondering why? Think of the graph of $x^2$: a parabola opening upwards, shaped like a bowl. The lowest point is at $x = 0$, where the error is zero, and as you move away from zero in either direction the error increases smoothly. So as we adjust the weights to minimize the MSE, we can follow the slope down to the single global minimum without getting stuck in local minima.
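Here is a minimal sketch of computing the MSE in NumPy, reusing the house matrix and the illustrative weights from above:

import numpy as np

X = np.array([[1, 1200, 2], [1, 1500, 3], [1, 1800, 3],
              [1, 2100, 4], [1, 2400, 4], [1, 3000, 5]])
y = np.array([250, 310, 370, 420, 480, 580])   # actual prices in $k
w = np.array([35, 0.183, 15.0])                # the illustrative weights from earlier

mse = np.mean((X @ w - y) ** 2)
print(mse)                                     # about 2904, in ($k)^2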

Residual plot: prediction errors for each house

Minimizing the cost function

We’ve defined the cost function, now we need to find the weights that minimize it. There are two standard approaches:

Method 1: the normal equations

Setting the gradient of the loss $L$ to zero and solving gives the closed-form solution. We'll work with the un-averaged form $L(w) = \|Xw - y\|^2 = (Xw - y)^T(Xw - y)$.

Expand:

$$L(w) = w^T X^T X w - 2 w^T X^T y + y^T y$$

Take the gradient and set it to zero:

$$\nabla_w L = 2 X^T X w - 2 X^T y = 0$$

Solve for ww:

$$X^T X w = X^T y$$

$$w^* = (X^T X)^{-1} X^T y$$

This is the normal equations solution. It gives the exact optimal weights in one shot, no iteration needed.

Normal equations step by step:

graph TD
  A["Build design matrix X<br/>with column of ones prepended"] --> B["Compute X-transpose times X"]
  B --> C["Compute X-transpose times y"]
  C --> D["Invert the matrix X-transpose X"]
  D --> E["Multiply: w-star =<br/>inverse times X-transpose y"]
  E --> F["Exact optimal weights in one shot"]

When to use it: small datasets where $X^T X$ (a $(d+1) \times (d+1)$ matrix) is easy to invert. The cost is $O(d^3)$ for the inversion, plus $O(nd^2)$ to compute $X^T X$.

When not to use it: when $d$ is large (thousands of features), when $X^T X$ is singular or nearly so, or when data doesn't fit in memory.
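Before the worked example, here is what the normal equations look like as a small NumPy sketch. It solves $X^T X w = X^T y$ with np.linalg.solve rather than forming an explicit inverse, which is the numerically safer habit:

import numpy as np

def normal_equations(X, y):
    # Solve (X^T X) w = X^T y for w without computing an explicit inverse
    return np.linalg.solve(X.T @ X, X.T @ y)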

Example 1: fitting a line to 4 points using normal equations

Data points: $(1, 2), (2, 3), (3, 5), (4, 4)$.

We want $\hat{y} = w_0 + w_1 x$. Build the design matrix by prepending a column of ones:

$$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}, \quad y = \begin{bmatrix} 2 \\ 3 \\ 5 \\ 4 \end{bmatrix}$$

Step 1: compute $X^T X$.

$$X^T X = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix} = \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix}$$

Step 2: compute $X^T y$.

$$X^T y = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \\ 5 \\ 4 \end{bmatrix} = \begin{bmatrix} 14 \\ 39 \end{bmatrix}$$

Step 3: invert $X^T X$.

For a $2 \times 2$ matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$, the inverse is $\frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$.

$$\det = 4 \cdot 30 - 10 \cdot 10 = 120 - 100 = 20$$

$$(X^T X)^{-1} = \frac{1}{20} \begin{bmatrix} 30 & -10 \\ -10 & 4 \end{bmatrix} = \begin{bmatrix} 1.5 & -0.5 \\ -0.5 & 0.2 \end{bmatrix}$$

Step 4: compute $w^* = (X^T X)^{-1} X^T y$.

$$w^* = \begin{bmatrix} 1.5 & -0.5 \\ -0.5 & 0.2 \end{bmatrix} \begin{bmatrix} 14 \\ 39 \end{bmatrix} = \begin{bmatrix} 1.5 \cdot 14 + (-0.5) \cdot 39 \\ (-0.5) \cdot 14 + 0.2 \cdot 39 \end{bmatrix} = \begin{bmatrix} 21 - 19.5 \\ -7 + 7.8 \end{bmatrix} = \begin{bmatrix} 1.5 \\ 0.8 \end{bmatrix}$$

So $w_0 = 1.5$ (intercept) and $w_1 = 0.8$ (slope). The fitted line is:

$$\hat{y} = 1.5 + 0.8x$$

Verify: $\hat{y}(1) = 2.3$, $\hat{y}(2) = 3.1$, $\hat{y}(3) = 3.9$, $\hat{y}(4) = 4.7$. The predictions are close to the actual values $[2, 3, 5, 4]$.

MSE:

$$\text{MSE} = \frac{(2.3-2)^2 + (3.1-3)^2 + (3.9-5)^2 + (4.7-4)^2}{4} = \frac{0.09 + 0.01 + 1.21 + 0.49}{4} = \frac{1.80}{4} = 0.45$$
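You can verify the whole hand calculation with a few lines of NumPy:

import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 3, 5, 4])

w = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
print(w)                                # [1.5 0.8]
print(np.mean((X @ w - y) ** 2))        # 0.45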

Method 2: gradient descent

When the normal equations are too expensive, use gradient descent. It’s iterative: start with a guess, compute the gradient of the loss, and take a small step in the opposite direction.

The gradient of the MSE loss:

$$\nabla_w L = \frac{2}{n} X^T(Xw - y)$$

The update rule:

$$w \leftarrow w - \alpha \nabla_w L$$

where $\alpha$ is the learning rate. Repeat until the loss stops decreasing (or changes by less than some tolerance).

Gradient descent iteration loop:

graph TD
  A["Initialize weights to zero"] --> B["Compute predictions: X times w"]
  B --> C["Compute gradient of MSE loss"]
  C --> D["Update: w = w minus alpha times gradient"]
  D --> E{"Loss converged?"}
  E -->|No| B
  E -->|Yes| F["Final weights"]

In code:

import numpy as np

def gradient_descent(X, y, alpha=0.01, epochs=1000):
    n, d = X.shape
    w = np.zeros(d)                                    # start from all-zero weights
    for _ in range(epochs):
        predictions = X @ w                            # predictions for every row at once
        gradient = (2 / n) * X.T @ (predictions - y)   # gradient of the MSE loss
        w = w - alpha * gradient                       # step downhill
    return w

Example 2: gradient descent on the same 4 points

Same data: $X$ (with ones column) and $y = [2, 3, 5, 4]^T$.

Initialize $w = [0, 0]^T$. Learning rate $\alpha = 0.05$.

Iteration 1:

Predictions: $Xw = [0, 0, 0, 0]^T$

Residuals: $Xw - y = [-2, -3, -5, -4]^T$

Gradient:

$$\nabla_w L = \frac{2}{4} X^T \begin{bmatrix} -2 \\ -3 \\ -5 \\ -4 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} -2 \\ -3 \\ -5 \\ -4 \end{bmatrix}$$

$$= \frac{1}{2} \begin{bmatrix} -14 \\ -39 \end{bmatrix} = \begin{bmatrix} -7 \\ -19.5 \end{bmatrix}$$

Update:

$$w \leftarrow \begin{bmatrix} 0 \\ 0 \end{bmatrix} - 0.05 \begin{bmatrix} -7 \\ -19.5 \end{bmatrix} = \begin{bmatrix} 0.35 \\ 0.975 \end{bmatrix}$$

Iteration 2:

Predictions: $Xw = [0.35 + 0.975 \cdot 1, \; 0.35 + 0.975 \cdot 2, \; 0.35 + 0.975 \cdot 3, \; 0.35 + 0.975 \cdot 4]^T$

$= [1.325, \; 2.300, \; 3.275, \; 4.250]^T$

Residuals: $[1.325 - 2, \; 2.300 - 3, \; 3.275 - 5, \; 4.250 - 4]^T = [-0.675, \; -0.700, \; -1.725, \; 0.250]^T$

Gradient:

$$\nabla_w L = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} -0.675 \\ -0.700 \\ -1.725 \\ 0.250 \end{bmatrix}$$

$$= \frac{1}{2} \begin{bmatrix} -2.850 \\ -6.250 \end{bmatrix} = \begin{bmatrix} -1.425 \\ -3.125 \end{bmatrix}$$

Update:

$$w \leftarrow \begin{bmatrix} 0.35 \\ 0.975 \end{bmatrix} - 0.05 \begin{bmatrix} -1.425 \\ -3.125 \end{bmatrix} = \begin{bmatrix} 0.42125 \\ 1.13125 \end{bmatrix}$$

After just 2 steps, $w$ is already moving toward the true optimum $[1.5, 0.8]$. The intercept is rising toward 1.5 and the slope is overshooting 0.8 a bit. Over many more iterations, the weights converge.

After 1000 iterations (you can verify with code), gradient descent gives approximately $w \approx [1.5, 0.8]$, matching the normal equations.

import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 3, 5, 4])

w = np.array([0.0, 0.0])
alpha = 0.05
n = len(y)

for i in range(1000):
    pred = X @ w
    grad = (2 / n) * X.T @ (pred - y)
    w = w - alpha * grad

print(f"w = {w}")  # [1.49999..., 0.80000...]

Normal equations vs gradient descent

Choosing between the two methods:

graph TD
  Q{"How large is the<br/>feature count d?"} -->|"d is small" | NE["Normal equations:<br/>exact, one-shot, no tuning"]
  Q -->|"d is large or<br/>X-transpose X is singular"| GD["Gradient descent:<br/>iterative, scalable, flexible"]
  NE --> R["Solve: w = inverse of<br/>X-transpose X times X-transpose y"]
  GD --> S["Loop: w = w minus alpha<br/>times gradient until converged"]

|                      | Normal equations                    | Gradient descent                     |
|----------------------|-------------------------------------|--------------------------------------|
| Formula              | $w^* = (X^TX)^{-1}X^Ty$             | $w \leftarrow w - \alpha \nabla L$   |
| Exact?               | Yes                                 | Approximate (gets closer each step)  |
| Cost                 | $O(nd^2 + d^3)$                     | $O(nd)$ per step                     |
| Large $n$?           | Slow to compute $X^TX$              | Handles batches (SGD)                |
| Large $d$?           | $d^3$ inversion is expensive        | Fine, cost is linear in $d$          |
| Hyperparameters      | None                                | Learning rate, epochs                |
| Singular $X^TX$?     | Fails or needs pseudo-inverse       | Still works                          |

For most real problems, gradient descent (or its variants like SGD, Adam, RMSProp) is the go-to method. Normal equations are great for understanding and for small problems.

Geometric interpretation

The prediction $\hat{y} = Xw$ lives in the column space of $X$. The optimal $w^*$ is the one that makes $\hat{y}$ the orthogonal projection of $y$ onto this column space. The residual vector $y - \hat{y}$ is perpendicular to every column of $X$:

$$X^T(y - Xw^*) = 0$$

This is exactly the normal equation $X^TX w^* = X^T y$ rearranged. You're finding the closest point in the column space of $X$ to the target vector $y$, measured by Euclidean distance.
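You can check that orthogonality numerically on the 4-point example (a small sketch; the residual dotted with each column of X comes out as zero, up to floating-point error):

import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 3, 5, 4])

w_star = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ w_star

print(X.T @ residual)   # ~[0. 0.]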

Multi-feature linear regression

Everything extends naturally to multiple features. With dd features:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$$

The math is identical. $X$ becomes an $n \times (d+1)$ matrix, $w$ is a $(d+1)$-dimensional vector. Normal equations and gradient descent work the same way.

The only practical difference is that you should scale your features (as covered in the data pipeline article) before running gradient descent. Normal equations don’t need scaling because they solve directly.
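A rough sketch of what that scaling step can look like (standardization; it assumes the ones column sits first, as in our design matrices):

import numpy as np

def standardize(X):
    # Standardize the real feature columns, leave the bias column of ones alone
    X = X.astype(float).copy()
    mean = X[:, 1:].mean(axis=0)
    std = X[:, 1:].std(axis=0)
    X[:, 1:] = (X[:, 1:] - mean) / std
    return X, mean, std

Scaled features put size (in the thousands) and bedrooms (single digits) on comparable ranges, so one learning rate works for both weights.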

Common pitfalls

Forgetting the bias term. If you don't include a column of ones in $X$, your line is forced through the origin. Unless you have a reason to do this, always include the intercept.

Learning rate too large. Gradient descent overshoots and the loss explodes. Start with $\alpha = 0.01$ and adjust.

Learning rate too small. Convergence takes forever. If the loss barely changes between iterations, try increasing $\alpha$.

Not checking for convergence. Always plot the loss over iterations. It should decrease smoothly. If it oscillates, reduce the learning rate.

import matplotlib.pyplot as plt

# Track the MSE at every iteration (X, y, w, n, alpha as defined above)
losses = []
for i in range(1000):
    pred = X @ w
    loss = np.mean((pred - y) ** 2)
    losses.append(loss)
    grad = (2 / n) * X.T @ (pred - y)
    w = w - alpha * grad

# A healthy loss curve decreases smoothly and flattens out
plt.plot(losses)
plt.xlabel("Iteration")
plt.ylabel("MSE")
plt.title("Loss curve")
plt.show()

Summary

Linear regression finds the best linear function $\hat{y} = w^T x + b$ by minimizing mean squared error. You can solve it exactly with the normal equations or iteratively with gradient descent. The normal equations are elegant and exact but expensive for large datasets. Gradient descent is flexible, scales to any size, and generalizes to every other ML model you'll learn.

What comes next

Your model fits the training data, but how well does it work on new data? The next article, Bias, variance, and the tradeoff, explains overfitting, underfitting, and how to diagnose which problem you have.
