
Linear regression

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

Prerequisites: "Data, features, and the ML pipeline" and "Partial derivatives and gradients".

Linear regression is the simplest ML model and one of the most useful. You’re fitting a straight line (or hyperplane) through data to predict a continuous target. It’s the starting point for nearly everything else in this series.

Why linear regression?

You want to predict the selling price of a house. You have data on past sales - size, number of bedrooms, and the final price. Can you use that history to price a new listing?

Here are six recent sales in your neighborhood:

| House | Size (sqft) | Bedrooms | Price ($k) |
|-------|-------------|----------|------------|
| A     | 1,200       | 2        | 250        |
| B     | 1,500       | 3        | 310        |
| C     | 1,800       | 3        | 370        |
| D     | 2,100       | 4        | 420        |
| E     | 2,400       | 4        | 480        |
| F     | 3,000       | 5        | 580        |

House prices: data points with best-fit regression line

Bigger houses cost more. More bedrooms cost more. The relationship looks roughly linear: each extra square foot adds a fairly consistent dollar amount to the price.

Linear regression draws the best-fit line through this data. Plug in the features of a new house and out comes a predicted price. The model assigns a weight to each feature (how much that feature matters) and adds a constant (the baseline price). The prediction is just a weighted sum.

Linear regression pipeline:

graph LR
  A["Input features:<br/>size, bedrooms"] --> B["Model:<br/>weighted sum + bias"]
  B --> C["Predicted price"]
  C --> D["Compare to<br/>actual price"]
  D --> E["Adjust weights<br/>to reduce error"]
  E --> B

Fitting a line through data:

graph TD
  subgraph data["Observed sales: size vs price"]
      P1["1200 sqft, $250k"]
      P2["1500 sqft, $310k"]
      P3["1800 sqft, $370k"]
      P4["2100 sqft, $420k"]
      P5["2400 sqft, $480k"]
      P6["3000 sqft, $580k"]
  end
  data --> L["Best-fit line passes<br/>through the middle of the points"]
  L --> Q["New house 2000 sqft<br/>Predicted: ~$390k"]

In plain terms, the process works like this:

  1. Start with a guess: “price = some_weight times size + some_weight times bedrooms + a constant”
  2. Measure how far off the predictions are from actual sale prices
  3. Adjust the weights to shrink those errors
  4. Repeat until the errors stop improving

Two methods exist for finding the best weights. The normal equations solve for them directly using linear algebra. Gradient descent finds them iteratively by taking small steps downhill on the error surface. Both arrive at the same answer. Small datasets favor normal equations. Large datasets favor gradient descent.

Now let’s write this down mathematically.

The model

Let’s turn the intuition from above into math. We said: “price = some_weight times size + some_weight times bedrooms + a constant.” That’s exactly what the equation says:

$$\hat{y} = w^T x + b$$

Here $x$ is a list of features for one house, $\hat{y}$ is the predicted price, and $b$ is the bias (a baseline price even if all features were zero). But what is $w$?

What is a weight vector?

The weight vector $w$ is just a list of numbers, one per feature, that tells the model how much each feature contributes to the prediction: larger weight, bigger influence. Each number says "for every one-unit increase in this feature, the prediction goes up by this much."

Using our house data, suppose the model learns these weights:

| Feature     | Weight        | Meaning                                |
|-------------|---------------|----------------------------------------|
| Size (sqft) | $w_1 = 0.183$ | Each extra sqft adds $183 to the price |
| Bedrooms    | $w_2 = 15.0$  | Each extra bedroom adds $15k           |
| (Bias)      | $b = 35.0$    | Base price is $35k                     |

So for House D (2,100 sqft, 4 bedrooms), plug the values into the formula:

$$\hat{y} = w_1 \times \text{size} + w_2 \times \text{bedrooms} + b$$

$$\hat{y} = 0.183 \times 2100 + 15.0 \times 4 + 35.0 = 384.3 + 60 + 35 = 479.3$$

That’s close to the actual price of $480k. The prediction is just a weighted sum of the features plus a baseline.

In compact notation, $w = [0.183,\ 15.0]$ and $x = [2100,\ 4]$. The dot product $w^T x$ multiplies matching entries and adds them up: $0.183 \times 2100 + 15.0 \times 4 = 444.3$. Then add the bias $b = 35$ to get $479.3$.
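If you want to see that as code, here is a minimal NumPy sketch of the same arithmetic (the arrays simply mirror the numbers above):

import numpy as np

w = np.array([0.183, 15.0])   # weights: size, bedrooms
x = np.array([2100, 4])       # House D: 2,100 sqft, 4 bedrooms
b = 35.0                      # bias (baseline price in $k)

print(w @ x + b)              # ~479.3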

The bias trick: absorbing bb into ww

Tracking the bias bb separately is annoying. There’s a simple trick to fold it into ww so we only have one thing to deal with.

The idea: pretend every house has a fake feature that’s always 1. Then the bias just becomes another weight.

Before the trick (3 separate pieces):

$$\hat{y} = \underbrace{w_1}_{\text{size weight}} \cdot \text{size} + \underbrace{w_2}_{\text{bedroom weight}} \cdot \text{bedrooms} + \underbrace{b}_{\text{bias}}$$

After the trick (everything in one vector):

$$\hat{y} = \underbrace{w_0}_{=\,b\,=\,35} \cdot \underbrace{1}_{\text{fake feature}} + \underbrace{w_1}_{=\,0.183} \cdot \text{size} + \underbrace{w_2}_{=\,15} \cdot \text{bedrooms}$$

We prepend 1 to the feature vector and prepend bb to the weight vector:

$$x' = \begin{bmatrix} 1 \\ 2100 \\ 4 \end{bmatrix}, \quad w' = \begin{bmatrix} 35 \\ 0.183 \\ 15 \end{bmatrix}, \quad \hat{y} = w'^T x' = 35 \times 1 + 0.183 \times 2100 + 15 \times 4 = 479.3$$

Now the prediction is just $\hat{y} = w'^T x'$, one dot product, no separate bias term. From here on, when we write $w$ and $x$, we mean these extended versions (with the 1 prepended).
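The bias trick in NumPy looks like this (a small sketch; the numbers carry over from the table above):

import numpy as np

x_ext = np.array([1, 2100, 4])        # 1 prepended as the always-on bias feature
w_ext = np.array([35, 0.183, 15.0])   # bias folded in as the first weight

print(w_ext @ x_ext)                  # ~479.3, same prediction, one dot product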

Matrix form: predicting all houses at once

Instead of plugging in one house at a time, we can stack all 6 houses into a single table (matrix) and predict all prices in one shot.

Each row is one house. The first column is our fake “always 1” feature for the bias:

$$X = \begin{bmatrix} 1 & 1200 & 2 \\ 1 & 1500 & 3 \\ 1 & 1800 & 3 \\ 1 & 2100 & 4 \\ 1 & 2400 & 4 \\ 1 & 3000 & 5 \end{bmatrix}, \quad w = \begin{bmatrix} 35 \\ 0.183 \\ 15 \end{bmatrix}$$

When you multiply $Xw$, each row of $X$ gets dot-producted with $w$, giving one prediction per house:

$$Xw = \begin{bmatrix} 1 \times 35 + 1200 \times 0.183 + 2 \times 15 \\ 1 \times 35 + 1500 \times 0.183 + 3 \times 15 \\ \vdots \\ 1 \times 35 + 3000 \times 0.183 + 5 \times 15 \end{bmatrix} = \begin{bmatrix} 284.6 \\ 354.5 \\ \vdots \\ 659.0 \end{bmatrix}$$

In short: $\hat{y} = Xw$ gives all predictions at once. $X$ is an $n \times (d+1)$ matrix ($n$ = number of houses, $d$ = number of real features, +1 for the bias column).
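In NumPy, predicting every house is a single matrix-vector product (a minimal sketch using the six houses above):

import numpy as np

X = np.array([
    [1, 1200, 2],
    [1, 1500, 3],
    [1, 1800, 3],
    [1, 2100, 4],
    [1, 2400, 4],
    [1, 3000, 5],
])                               # one row per house, ones column for the bias
w = np.array([35, 0.183, 15.0])

print(X @ w)                     # approx [284.6 354.5 409.4 479.3 534.2 659.0]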

The cost function

Now we have a model that makes predictions. But how do we know if the predictions are any good?

A cost function (also called a loss function) is a single number that answers: "how far off are we?" Think of it like a score in a game where lower is better. If predictions exactly match the real prices, the cost is zero. The further off they are, the higher the cost. The goal of training is to make this number as small as possible.

The most common cost function for linear regression is mean squared error (MSE).

Mean squared error

We want predictions close to the true targets. MSE measures the average squared gap between prediction and target:

$$L(w) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{1}{n} \|Xw - y\|^2$$

Why do we square the errors instead of just adding them up? Two reasons:

  • First, squaring makes all errors positive (a prediction that’s 10k too high and one that’s 10k too low both count equally).
  • Second, squaring punishes big misses much harder than small ones: being off by 100k costs 100 times more than being off by 10k, which pushes the model to avoid large blunders.

As a bonus, the squared error creates a smooth, bowl-shaped (convex) surface with a single lowest point, so we're guaranteed to find the best possible weights. Wondering why? Think of the graph of $x^2$: a parabola opening upwards, shaped like a bowl. The lowest point is at $x = 0$, where the error is zero, and as you move away from zero in either direction the error increases smoothly. So as we adjust the weights to minimize the MSE, we can follow the slope down to the single global minimum without getting stuck in local minima.
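Here is a minimal sketch of computing the MSE in NumPy, reusing the house matrix and the illustrative weights from above:

import numpy as np

X = np.array([[1, 1200, 2], [1, 1500, 3], [1, 1800, 3],
              [1, 2100, 4], [1, 2400, 4], [1, 3000, 5]])
y = np.array([250, 310, 370, 420, 480, 580])   # actual prices in $k
w = np.array([35, 0.183, 15.0])                # the illustrative weights from earlier

mse = np.mean((X @ w - y) ** 2)
print(mse)                                     # about 2904, in ($k)^2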

Residual plot: prediction errors for each house

Minimizing the cost function

We’ve defined the cost function, now we need to find the weights that minimize it. There are two standard approaches:

Method 1: the normal equations

Setting the gradient of the loss $L$ to zero and solving gives the closed-form solution. We'll work with the un-averaged form $L(w) = \|Xw - y\|^2 = (Xw - y)^T(Xw - y)$.

Expand:

$$L(w) = w^T X^T X w - 2 w^T X^T y + y^T y$$

Take the gradient and set it to zero:

$$\nabla_w L = 2 X^T X w - 2 X^T y = 0$$

Solve for ww:

$$X^T X w = X^T y$$

$$w^* = (X^T X)^{-1} X^T y$$

This is the normal equations solution. It gives the exact optimal weights in one shot, no iteration needed.

Normal equations step by step:

graph TD
  A["Build design matrix X<br/>with column of ones prepended"] --> B["Compute X-transpose times X"]
  B --> C["Compute X-transpose times y"]
  C --> D["Invert the matrix X-transpose X"]
  D --> E["Multiply: w-star =<br/>inverse times X-transpose y"]
  E --> F["Exact optimal weights in one shot"]

When to use it: small datasets where $X^T X$ (a $(d+1) \times (d+1)$ matrix) is easy to invert. The cost is $O(d^3)$ for the inversion, plus $O(nd^2)$ to compute $X^T X$.

When not to use it: when $d$ is large (thousands of features), when $X^T X$ is singular or nearly so, or when data doesn't fit in memory.
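Before the worked example, here is what the normal equations look like as a small NumPy sketch. It solves $X^T X w = X^T y$ with np.linalg.solve rather than forming an explicit inverse, which is the numerically safer habit:

import numpy as np

def normal_equations(X, y):
    # Solve (X^T X) w = X^T y for w without computing an explicit inverse
    return np.linalg.solve(X.T @ X, X.T @ y)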

Example 1: fitting a line to 4 points using normal equations

Data points: $(1, 2), (2, 3), (3, 5), (4, 4)$.

We want $\hat{y} = w_0 + w_1 x$. Build the design matrix by prepending a column of ones:

$$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}, \quad y = \begin{bmatrix} 2 \\ 3 \\ 5 \\ 4 \end{bmatrix}$$

Step 1: compute $X^T X$.

$$X^T X = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix} = \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix}$$

Step 2: compute $X^T y$.

$$X^T y = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \\ 5 \\ 4 \end{bmatrix} = \begin{bmatrix} 14 \\ 39 \end{bmatrix}$$

Step 3: invert $X^T X$.

For a $2 \times 2$ matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$, the inverse is $\frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$.

$$\det = 4 \cdot 30 - 10 \cdot 10 = 120 - 100 = 20$$

$$(X^T X)^{-1} = \frac{1}{20} \begin{bmatrix} 30 & -10 \\ -10 & 4 \end{bmatrix} = \begin{bmatrix} 1.5 & -0.5 \\ -0.5 & 0.2 \end{bmatrix}$$

Step 4: compute $w^* = (X^T X)^{-1} X^T y$.

$$w^* = \begin{bmatrix} 1.5 & -0.5 \\ -0.5 & 0.2 \end{bmatrix} \begin{bmatrix} 14 \\ 39 \end{bmatrix} = \begin{bmatrix} 1.5 \cdot 14 + (-0.5) \cdot 39 \\ (-0.5) \cdot 14 + 0.2 \cdot 39 \end{bmatrix} = \begin{bmatrix} 21 - 19.5 \\ -7 + 7.8 \end{bmatrix} = \begin{bmatrix} 1.5 \\ 0.8 \end{bmatrix}$$

So $w_0 = 1.5$ (intercept) and $w_1 = 0.8$ (slope). The fitted line is:

$$\hat{y} = 1.5 + 0.8x$$

Verify: $\hat{y}(1) = 2.3$, $\hat{y}(2) = 3.1$, $\hat{y}(3) = 3.9$, $\hat{y}(4) = 4.7$. The predictions are close to the actual values $[2, 3, 5, 4]$.

MSE:

$$\text{MSE} = \frac{(2.3-2)^2 + (3.1-3)^2 + (3.9-5)^2 + (4.7-4)^2}{4} = \frac{0.09 + 0.01 + 1.21 + 0.49}{4} = \frac{1.80}{4} = 0.45$$
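You can verify the whole hand calculation with a few lines of NumPy:

import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 3, 5, 4])

w = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
print(w)                                # [1.5 0.8]
print(np.mean((X @ w - y) ** 2))        # 0.45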

Method 2: gradient descent

When the normal equations are too expensive, use gradient descent. It’s iterative: start with a guess, compute the gradient of the loss, and take a small step in the opposite direction.

The gradient of the MSE loss:

$$\nabla_w L = \frac{2}{n} X^T(Xw - y)$$

The update rule:

$$w \leftarrow w - \alpha \nabla_w L$$

where $\alpha$ is the learning rate. Repeat until the loss stops decreasing (or changes by less than some tolerance).

Gradient descent iteration loop:

graph TD
  A["Initialize weights to zero"] --> B["Compute predictions: X times w"]
  B --> C["Compute gradient of MSE loss"]
  C --> D["Update: w = w minus alpha times gradient"]
  D --> E{"Loss converged?"}
  E -->|No| B
  E -->|Yes| F["Final weights"]

In code:

import numpy as np

def gradient_descent(X, y, alpha=0.01, epochs=1000):
    n, d = X.shape
    w = np.zeros(d)                                    # start from all-zero weights
    for _ in range(epochs):
        predictions = X @ w                            # predictions for every row at once
        gradient = (2 / n) * X.T @ (predictions - y)   # gradient of the MSE loss
        w = w - alpha * gradient                       # step downhill
    return w

Example 2: gradient descent on the same 4 points

Same data: $X$ (with ones column) and $y = [2, 3, 5, 4]^T$.

Initialize $w = [0, 0]^T$. Learning rate $\alpha = 0.05$.

Iteration 1:

Predictions: $Xw = [0, 0, 0, 0]^T$

Residuals: $Xw - y = [-2, -3, -5, -4]^T$

Gradient:

$$\nabla_w L = \frac{2}{4} X^T \begin{bmatrix} -2 \\ -3 \\ -5 \\ -4 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} -2 \\ -3 \\ -5 \\ -4 \end{bmatrix}$$

$$= \frac{1}{2} \begin{bmatrix} -14 \\ -39 \end{bmatrix} = \begin{bmatrix} -7 \\ -19.5 \end{bmatrix}$$

Update:

$$w \leftarrow \begin{bmatrix} 0 \\ 0 \end{bmatrix} - 0.05 \begin{bmatrix} -7 \\ -19.5 \end{bmatrix} = \begin{bmatrix} 0.35 \\ 0.975 \end{bmatrix}$$

Iteration 2:

Predictions: $Xw = [0.35 + 0.975 \cdot 1, \; 0.35 + 0.975 \cdot 2, \; 0.35 + 0.975 \cdot 3, \; 0.35 + 0.975 \cdot 4]^T$

$= [1.325, \; 2.300, \; 3.275, \; 4.250]^T$

Residuals: $[1.325 - 2, \; 2.300 - 3, \; 3.275 - 5, \; 4.250 - 4]^T = [-0.675, \; -0.700, \; -1.725, \; 0.250]^T$

Gradient:

$$\nabla_w L = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} -0.675 \\ -0.700 \\ -1.725 \\ 0.250 \end{bmatrix}$$

$$= \frac{1}{2} \begin{bmatrix} -2.850 \\ -6.250 \end{bmatrix} = \begin{bmatrix} -1.425 \\ -3.125 \end{bmatrix}$$

Update:

$$w \leftarrow \begin{bmatrix} 0.35 \\ 0.975 \end{bmatrix} - 0.05 \begin{bmatrix} -1.425 \\ -3.125 \end{bmatrix} = \begin{bmatrix} 0.42125 \\ 1.13125 \end{bmatrix}$$

After just 2 steps, $w$ is already moving toward the true optimum $[1.5, 0.8]$. The intercept is rising toward 1.5 and the slope is overshooting 0.8 a bit. Over many more iterations, the weights converge.

After 1000 iterations (you can verify with code), gradient descent gives approximately $w \approx [1.5, 0.8]$, matching the normal equations.

import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 3, 5, 4])

w = np.array([0.0, 0.0])
alpha = 0.05
n = len(y)

for i in range(1000):
    pred = X @ w
    grad = (2 / n) * X.T @ (pred - y)
    w = w - alpha * grad

print(f"w = {w}")  # [1.49999..., 0.80000...]

Normal equations vs gradient descent

Choosing between the two methods:

graph TD
  Q{"How large is the<br/>feature count d?"} -->|"d is small" | NE["Normal equations:<br/>exact, one-shot, no tuning"]
  Q -->|"d is large or<br/>X-transpose X is singular"| GD["Gradient descent:<br/>iterative, scalable, flexible"]
  NE --> R["Solve: w = inverse of<br/>X-transpose X times X-transpose y"]
  GD --> S["Loop: w = w minus alpha<br/>times gradient until converged"]

|                      | Normal equations                    | Gradient descent                     |
|----------------------|-------------------------------------|--------------------------------------|
| Formula              | $w^* = (X^TX)^{-1}X^Ty$             | $w \leftarrow w - \alpha \nabla L$   |
| Exact?               | Yes                                 | Approximate (gets closer each step)  |
| Cost                 | $O(nd^2 + d^3)$                     | $O(nd)$ per step                     |
| Large $n$?           | Slow to compute $X^TX$              | Handles batches (SGD)                |
| Large $d$?           | $d^3$ inversion is expensive        | Fine, cost is linear in $d$          |
| Hyperparameters      | None                                | Learning rate, epochs                |
| Singular $X^TX$?     | Fails or needs pseudo-inverse       | Still works                          |

For most real problems, gradient descent (or its variants like SGD, Adam, RMSProp) is the go-to method. Normal equations are great for understanding and for small problems.

Geometric interpretation

The prediction $\hat{y} = Xw$ lives in the column space of $X$. The optimal $w^*$ is the one that makes $\hat{y}$ the orthogonal projection of $y$ onto this column space. The residual vector $y - \hat{y}$ is perpendicular to every column of $X$:

$$X^T(y - Xw^*) = 0$$

This is exactly the normal equation $X^TX w^* = X^T y$ rearranged. You're finding the closest point in the column space of $X$ to the target vector $y$, measured by Euclidean distance.
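You can check that orthogonality numerically on the 4-point example (a small sketch; the residual dotted with each column of X comes out as zero, up to floating-point error):

import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 3, 5, 4])

w_star = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ w_star

print(X.T @ residual)   # ~[0. 0.]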

Multi-feature linear regression

Everything extends naturally to multiple features. With dd features:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$$

The math is identical. $X$ becomes an $n \times (d+1)$ matrix, $w$ is a $(d+1)$-dimensional vector. Normal equations and gradient descent work the same way.

The only practical difference is that you should scale your features (as covered in the data pipeline article) before running gradient descent. Normal equations don’t need scaling because they solve directly.
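A rough sketch of what that scaling step can look like (standardization; it assumes the ones column sits first, as in our design matrices):

import numpy as np

def standardize(X):
    # Standardize the real feature columns, leave the bias column of ones alone
    X = X.astype(float).copy()
    mean = X[:, 1:].mean(axis=0)
    std = X[:, 1:].std(axis=0)
    X[:, 1:] = (X[:, 1:] - mean) / std
    return X, mean, std

Scaled features put size (in the thousands) and bedrooms (single digits) on comparable ranges, so one learning rate works for both weights.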

Common pitfalls

Forgetting the bias term. If you don't include a column of ones in $X$, your line is forced through the origin. Unless you have a reason to do this, always include the intercept.

Learning rate too large. Gradient descent overshoots and the loss explodes. Start with $\alpha = 0.01$ and adjust.

Learning rate too small. Convergence takes forever. If the loss barely changes between iterations, try increasing $\alpha$.

Not checking for convergence. Always plot the loss over iterations. It should decrease smoothly. If it oscillates, reduce the learning rate.

import matplotlib.pyplot as plt

# Track the MSE at every iteration (X, y, w, n, alpha as defined above)
losses = []
for i in range(1000):
    pred = X @ w
    loss = np.mean((pred - y) ** 2)
    losses.append(loss)
    grad = (2 / n) * X.T @ (pred - y)
    w = w - alpha * grad

# A healthy loss curve decreases smoothly and flattens out
plt.plot(losses)
plt.xlabel("Iteration")
plt.ylabel("MSE")
plt.title("Loss curve")
plt.show()

Summary

Linear regression finds the best linear function $\hat{y} = w^T x + b$ by minimizing mean squared error. You can solve it exactly with the normal equations or iteratively with gradient descent. The normal equations are elegant and exact but expensive for large datasets. Gradient descent is flexible, scales to any size, and generalizes to every other ML model you'll learn.

What comes next

Your model fits the training data, but how well does it work on new data? The next article, Bias, variance, and the tradeoff, explains overfitting, underfitting, and how to diagnose which problem you have.
