Linear regression
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
Prerequisites: *Data, features, and the ML pipeline* and *Partial derivatives and gradients*.
Linear regression is the simplest ML model and one of the most useful. You’re fitting a straight line (or hyperplane) through data to predict a continuous target. It’s the starting point for nearly everything else in this series.
Why linear regression?
You want to predict the selling price of a house. You have data on past sales - size, number of bedrooms, and the final price. Can you use that history to price a new listing?
Here are six recent sales in your neighborhood:
| House | Size (sqft) | Bedrooms | Price ($k) |
|---|---|---|---|
| A | 1,200 | 2 | 250 |
| B | 1,500 | 3 | 310 |
| C | 1,800 | 3 | 370 |
| D | 2,100 | 4 | 420 |
| E | 2,400 | 4 | 480 |
| F | 3,000 | 5 | 580 |
House prices: data points with best-fit regression line
Bigger houses cost more. More bedrooms cost more. The relationship looks roughly linear: each extra square foot adds a fairly consistent dollar amount to the price.
Linear regression draws the best-fit line through this data. Plug in the features of a new house and out comes a predicted price. The model assigns a weight to each feature (how much that feature matters) and adds a constant (the baseline price). The prediction is just a weighted sum.
Linear regression pipeline:
```mermaid
graph LR
    A["Input features:<br/>size, bedrooms"] --> B["Model:<br/>weighted sum + bias"]
    B --> C["Predicted price"]
    C --> D["Compare to<br/>actual price"]
    D --> E["Adjust weights<br/>to reduce error"]
    E --> B
```
Fitting a line through data:
```mermaid
graph TD
    subgraph data["Observed sales: size vs price"]
        P1["1200 sqft, $250k"]
        P2["1500 sqft, $310k"]
        P3["1800 sqft, $370k"]
        P4["2100 sqft, $420k"]
        P5["2400 sqft, $480k"]
        P6["3000 sqft, $580k"]
    end
    data --> L["Best-fit line passes<br/>through the middle of the points"]
    L --> Q["New house 2000 sqft<br/>Predicted: ~$390k"]
```
In plain terms, the process works like this:
- Start with a guess: “price = some_weight times size + some_weight times bedrooms + a constant”
- Measure how far off the predictions are from actual sale prices
- Adjust the weights to shrink those errors
- Repeat until the errors stop improving
Two methods exist for finding the best weights. The normal equations solve for them directly using linear algebra. Gradient descent finds them iteratively by taking small steps downhill on the error surface. Both arrive at the same answer. Small datasets favor normal equations. Large datasets favor gradient descent.
Now let’s write this down mathematically.
The model
Let’s turn the intuition from above into math. We said: “price = some_weight times size + some_weight times bedrooms + a constant.” That’s exactly what the equation says:

$$\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b = \mathbf{w} \cdot \mathbf{x} + b$$

Here $\mathbf{x} = (x_1, \dots, x_d)$ is a list of features for one house, $\hat{y}$ is the predicted price, and $b$ is the bias (a baseline price even if all features were zero). But what is $\mathbf{w}$?
What is a weight vector?
The weight vector $\mathbf{w}$ is just a list of numbers, one for each feature. Each number says “for every one-unit increase in this feature, the prediction goes up by this much.” Larger weight = bigger influence.
Using our house data, suppose the model learns these weights:
| Feature | Weight | Meaning |
|---|---|---|
| Size (sqft) | $w_1 = 0.183$ | Each extra sqft adds $183 to the price |
| Bedrooms | $w_2 = 15$ | Each extra bedroom adds $15k |
| (Bias) | $b = 35$ | Base price is $35k |
So for House D (2,100 sqft, 4 bedrooms), plug the values into the formula:

$$\hat{y} = 0.183 \times 2100 + 15 \times 4 + 35 = 384.3 + 60 + 35 = 479.3$$

That’s a predicted price of about $479k. (House D actually sold for $420k, so these supposed weights overshoot; fitting, covered below, is the process of adjusting the weights to shrink exactly this kind of error.) The prediction is just a weighted sum of the features plus a baseline.

In compact notation, $\mathbf{w} = (0.183, 15)$ and $\mathbf{x} = (2100, 4)$. The dot product multiplies matching entries and adds them up: $\mathbf{w} \cdot \mathbf{x} = 0.183 \times 2100 + 15 \times 4 = 444.3$. Then add the bias $b = 35$ to get $\hat{y} = 479.3$.
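The weighted-sum prediction is a one-liner in code. A quick sketch using the hypothetical weights from the table above (prices in thousands of dollars):

```python
import numpy as np

# Hypothetical weights from the table above (prices in $k)
w = np.array([0.183, 15.0])      # $183 per sqft, $15k per bedroom
b = 35.0                         # baseline price

x = np.array([2100, 4])          # House D: 2,100 sqft, 4 bedrooms
y_hat = np.dot(w, x) + b         # weighted sum plus bias
print(round(y_hat, 1))           # 479.3
```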
The bias trick: absorbing $b$ into $\mathbf{w}$

Tracking the bias $b$ separately is annoying. There’s a simple trick to fold it into $\mathbf{w}$ so we only have one thing to deal with.

The idea: pretend every house has a fake feature that’s always 1. Then the bias just becomes another weight.

Before the trick (3 separate pieces): $\hat{y} = \mathbf{w} \cdot \mathbf{x} + b$, with $\mathbf{w}$, $\mathbf{x}$, and $b$ all tracked separately.

After the trick (everything in one vector): we prepend 1 to the feature vector and prepend $b$ to the weight vector:

$$\mathbf{x} = (1, x_1, \dots, x_d), \qquad \mathbf{w} = (b, w_1, \dots, w_d)$$

Now the prediction is just $\hat{y} = \mathbf{w} \cdot \mathbf{x}$: one dot product, no separate bias term. From here on, when we write $\mathbf{w}$ and $\mathbf{x}$, we mean these extended versions (with the 1 prepended).
Matrix form: predicting all houses at once
Instead of plugging in one house at a time, we can stack all 6 houses into a single table (matrix) and predict all prices in one shot.
Each row is one house. The first column is our fake “always 1” feature for the bias:

$$X = \begin{bmatrix} 1 & 1200 & 2 \\ 1 & 1500 & 3 \\ 1 & 1800 & 3 \\ 1 & 2100 & 4 \\ 1 & 2400 & 4 \\ 1 & 3000 & 5 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} 250 \\ 310 \\ 370 \\ 420 \\ 480 \\ 580 \end{bmatrix}$$

When you multiply $X\mathbf{w}$, each row of $X$ gets dot-producted with $\mathbf{w}$, giving one prediction per house:

$$\hat{\mathbf{y}} = X\mathbf{w}$$

In short: $X\mathbf{w}$ gives all predictions at once. $X$ is an $n \times (d+1)$ matrix ($n$ = number of houses, $d$ = number of real features, +1 for the bias column).
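Here is the same computation as a sketch in NumPy: `X @ w` produces all six predictions in one matrix-vector product (the weights are the hypothetical ones from earlier):

```python
import numpy as np

# One row per house: [1 (bias column), size in sqft, bedrooms]
X = np.array([
    [1, 1200, 2],
    [1, 1500, 3],
    [1, 1800, 3],
    [1, 2100, 4],
    [1, 2400, 4],
    [1, 3000, 5],
], dtype=float)

# Extended weight vector: (bias, size weight, bedroom weight)
w = np.array([35.0, 0.183, 15.0])

y_hat = X @ w                 # six predictions in one shot
print(y_hat.shape)            # (6,)
print(round(y_hat[3], 1))     # House D: 479.3
```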
The cost function
Now we have a model that makes predictions. But how do we know if the predictions are any good?
A cost function (also called a loss function) is a single number that answers: “how far off are we?” Think of it like a score in a game where lower is better. If predictions exactly match the real prices, the cost is zero. The further off they are, the higher the cost. The goal of training is to make this number as small as possible.
The most common cost function for linear regression is mean squared error (MSE).
Mean squared error
We want predictions close to the true targets. MSE measures the average squared gap between prediction and truth:

$$L(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{1}{n} \|X\mathbf{w} - \mathbf{y}\|^2$$
Why do we square the errors instead of just adding them up? Two reasons:
- First, squaring makes all errors positive (a prediction that’s 10k too high and one that’s 10k too low both count equally).
- Second, squaring punishes big misses much harder than small ones: being off by 100k costs 100 times more than being off by 10k, which pushes the model to avoid large blunders.
As a bonus, the squared error creates a smooth, bowl-shaped (convex) surface with a single lowest point, so we’re guaranteed to find the best possible weights. If you’re wondering why: the graph of $y = x^2$ is a parabola opening upward, shaped like a bowl. The lowest point is at $x = 0$, where the error is zero; as you move away from zero in either direction, the error increases smoothly. So as we adjust our weights to minimize the MSE, we can follow the slope down to the single global minimum without getting stuck in local minima.
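As a sketch, MSE is a few lines of NumPy. The predictions below are made-up numbers, just to show the mechanics:

```python
import numpy as np

def mse(y_hat, y):
    """Average of squared residuals; lower is better, zero is perfect."""
    return np.mean((y_hat - y) ** 2)

y     = np.array([250, 310, 370, 420, 480, 580], dtype=float)  # actual prices ($k)
y_hat = np.array([255, 300, 365, 430, 475, 585], dtype=float)  # made-up predictions
print(mse(y_hat, y))  # 50.0
```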
Residual plot: prediction errors for each house
Minimizing the cost function
We’ve defined the cost function, now we need to find the weights that minimize it. There are two standard approaches:
Method 1: the normal equations
Setting the gradient of the loss to zero and solving gives a closed-form solution. We’ll work with the un-averaged form $L(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2$ (dropping the $\frac{1}{n}$ doesn’t change where the minimum is).

Expand:

$$L(\mathbf{w}) = (X\mathbf{w} - \mathbf{y})^\top (X\mathbf{w} - \mathbf{y}) = \mathbf{w}^\top X^\top X \mathbf{w} - 2\mathbf{y}^\top X \mathbf{w} + \mathbf{y}^\top \mathbf{y}$$

Take the gradient and set it to zero:

$$\nabla_{\mathbf{w}} L = 2X^\top X \mathbf{w} - 2X^\top \mathbf{y} = 0$$

Solve for $\mathbf{w}$:

$$\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$$
This is the normal equations solution. It gives the exact optimal weights in one shot, no iteration needed.
Normal equations step by step:
```mermaid
graph TD
    A["Build design matrix X<br/>with column of ones prepended"] --> B["Compute X-transpose times X"]
    B --> C["Compute X-transpose times y"]
    C --> D["Invert the matrix X-transpose X"]
    D --> E["Multiply: w-star =<br/>inverse times X-transpose y"]
    E --> F["Exact optimal weights in one shot"]
```
When to use it: small datasets where $X^\top X$ (a $(d+1) \times (d+1)$ matrix) is easy to invert. The cost is $O(d^3)$ for the inversion, plus $O(nd^2)$ to compute $X^\top X$.

When not to use it: when $d$ is large (thousands of features), when $X^\top X$ is singular or nearly so, or when the data doesn’t fit in memory.
Example 1: fitting a line to 4 points using normal equations
Data points: $(1, 2), (2, 3), (3, 5), (4, 4)$.

We want $\hat{y} = w_0 + w_1 x$. Build the design matrix by prepending a column of ones:

$$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} 2 \\ 3 \\ 5 \\ 4 \end{bmatrix}$$

Step 1: compute $X^\top X$.

$$X^\top X = \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix}$$

Step 2: compute $X^\top \mathbf{y}$.

$$X^\top \mathbf{y} = \begin{bmatrix} 2 + 3 + 5 + 4 \\ 2 + 6 + 15 + 16 \end{bmatrix} = \begin{bmatrix} 14 \\ 39 \end{bmatrix}$$

Step 3: invert $X^\top X$.

For a $2 \times 2$ matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$, the inverse is $\frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$. Here $ad - bc = 4 \cdot 30 - 10 \cdot 10 = 20$, so:

$$(X^\top X)^{-1} = \frac{1}{20} \begin{bmatrix} 30 & -10 \\ -10 & 4 \end{bmatrix}$$

Step 4: compute $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$.

$$\mathbf{w}^* = \frac{1}{20} \begin{bmatrix} 30 & -10 \\ -10 & 4 \end{bmatrix} \begin{bmatrix} 14 \\ 39 \end{bmatrix} = \frac{1}{20} \begin{bmatrix} 420 - 390 \\ -140 + 156 \end{bmatrix} = \begin{bmatrix} 1.5 \\ 0.8 \end{bmatrix}$$

So $w_0 = 1.5$ (intercept) and $w_1 = 0.8$ (slope). The fitted line is:

$$\hat{y} = 1.5 + 0.8x$$

Verify: $\hat{y}(1) = 2.3$, $\hat{y}(2) = 3.1$, $\hat{y}(3) = 3.9$, $\hat{y}(4) = 4.7$. The predictions are close to the actual values $(2, 3, 5, 4)$.

MSE:

$$\text{MSE} = \frac{0.3^2 + 0.1^2 + (-1.1)^2 + 0.7^2}{4} = \frac{1.8}{4} = 0.45$$
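The same answer falls out of a few lines of NumPy. One detail worth copying: `np.linalg.solve(X.T @ X, X.T @ y)` solves the normal equations directly, which is more numerically stable than explicitly inverting the matrix:

```python
import numpy as np

# Same 4 points as the worked example
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
y = np.array([2, 3, 5, 4], dtype=float)

# Solve (X^T X) w = X^T y rather than computing an explicit inverse
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # approximately [1.5, 0.8]
```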
Method 2: gradient descent
When the normal equations are too expensive, use gradient descent. It’s iterative: start with a guess, compute the gradient of the loss, and take a small step in the opposite direction.
The gradient of the MSE loss:

$$\nabla_{\mathbf{w}} L = \frac{2}{n} X^\top (X\mathbf{w} - \mathbf{y})$$

The update rule:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} L$$

where $\alpha$ is the learning rate. Repeat until the loss stops decreasing (or changes by less than some tolerance).
Gradient descent iteration loop:
```mermaid
graph TD
    A["Initialize weights to zero"] --> B["Compute predictions: X times w"]
    B --> C["Compute gradient of MSE loss"]
    C --> D["Update: w = w minus alpha times gradient"]
    D --> E{"Loss converged?"}
    E -->|No| B
    E -->|Yes| F["Final weights"]
```
```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, epochs=1000):
    """Minimize MSE by repeatedly stepping against the gradient."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        predictions = X @ w                            # current predictions
        gradient = (2 / n) * X.T @ (predictions - y)   # gradient of MSE
        w = w - alpha * gradient                       # step downhill
    return w
```
Example 2: gradient descent on the same 4 points
Same data: $X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}$ (with the ones column) and $\mathbf{y} = (2, 3, 5, 4)$.

Initialize $\mathbf{w} = (0, 0)$. Learning rate $\alpha = 0.05$.

Iteration 1:

Predictions: $X\mathbf{w} = (0, 0, 0, 0)$

Residuals: $X\mathbf{w} - \mathbf{y} = (-2, -3, -5, -4)$

Gradient: $\frac{2}{4} X^\top (X\mathbf{w} - \mathbf{y}) = \frac{1}{2} \begin{bmatrix} -14 \\ -39 \end{bmatrix} = \begin{bmatrix} -7 \\ -19.5 \end{bmatrix}$

Update: $\mathbf{w} = (0, 0) - 0.05 \cdot (-7, -19.5) = (0.35, 0.975)$

Iteration 2:

Predictions: $X\mathbf{w} = (1.325, 2.3, 3.275, 4.25)$

Residuals: $(-0.675, -0.7, -1.725, -0.25)$

Gradient: $\frac{1}{2} \begin{bmatrix} -3.35 \\ -8.25 \end{bmatrix} = \begin{bmatrix} -1.675 \\ -4.125 \end{bmatrix}$

Update: $\mathbf{w} = (0.35, 0.975) - 0.05 \cdot (-1.675, -4.125) = (0.434, 1.181)$

After just 2 steps, $\mathbf{w}$ is already moving toward the true optimum $(1.5, 0.8)$. The intercept is rising toward 1.5 and the slope is overshooting 0.8 a bit. Over many more iterations, the weights converge.
After 1000 iterations (you can verify with code), gradient descent gives approximately , matching the normal equations.
```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 3, 5, 4])

w = np.array([0.0, 0.0])
alpha = 0.05
n = len(y)

for i in range(1000):
    pred = X @ w
    grad = (2 / n) * X.T @ (pred - y)
    w = w - alpha * grad

print(f"w = {w}")  # [1.49999..., 0.80000...]
```
Normal equations vs gradient descent
Choosing between the two methods:
```mermaid
graph TD
    Q{"How large is the<br/>feature count d?"} -->|"d is small"| NE["Normal equations:<br/>exact, one-shot, no tuning"]
    Q -->|"d is large or<br/>X-transpose X is singular"| GD["Gradient descent:<br/>iterative, scalable, flexible"]
    NE --> R["Solve: w = inverse of<br/>X-transpose X times X-transpose y"]
    GD --> S["Loop: w = w minus alpha<br/>times gradient until converged"]
```
| | Normal equations | Gradient descent |
|---|---|---|
| Formula | $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$ | $\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} L$ |
| Exact? | Yes | Approximate (gets closer each step) |
| Cost | $O(nd^2 + d^3)$ | $O(nd)$ per step |
| Large $n$? | Slow to compute $X^\top X$ | Handles batches (SGD) |
| Large $d$? | $O(d^3)$ inversion is expensive | Fine, cost is linear in $d$ |
| Hyperparameters | None | Learning rate, epochs |
| Singular $X^\top X$? | Fails or needs pseudo-inverse | Still works |
For most real problems, gradient descent (or its variants like SGD, Adam, RMSProp) is the go-to method. Normal equations are great for understanding and for small problems.
Geometric interpretation
The prediction $\hat{\mathbf{y}} = X\mathbf{w}$ lives in the column space of $X$. The optimal $\mathbf{w}^*$ is the one that makes $\hat{\mathbf{y}}$ the orthogonal projection of $\mathbf{y}$ onto this column space. The residual vector $\mathbf{y} - X\mathbf{w}^*$ is perpendicular to every column of $X$:

$$X^\top (\mathbf{y} - X\mathbf{w}^*) = 0$$

This is exactly the normal equation rearranged. You’re finding the closest point in the column space of $X$ to the target vector $\mathbf{y}$, measured by Euclidean distance.
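You can check this orthogonality numerically. A small sketch reusing the four-point example:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
y = np.array([2, 3, 5, 4], dtype=float)

w_star = np.linalg.solve(X.T @ X, X.T @ y)   # optimal weights
residual = y - X @ w_star

# The residual is perpendicular to every column of X
# (both entries are zero up to floating-point rounding)
print(X.T @ residual)
```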
Multi-feature linear regression
Everything extends naturally to multiple features. With $d$ features:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$$

The math is identical. $X$ becomes an $n \times (d+1)$ matrix, $\mathbf{w}$ is a $(d+1)$-dimensional vector. Normal equations and gradient descent work the same way.
The only practical difference is that you should scale your features (as covered in the data pipeline article) before running gradient descent. Normal equations don’t need scaling because they solve directly.
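A minimal standardization sketch, assuming the scale-then-fit workflow from the pipeline article. One subtlety: scale only the real feature columns, then prepend the ones column afterward, because a constant column has zero variance and can’t be standardized.

```python
import numpy as np

def standardize(X):
    """Shift each column to zero mean and scale to unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Raw features only: [size, bedrooms] -- no ones column yet
raw = np.array([[1200, 2], [1500, 3], [1800, 3],
                [2100, 4], [2400, 4], [3000, 5]], dtype=float)

scaled, mu, sigma = standardize(raw)

# Now sqft and bedrooms live on comparable scales, so a single
# learning rate works for both weights. Prepend the ones column last:
X = np.hstack([np.ones((len(scaled), 1)), scaled])
```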
Common pitfalls
Forgetting the bias term. If you don’t include a column of ones in $X$, your line is forced through the origin. Unless you have a reason to do this, always include the intercept.

Learning rate too large. Gradient descent overshoots and the loss explodes. Start small (e.g. $\alpha = 0.01$) and adjust.

Learning rate too small. Convergence takes forever. If the loss barely changes between iterations, try increasing $\alpha$.

Not checking for convergence. Always plot the loss over iterations. It should decrease smoothly. If it oscillates, reduce the learning rate.
```python
import numpy as np
import matplotlib.pyplot as plt

# Same setup as before: 4 points, design matrix with ones column
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]])
y = np.array([2, 3, 5, 4])
w = np.array([0.0, 0.0])
alpha = 0.05
n = len(y)

losses = []
for i in range(1000):
    pred = X @ w
    loss = np.mean((pred - y) ** 2)  # record MSE at every step
    losses.append(loss)
    grad = (2 / n) * X.T @ (pred - y)
    w = w - alpha * grad

plt.plot(losses)
plt.xlabel("Iteration")
plt.ylabel("MSE")
plt.title("Loss curve")
plt.show()
```
Summary
Linear regression finds the best linear function by minimizing mean squared error. You can solve it exactly with the normal equations or iteratively with gradient descent. The normal equations are elegant and exact but expensive for large datasets. Gradient descent is flexible, scales to any size, and the same ideas power nearly every other model in this series.
What comes next
Your model fits the training data, but how well does it work on new data? The next article, Bias, variance, and the tradeoff, explains overfitting, underfitting, and how to diagnose which problem you have.