Least squares: the closed-form solution
In this series (18 parts)
- What is optimization and why ML needs it
- Convex sets and convex functions
- Optimality conditions: first order
- Optimality conditions: second order
- Line search methods
- Least squares: the closed-form solution
- Steepest descent (gradient descent)
- Newton's method for optimization
- Quasi-Newton methods: BFGS and L-BFGS
- Conjugate gradient methods
- Constrained optimization and Lagrangian duality
- KKT conditions
- Penalty and barrier methods
- Interior point methods
- The simplex method
- Frank-Wolfe method
- Optimization in dynamic programming and optimal control
- Stochastic gradient descent and variants
Most optimization problems require iterative algorithms. Least squares is different. You can write down the answer in one formula. This closed-form solution is why linear regression is fast, reliable, and one of the first models everyone learns. Understanding where it comes from, and when it breaks, gives you a solid foundation for everything else.
The data fitting problem
You have measured data. You want a line (or plane, or hyperplane) that fits it as closely as possible. Here are five data points:
| Observation | $x$ | $y$ (actual) |
|---|---|---|
| 1 | 1 | 2.1 |
| 2 | 2 | 3.8 |
| 3 | 3 | 6.2 |
| 4 | 4 | 7.9 |
| 5 | 5 | 10.1 |
The pattern looks roughly linear. You want to find the line that best explains these points. But “best” needs a definition.
For each point, the residual $r_i = y_i - \hat{y}_i$ is the gap between the actual value $y_i$ and the predicted value $\hat{y}_i$. Least squares says: find the line that makes the total squared residual $\sum_i r_i^2$ as small as possible. Why squared? Squaring penalizes large errors more than small ones, and it gives a smooth, differentiable objective with a single minimum.
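To make this concrete, here is a short sketch (using NumPy, not part of the original table) that fits a line to the five points above by minimizing the sum of squared residuals:

```python
import numpy as np

# The five data points from the table above
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.8, 6.2, 7.9, 10.1])

# Fit y ≈ w0 + w1*x by least squares; polyfit returns [slope, intercept]
w1, w0 = np.polyfit(x, y, deg=1)
print(f"Best-fit line: y = {w0:.2f} + {w1:.2f}x")
# Best-fit line: y = -0.01 + 2.01x

residuals = y - (w0 + w1 * x)
print(f"Sum of squared residuals: {np.sum(residuals**2):.4f}")
# Sum of squared residuals: 0.1070
```

The fitted slope of about 2 matches the visual impression that $y$ grows by roughly 2 for each unit of $x$.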
Residuals: the gaps between data and the fitted line
graph TD
D1["Data point 1"] --- R1["Residual: actual - predicted"]
D2["Data point 2"] --- R2["Residual: actual - predicted"]
D3["Data point 3"] --- R3["Residual: actual - predicted"]
R1 --- L["Fitted line: y-hat = w0 + w1 x"]
R2 --- L
R3 --- L
L --- OBJ["Minimize sum of squared residuals"]
style OBJ fill:#ccffcc
Why least squares and not something else?
graph LR
A["Sum of absolute errors"] -->|"Not smooth at zero"| D["Harder to optimize"]
B["Sum of squared errors"] -->|"Smooth, convex, one minimum"| E["Clean closed-form solution"]
C["Maximum error"] -->|"Ignores most data"| F["Sensitive to outliers"]
style B fill:#ccffcc
style E fill:#ccffcc
Least squares fit: the best-fit line minimizes the total squared distance (dashed orange) from each data point to the line
Least squares has a unique advantage: the objective is a smooth quadratic in the weights, which means we can solve for the optimum exactly using calculus. No iterative algorithm needed.
Prerequisites
You should be comfortable with:
- Matrix algebra, including matrix multiplication, inverses, and solving linear systems
- First-order optimality conditions
The least squares problem
Given a data matrix $X \in \mathbb{R}^{n \times d}$ (with $n$ samples and $d$ features) and a target vector $y \in \mathbb{R}^n$, find the weight vector $w \in \mathbb{R}^d$ that minimizes:

$$f(w) = \|Xw - y\|_2^2 = \sum_{i=1}^{n} (x_i^\top w - y_i)^2$$

This is the sum of squared residuals. Each residual $r_i = x_i^\top w - y_i$ measures how far the prediction $x_i^\top w$ is from the true value $y_i$, and we want to make the total squared error as small as possible.
Deriving the normal equations
Step 1. Expand the objective:

$$f(w) = (Xw - y)^\top (Xw - y) = w^\top X^\top X w - 2 y^\top X w + y^\top y$$

Step 2. Take the gradient with respect to $w$:

$$\nabla f(w) = 2 X^\top X w - 2 X^\top y$$

Step 3. Apply the first-order condition. Set the gradient to zero:

$$X^\top X w = X^\top y$$

These are the normal equations. If $X^\top X$ is invertible, the unique solution is:

$$w^* = (X^\top X)^{-1} X^\top y$$
How the normal equations are derived
flowchart TD
A["Objective: minimize squared residuals"] --> B["Expand into matrix form"]
B --> C["Take gradient w.r.t. w"]
C --> D["Set gradient to zero"]
D --> E["Normal equations: X-transpose X w = X-transpose y"]
E --> F{"Is X-transpose X invertible?"}
F -->|"Yes"| G["Unique solution: w = (X-transpose X) inverse times X-transpose y"]
F -->|"No"| H["Need regularization or pseudoinverse"]
Step 4. Verify this is a minimum (not a maximum or saddle point). The Hessian is:

$$\nabla^2 f(w) = 2 X^\top X$$

Since $X^\top X$ is positive semidefinite (as we showed in the convexity article), the objective is convex. When $X^\top X$ is invertible, it is positive definite, so the solution is the unique global minimum.
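As a quick numerical sanity check (a sketch with made-up random data, not from the derivation above), we can confirm that the gradient $2X^\top(Xw - y)$ vanishes at the closed-form solution and that the Hessian is positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))  # 20 samples, 3 features
y = rng.standard_normal(20)

# Closed-form solution from the normal equations
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient of f(w) = ||Xw - y||^2 is 2 X^T (Xw - y)
grad = 2 * X.T @ (X @ w_star - y)
print(np.allclose(grad, 0))  # True: first-order condition holds

# Hessian 2 X^T X has nonnegative eigenvalues: the objective is convex
eigvals = np.linalg.eigvalsh(2 * X.T @ X)
print(np.all(eigvals >= 0))  # True
```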
Example 1: Solving a 3-point linear regression by hand
Data points: $(1, 2)$, $(2, 4)$, $(3, 5)$. We fit the line $y = w_0 + w_1 x$.

Step 1. Build the design matrix with a column of ones for the intercept:

$$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \qquad y = \begin{bmatrix} 2 \\ 4 \\ 5 \end{bmatrix}$$

Step 2. Compute $X^\top X$:

$$X^\top X = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix}$$

Step 3. Compute $X^\top y$:

$$X^\top y = \begin{bmatrix} 11 \\ 25 \end{bmatrix}$$

Step 4. Solve $X^\top X \, w = X^\top y$:

$$\begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = \begin{bmatrix} 11 \\ 25 \end{bmatrix}$$

The inverse of a $2 \times 2$ matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$ is $\frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$.

Step 5. Multiply:

$$w = \frac{1}{3 \cdot 14 - 6 \cdot 6} \begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix} \begin{bmatrix} 11 \\ 25 \end{bmatrix} = \frac{1}{6} \begin{bmatrix} 4 \\ 9 \end{bmatrix} = \begin{bmatrix} 2/3 \\ 3/2 \end{bmatrix}$$
The best-fit line is $y = \tfrac{2}{3} + \tfrac{3}{2} x$.

Step 6. Check predictions: $\hat{y}_1 = \tfrac{2}{3} + \tfrac{3}{2} = \tfrac{13}{6} \approx 2.17$, $\hat{y}_2 = \tfrac{2}{3} + 3 = \tfrac{11}{3} \approx 3.67$, $\hat{y}_3 = \tfrac{2}{3} + \tfrac{9}{2} = \tfrac{31}{6} \approx 5.17$.

Residuals: $r_1 = \tfrac{13}{6} - 2 = \tfrac{1}{6}$, $r_2 = \tfrac{11}{3} - 4 = -\tfrac{1}{3}$, $r_3 = \tfrac{31}{6} - 5 = \tfrac{1}{6}$.

Sum of squared residuals: $\tfrac{1}{36} + \tfrac{4}{36} + \tfrac{1}{36} = \tfrac{1}{6} \approx 0.1667$.
import numpy as np
X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 4, 5])
w = np.linalg.solve(X.T @ X, X.T @ y)
print(f"w0 = {w[0]:.4f}, w1 = {w[1]:.4f}")
# w0 = 0.6667, w1 = 1.5000
residuals = X @ w - y
sse = residuals @ residuals
print(f"Sum of squared residuals: {sse:.4f}")
# Sum of squared residuals: 0.1667
Example 2: Comparing with scikit-learn
Let us verify our hand calculation using scikit-learn:
from sklearn.linear_model import LinearRegression
import numpy as np
x = np.array([1, 2, 3]).reshape(-1, 1)
y = np.array([2, 4, 5])
model = LinearRegression().fit(x, y)
print(f"Intercept: {model.intercept_:.4f}") # 0.6667
print(f"Slope: {model.coef_[0]:.4f}") # 1.5000
print(f"Predictions: {model.predict(x)}")
# [2.1667, 3.6667, 5.1667]
Under the hood, scikit-learn solves the same least squares problem, using a numerically stable SVD-based solver rather than explicitly forming and inverting $X^\top X$.
Geometric interpretation
The normal equations have a beautiful geometric meaning. The prediction vector $\hat{y} = Xw$ lives in the column space of $X$, which is a subspace of $\mathbb{R}^n$. The normal equations find the $w^*$ such that $\hat{y} = Xw^*$ is the closest point in the column space to $y$.

The residual vector $r = y - Xw^*$ is orthogonal to the column space of $X$:

$$X^\top (y - Xw^*) = 0$$

This is a projection. The fitted values $\hat{y} = Xw^*$ are the orthogonal projection of $y$ onto the column space of $X$. The dot product of the residual with any column of $X$ is zero.

The projection matrix is:

$$P = X (X^\top X)^{-1} X^\top$$

and $\hat{y} = Py$.
graph TD
Y["y (target vector)"] -->|"Project onto column space"| YHat["ŷ = Xw* (prediction)"]
Y -->|"Residual r = y - ŷ"| R["r ⊥ column space of X"]
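We can verify both properties numerically. This sketch reuses the 3-point data from Example 1 and checks that the residual is orthogonal to the columns of $X$ and that $P$ is an idempotent projection:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 4, 5])

w = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ w

# Residual is orthogonal to every column of X
print(np.allclose(X.T @ residual, 0))  # True

# Projection matrix P = X (X^T X)^{-1} X^T is idempotent: P @ P = P
P = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(P @ P, P))  # True
print(np.allclose(P @ y, X @ w))  # True: y-hat = P y
```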
When the normal equations fail
The formula $w^* = (X^\top X)^{-1} X^\top y$ requires $X^\top X$ to be invertible. This fails when:

1. More features than samples ($d > n$)

If you have 100 features but only 50 data points, $X^\top X$ is a $100 \times 100$ matrix with rank at most 50 (the rank of $X$). A $100 \times 100$ matrix with rank 50 is singular. There are infinitely many solutions, and the system is underdetermined.
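In the underdetermined case, NumPy's pseudoinverse picks the minimum-norm solution among the infinitely many that fit the data exactly. A sketch with made-up random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 10))  # 5 samples, 10 features: underdetermined
y = rng.standard_normal(5)

# X^T X is 10x10 but has rank at most 5, so it is singular
print(np.linalg.matrix_rank(X.T @ X))  # 5

# The pseudoinverse returns the minimum-norm w among all exact fits
w = np.linalg.pinv(X) @ y
print(np.allclose(X @ w, y))  # True: zero residual
```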
2. Multicollinearity
If some columns of $X$ are linearly dependent (e.g., you include both temperature in Celsius and Fahrenheit), $X^\top X$ is singular. Even near-collinearity (columns that are almost but not quite dependent) makes $X^\top X$ nearly singular, leading to huge and unstable parameter estimates.
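The Celsius/Fahrenheit example can be checked directly (a sketch with illustrative temperatures):

```python
import numpy as np

celsius = np.array([0.0, 10.0, 20.0, 30.0])
fahrenheit = 1.8 * celsius + 32  # exact linear function of Celsius

# Design matrix with intercept, Celsius, and Fahrenheit columns
X = np.column_stack([np.ones(4), celsius, fahrenheit])

# The Fahrenheit column is a linear combination of the other two,
# so X^T X is singular (rank 2 instead of 3)
print(np.linalg.matrix_rank(X.T @ X))  # 2
print(np.linalg.cond(X.T @ X))  # astronomically large
```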
3. Numerical issues
Even when $X^\top X$ is technically invertible, a large condition number means small errors in the data get amplified into large errors in $w$. This is why, in practice, software uses the SVD of $X$ rather than explicitly forming and inverting $X^\top X$.
Example 3: Ridge regression fixes the problem
Ridge regression (also called Tikhonov regularization) adds an L2 penalty:

$$f(w) = \|Xw - y\|_2^2 + \lambda \|w\|_2^2$$

where $\lambda > 0$ is the regularization parameter.

The gradient is:

$$\nabla f(w) = 2 X^\top X w - 2 X^\top y + 2 \lambda w$$

Setting it to zero:

$$w^* = (X^\top X + \lambda I)^{-1} X^\top y$$

The matrix $X^\top X + \lambda I$ is always invertible for $\lambda > 0$, because adding $\lambda$ to every eigenvalue of $X^\top X$ makes them all positive.
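The eigenvalue-shift argument can be seen numerically. This sketch uses a deliberately rank-deficient matrix (made up for illustration) whose second column is twice the first:

```python
import numpy as np

# A rank-deficient design matrix: the second column is twice the first
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
lam = 0.5

eigs = np.linalg.eigvalsh(X.T @ X)
print(eigs)  # one eigenvalue is 0: X^T X is singular

# Adding lam * I shifts every eigenvalue up by exactly lam
eigs_ridge = np.linalg.eigvalsh(X.T @ X + lam * np.eye(2))
print(np.allclose(eigs_ridge, eigs + lam))  # True
print(np.all(eigs_ridge > 0))  # True: always invertible
```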
Numerical example. Using the same data as Example 1, with $\lambda = 2$:

$$w^* = \left( \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} + 2 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right)^{-1} \begin{bmatrix} 11 \\ 25 \end{bmatrix} = \begin{bmatrix} 5 & 6 \\ 6 & 16 \end{bmatrix}^{-1} \begin{bmatrix} 11 \\ 25 \end{bmatrix} = \frac{1}{44} \begin{bmatrix} 26 \\ 59 \end{bmatrix} \approx \begin{bmatrix} 0.591 \\ 1.341 \end{bmatrix}$$

Compare with the unregularized solution: $w_0 = 0.667$, $w_1 = 1.500$.

Ridge regression shrinks the coefficients toward zero. The slope dropped from $1.500$ to $1.341$, and the intercept from $0.667$ to $0.591$. The bias-variance tradeoff tells us that this small increase in bias can lead to a larger decrease in variance, giving better predictions on new data.
import numpy as np
X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 4, 5])
lam = 2
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(f"OLS: w0={w_ols[0]:.3f}, w1={w_ols[1]:.3f}")
print(f"Ridge: w0={w_ridge[0]:.3f}, w1={w_ridge[1]:.3f}")
# OLS: w0=0.667, w1=1.500
# Ridge: w0=0.591, w1=1.341
We can also compare using scikit-learn’s Ridge:
from sklearn.linear_model import Ridge
import numpy as np
x = np.array([1, 2, 3]).reshape(-1, 1)
y = np.array([2, 4, 5])
# Note: sklearn's Ridge minimizes ||y - Xw||^2 + alpha * ||w||^2,
# so alpha = lambda matches our formulation. We pass the intercept
# column explicitly and set fit_intercept=False so the intercept
# is penalized too, exactly as in our derivation.
model = Ridge(alpha=2, fit_intercept=False).fit(
    np.column_stack([np.ones(3), x]), y
)
print(f"Coefficients: {model.coef_}")
Computational cost
Solving the normal equations directly requires:
- Computing $X^\top X$: $O(nd^2)$
- Inverting or factoring $X^\top X$: $O(d^3)$

Total: $O(nd^2 + d^3)$.

This is fast when $d$ is small (a few hundred features). When $d$ is large, the $O(d^3)$ term dominates and becomes impractical. Alternatives include:
| Method | Cost | When to use |
|---|---|---|
| Normal equations | $O(nd^2 + d^3)$ | Small $d$ |
| QR decomposition | $O(nd^2)$ | Medium $d$, better numerics |
| SVD | $O(nd^2)$ | When $X$ is ill-conditioned |
| Gradient descent | $O(nd)$ per iteration | Large $n$ and $d$ |
For very large datasets, iterative methods like gradient descent or SGD are preferred because they never need to form or invert a matrix.
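The three exact methods agree on a well-conditioned problem. This sketch (with made-up random data) solves the same least squares problem via the normal equations, QR, and SVD:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)

# Normal equations: fast but squares the condition number of X
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# QR: X = QR with Q orthonormal, so the problem reduces to R w = Q^T y
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)

# SVD-based least squares, as used by library routines
w_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_qr) and np.allclose(w_normal, w_svd))  # True
```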
Closed-form vs iterative: when to use which
graph TD
A["Least squares problem"] --> B{"How large is d?"}
B -->|"Small: d less than a few hundred"| C["Normal equations: exact, fast"]
B -->|"Medium"| D["QR or SVD: stable, exact"]
B -->|"Large: d in thousands or more"| E["Gradient descent or SGD: scalable"]
C --> F["One-shot computation"]
D --> F
E --> G["Iterative: many small steps"]
style C fill:#ccffcc
style E fill:#fff3cd
Connection to other topics
PCA and least squares
Principal component analysis (PCA) can be viewed as finding the best low-rank approximation to the data matrix, which is itself a least squares problem. The principal components are directions that minimize the sum of squared distances from the data to the subspace.
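The Eckart-Young theorem makes this precise: the best rank-$k$ approximation in the least squares (Frobenius) sense comes from the truncated SVD, and the squared error equals the sum of the discarded squared singular values. A sketch with made-up random data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))

# Best rank-1 approximation in the least squares (Frobenius) sense
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A1 = s[0] * np.outer(U[:, 0], Vt[0])

# The squared error equals the sum of the discarded squared singular values
err = np.linalg.norm(A - A1, "fro") ** 2
print(np.isclose(err, np.sum(s[1:] ** 2)))  # True
```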
Weighted least squares
Sometimes different data points have different importance. Weighted least squares minimizes:

$$f(w) = (Xw - y)^\top W (Xw - y)$$

where $W$ is a diagonal weight matrix. The solution is:

$$w^* = (X^\top W X)^{-1} X^\top W y$$
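A minimal sketch of weighted least squares, reusing the 3-point data from Example 1 with an illustrative weight matrix that trusts the third observation five times as much as the others:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 4, 5])

# Trust the third observation five times as much as the others
W = np.diag([1.0, 1.0, 5.0])

w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(w)

# With equal weights (W = I) we recover ordinary least squares
w_ols = np.linalg.solve(X.T @ np.eye(3) @ X, X.T @ np.eye(3) @ y)
print(np.allclose(w_ols, np.linalg.solve(X.T @ X, X.T @ y)))  # True
```

At the weighted solution, the residual is orthogonal to the columns of $X$ in the $W$-weighted inner product, mirroring the geometric picture from ordinary least squares.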
From closed form to iterative
Least squares is the last problem in this series where we can write down the answer directly. From here on, we need iterative algorithms. Gradient descent (steepest descent) is the simplest and most important, and it is the subject of the next article.
Summary
| Concept | Key point |
|---|---|
| Normal equations | $X^\top X w = X^\top y$ |
| Closed-form solution | $w^* = (X^\top X)^{-1} X^\top y$ |
| Geometric view | Project $y$ onto the column space of $X$ |
| Failure cases | $d > n$, multicollinearity, ill-conditioning |
| Ridge fix | $w^* = (X^\top X + \lambda I)^{-1} X^\top y$, always invertible |
| Cost | $O(nd^2 + d^3)$; use iterative methods for large $d$ |
Least squares is one of the most important results in all of applied mathematics. It connects statistics (linear regression), geometry (orthogonal projection), and optimization (convex minimization with a closed-form solution) into one elegant framework.
What comes next
The next article introduces steepest descent (gradient descent), the workhorse iterative algorithm for when a closed-form solution does not exist.