
Least squares: the closed-form solution

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Most optimization problems require iterative algorithms. Least squares is different. You can write down the answer in one formula. This closed-form solution is why linear regression is fast, reliable, and one of the first models everyone learns. Understanding where it comes from, and when it breaks, gives you a solid foundation for everything else.

The data fitting problem

You have measured data. You want a line (or plane, or hyperplane) that fits it as closely as possible. Here are five data points:

| Observation | $x$ | $y$ (actual) |
| --- | --- | --- |
| 1 | 1 | 2.1 |
| 2 | 2 | 3.8 |
| 3 | 3 | 6.2 |
| 4 | 4 | 7.9 |
| 5 | 5 | 10.1 |

The pattern looks roughly linear. You want to find the line $\hat{y} = w_0 + w_1 x$ that best explains these points. But “best” needs a definition.

For each point, the residual is the gap between the actual $y$ and the predicted $\hat{y}$. Least squares says: find the line that makes the total squared residual as small as possible. Why squared? Squaring penalizes large errors more than small ones, and it gives a smooth, differentiable objective with a single minimum.
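To make this concrete, here is a quick numpy sketch using the five data points from the table above and a candidate line $\hat{y} = 2x$ (chosen by eye, just for illustration):

```python
import numpy as np

# The five data points from the table above
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.8, 6.2, 7.9, 10.1])

# A candidate line chosen by eye: y-hat = 2x
y_hat = 2 * x

residuals = y - y_hat          # gap between actual and predicted
sse = np.sum(residuals**2)     # sum of squared residuals

print(np.round(residuals, 2))  # [ 0.1 -0.2  0.2 -0.1  0.1]
print(f"{sse:.2f}")            # 0.11
```

Note how the one residual of size 0.2 contributes four times as much to the objective as a residual of size 0.1: squaring penalizes larger errors disproportionately.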

Residuals: the gaps between data and the fitted line

```mermaid
graph TD
  D1["Data point 1"] --- R1["Residual: actual - predicted"]
  D2["Data point 2"] --- R2["Residual: actual - predicted"]
  D3["Data point 3"] --- R3["Residual: actual - predicted"]
  R1 --- L["Fitted line: y-hat = w0 + w1 x"]
  R2 --- L
  R3 --- L
  L --- OBJ["Minimize sum of squared residuals"]
  style OBJ fill:#ccffcc
```

Why least squares and not something else?

```mermaid
graph LR
  A["Sum of absolute errors"] -->|"Not smooth at zero"| D["Harder to optimize"]
  B["Sum of squared errors"] -->|"Smooth, convex, one minimum"| E["Clean closed-form solution"]
  C["Maximum error"] -->|"Ignores most data"| F["Sensitive to outliers"]
  style B fill:#ccffcc
  style E fill:#ccffcc
```

Least squares fit: the best-fit line minimizes the total squared distance (dashed orange) from each data point to the line

Least squares has a unique advantage: the objective is a smooth quadratic in the weights, which means we can solve for the optimum exactly using calculus. No iterative algorithm needed.

Prerequisites

You should be comfortable with matrix multiplication and transposes, gradients of functions of a vector, and the first-order optimality and convexity results from the earlier articles in this series.

The least squares problem

Given a data matrix $X \in \mathbb{R}^{n \times d}$ (with $n$ samples and $d$ features) and a target vector $\mathbf{y} \in \mathbb{R}^n$, find the weight vector $\mathbf{w} \in \mathbb{R}^d$ that minimizes:

$$f(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2 = (X\mathbf{w} - \mathbf{y})^T(X\mathbf{w} - \mathbf{y})$$

This is the sum of squared residuals. Each residual $r_i = \mathbf{x}_i^T\mathbf{w} - y_i$ measures how far the prediction is from the true value, and we want to make the total squared error as small as possible.
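As a quick sanity check (a minimal sketch with randomly generated data), the norm form, the inner-product form, and the plain sum of squared residuals all agree numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # 6 samples, 3 features (arbitrary)
y = rng.normal(size=6)
w = rng.normal(size=3)

r = X @ w - y                      # residual vector
f_norm = np.linalg.norm(r) ** 2    # ||Xw - y||^2
f_inner = r @ r                    # (Xw - y)^T (Xw - y)
f_sum = np.sum(r**2)               # sum over the r_i^2

assert np.isclose(f_norm, f_inner) and np.isclose(f_inner, f_sum)
```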

Deriving the normal equations

Step 1. Expand the objective:

$$f(\mathbf{w}) = \mathbf{w}^TX^TX\mathbf{w} - 2\mathbf{y}^TX\mathbf{w} + \mathbf{y}^T\mathbf{y}$$

Step 2. Take the gradient with respect to $\mathbf{w}$:

$$\nabla f(\mathbf{w}) = 2X^TX\mathbf{w} - 2X^T\mathbf{y}$$

Step 3. Apply the first-order condition. Set the gradient to zero:

$$2X^TX\mathbf{w} - 2X^T\mathbf{y} = \mathbf{0}$$

$$X^TX\mathbf{w} = X^T\mathbf{y}$$

These are the normal equations. If $X^TX$ is invertible, the unique solution is:

$$\boxed{\mathbf{w}^* = (X^TX)^{-1}X^T\mathbf{y}}$$
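Applied to the five data points from the introduction, the closed-form formula gives the fit in a few lines (a small numpy sketch; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.8, 6.2, 7.9, 10.1])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x, dtype=float), x])

# Normal equations: solve (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(f"w0 = {w[0]:.3f}, w1 = {w[1]:.3f}")  # w0 = -0.010, w1 = 2.010
```

The fitted slope is close to 2, matching the roughly linear pattern we saw in the table.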

How the normal equations are derived

```mermaid
flowchart TD
  A["Objective: minimize squared residuals"] --> B["Expand into matrix form"]
  B --> C["Take gradient w.r.t. w"]
  C --> D["Set gradient to zero"]
  D --> E["Normal equations: X-transpose X w = X-transpose y"]
  E --> F{"Is X-transpose X invertible?"}
  F -->|"Yes"| G["Unique solution: w = inv times X-transpose y"]
  F -->|"No"| H["Need regularization or pseudoinverse"]
```

Step 4. Verify this is a minimum (not a maximum or saddle point). The Hessian is:

$$\nabla^2 f(\mathbf{w}) = 2X^TX$$

Since $X^TX$ is positive semidefinite (as we showed in the convexity article), the objective is convex. When $X^TX$ is invertible, it is positive definite, so the solution is the unique global minimum.
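You can check the positive-semidefinite claim numerically: for any real matrix $X$, the eigenvalues of $X^TX$ are nonnegative (a quick sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))   # arbitrary data matrix

# eigvalsh is the eigenvalue routine for symmetric matrices
eigvals = np.linalg.eigvalsh(X.T @ X)
print(np.all(eigvals >= -1e-12))   # True: X^T X is positive semidefinite
```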

Example 1: Solving a 3-point linear regression by hand

Data points: $(1, 2)$, $(2, 4)$, $(3, 5)$. We fit $\hat{y} = w_0 + w_1 x$.

Step 1. Build the design matrix with a column of ones for the intercept:

$$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 2 \\ 4 \\ 5 \end{bmatrix}$$

Step 2. Compute $X^TX$:

$$X^TX = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix}$$

Step 3. Compute $X^T\mathbf{y}$:

$$X^T\mathbf{y} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 2 \\ 4 \\ 5 \end{bmatrix} = \begin{bmatrix} 11 \\ 25 \end{bmatrix}$$

Step 4. Solve $X^TX\mathbf{w} = X^T\mathbf{y}$:

$$\begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = \begin{bmatrix} 11 \\ 25 \end{bmatrix}$$

The inverse of a $2 \times 2$ matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$ is $\frac{1}{ad-bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$.

$$\det = 3 \cdot 14 - 6 \cdot 6 = 42 - 36 = 6$$

$$(X^TX)^{-1} = \frac{1}{6}\begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix} = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \end{bmatrix}$$

Step 5. Multiply:

$$\mathbf{w}^* = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \end{bmatrix} \begin{bmatrix} 11 \\ 25 \end{bmatrix} = \begin{bmatrix} 77/3 - 25 \\ -11 + 25/2 \end{bmatrix} = \begin{bmatrix} 2/3 \\ 3/2 \end{bmatrix}$$

The best-fit line is $\hat{y} = \frac{2}{3} + \frac{3}{2}x \approx 0.667 + 1.5x$.

Step 6. Check predictions:

$$\hat{y}(1) = 0.667 + 1.5 = 2.167, \quad \hat{y}(2) = 0.667 + 3.0 = 3.667, \quad \hat{y}(3) = 0.667 + 4.5 = 5.167$$

Residuals: $2 - 2.167 = -0.167$, $4 - 3.667 = 0.333$, $5 - 5.167 = -0.167$.

Sum of squared residuals: $0.028 + 0.111 + 0.028 = 0.167$.

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 4, 5])

w = np.linalg.solve(X.T @ X, X.T @ y)
print(f"w0 = {w[0]:.4f}, w1 = {w[1]:.4f}")
# w0 = 0.6667, w1 = 1.5000

residuals = X @ w - y
sse = residuals @ residuals
print(f"Sum of squared residuals: {sse:.4f}")
# Sum of squared residuals: 0.1667
```

Example 2: Comparing with scikit-learn

Let us verify our hand calculation using scikit-learn:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

x = np.array([1, 2, 3]).reshape(-1, 1)
y = np.array([2, 4, 5])

model = LinearRegression().fit(x, y)
print(f"Intercept: {model.intercept_:.4f}")   # 0.6667
print(f"Slope: {model.coef_[0]:.4f}")         # 1.5000
print(f"Predictions: {model.predict(x)}")
# approximately [2.1667, 3.6667, 5.1667]
```

Under the hood, scikit-learn solves the same least squares problem, but via a numerically stable SVD rather than by explicitly forming and inverting $X^TX$.

Geometric interpretation

The normal equations have a beautiful geometric meaning. The prediction vector $X\mathbf{w}$ lives in the column space of $X$, which is a subspace of $\mathbb{R}^n$. The normal equations find the $\mathbf{w}$ such that $X\mathbf{w}$ is the closest point in the column space to $\mathbf{y}$.

The residual vector $\mathbf{r} = \mathbf{y} - X\mathbf{w}^*$ is orthogonal to the column space of $X$:

$$X^T(\mathbf{y} - X\mathbf{w}^*) = X^T\mathbf{y} - X^TX\mathbf{w}^* = \mathbf{0}$$

This is a projection. The fitted values $\hat{\mathbf{y}} = X\mathbf{w}^*$ are the orthogonal projection of $\mathbf{y}$ onto the column space of $X$. The dot product of the residual with any column of $X$ is zero.

The projection matrix is:

$$P = X(X^TX)^{-1}X^T$$

and $\hat{\mathbf{y}} = P\mathbf{y}$.
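Using the design matrix from Example 1, we can verify numerically that $P$ behaves like a projection (idempotent: $P^2 = P$) and that the residual is orthogonal to every column of $X$ (a small sketch; in practice you would not form $P$ explicitly for large $n$):

```python
import numpy as np

X = np.array([[1.0, 1], [1, 2], [1, 3]])
y = np.array([2.0, 4, 5])

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto col(X)
y_hat = P @ y

print(np.allclose(P @ P, P))              # True: P is idempotent
print(np.allclose(X.T @ (y - y_hat), 0))  # True: residual is orthogonal to col(X)
print(np.round(y_hat, 3))                 # [2.167 3.667 5.167], as in Example 1
```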

```mermaid
graph TD
  Y["y (target vector)"] -->|"Project onto column space"| YHat["ŷ = Xw* (prediction)"]
  Y -->|"Residual r = y - ŷ"| R["r ⊥ column space of X"]
```

When the normal equations fail

The formula $\mathbf{w}^* = (X^TX)^{-1}X^T\mathbf{y}$ requires $X^TX$ to be invertible. This fails when:

1. More features than samples ($d > n$)

If you have 100 features but only 50 data points, $X^TX$ is at most rank 50 (the rank of $X$). A $100 \times 100$ matrix with rank 50 is singular. There are infinitely many solutions, and the system is underdetermined.

2. Multicollinearity

If some columns of $X$ are linearly dependent (e.g., you include both temperature in Celsius and Fahrenheit), $X^TX$ is singular. Even near-collinearity (columns that are almost but not quite dependent) makes $X^TX$ nearly singular, leading to huge and unstable parameter estimates.
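The Celsius/Fahrenheit case can be demonstrated directly (a small sketch with made-up temperatures): the Fahrenheit column is an exact linear combination of the intercept and Celsius columns, so $X$ loses rank and $X^TX$ is singular:

```python
import numpy as np

celsius = np.array([0.0, 10, 20, 30])
fahrenheit = 1.8 * celsius + 32   # exact linear function of celsius

# Intercept, Celsius, and Fahrenheit columns
X = np.column_stack([np.ones(4), celsius, fahrenheit])

print(np.linalg.matrix_rank(X))        # 2, not 3: the columns are dependent
print(np.linalg.matrix_rank(X.T @ X))  # 2: X^T X is singular, not invertible
```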

3. Numerical issues

Even when $X^TX$ is technically invertible, a large condition number means small errors in the data are amplified into large errors in $\mathbf{w}^*$. This is why, in practice, software uses the SVD of $X$ rather than explicitly forming and inverting $X^TX$.
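One reason forming $X^TX$ is risky: it squares the condition number, $\kappa(X^TX) = \kappa(X)^2$, since the singular values of $X$ get squared. A quick sketch with two nearly collinear columns:

```python
import numpy as np

x = np.linspace(1, 2, 50)
X = np.column_stack([x, x + 1e-4])   # two nearly identical columns

kappa_X = np.linalg.cond(X)
kappa_XtX = np.linalg.cond(X.T @ X)

print(f"cond(X)     = {kappa_X:.2e}")
print(f"cond(X^T X) = {kappa_XtX:.2e}")   # roughly cond(X)^2: much worse
```

Working with $X$ directly (QR or SVD) avoids this squaring entirely.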

Example 3: Ridge regression fixes the problem

Ridge regression (also called Tikhonov regularization) adds an L2 penalty:

$$f(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|^2$$

where $\lambda > 0$ is the regularization parameter.

The gradient is:

$$\nabla f = 2X^TX\mathbf{w} - 2X^T\mathbf{y} + 2\lambda\mathbf{w}$$

Setting to zero:

$$(X^TX + \lambda I)\mathbf{w} = X^T\mathbf{y}$$

$$\boxed{\mathbf{w}^*_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^T\mathbf{y}}$$

The matrix $X^TX + \lambda I$ is always invertible for $\lambda > 0$, because adding $\lambda$ to every eigenvalue of $X^TX$ makes them all strictly positive.

Numerical example. Using the same data as Example 1, with $\lambda = 2$:

$$X^TX + 2I = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} + \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 5 & 6 \\ 6 & 16 \end{bmatrix}$$

$$\det = 5 \cdot 16 - 36 = 44$$

$$(X^TX + 2I)^{-1} = \frac{1}{44}\begin{bmatrix} 16 & -6 \\ -6 & 5 \end{bmatrix}$$

$$\mathbf{w}^*_{\text{ridge}} = \frac{1}{44}\begin{bmatrix} 16 & -6 \\ -6 & 5 \end{bmatrix}\begin{bmatrix} 11 \\ 25 \end{bmatrix} = \frac{1}{44}\begin{bmatrix} 176 - 150 \\ -66 + 125 \end{bmatrix} = \frac{1}{44}\begin{bmatrix} 26 \\ 59 \end{bmatrix} \approx \begin{bmatrix} 0.591 \\ 1.341 \end{bmatrix}$$

Compare with the unregularized solution: $w_0 = 0.667$, $w_1 = 1.500$.

Ridge regression shrinks the coefficients toward zero. The slope dropped from $1.5$ to $1.341$, and the intercept changed slightly. The bias-variance tradeoff tells us that this small increase in bias can lead to a larger decrease in variance, giving better predictions on new data.

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 4, 5])
lam = 2

w_ols = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(f"OLS:   w0={w_ols[0]:.3f}, w1={w_ols[1]:.3f}")
print(f"Ridge: w0={w_ridge[0]:.3f}, w1={w_ridge[1]:.3f}")
# OLS:   w0=0.667, w1=1.500
# Ridge: w0=0.591, w1=1.341
```

We can also compare using scikit-learn’s Ridge:

```python
from sklearn.linear_model import Ridge
import numpy as np

x = np.array([1, 2, 3]).reshape(-1, 1)
y = np.array([2, 4, 5])

# sklearn's Ridge minimizes ||y - Xw||^2 + alpha * ||w||^2,
# so alpha corresponds directly to our lambda. With fit_intercept=False
# and an explicit column of ones, the intercept is penalized too,
# matching our formulation above.
model = Ridge(alpha=2, fit_intercept=False).fit(
    np.column_stack([np.ones(3), x]), y
)
print(f"Coefficients: {model.coef_}")
# Coefficients: [0.59090909 1.34090909]
```

Computational cost

Solving the normal equations directly requires:

  • Computing $X^TX$: $O(nd^2)$
  • Inverting or factoring $X^TX$: $O(d^3)$

Total: $O(nd^2 + d^3)$.

This is fast when $d$ is small (a few hundred features). When $d$ is large, the $d^3$ term dominates and becomes impractical. Alternatives include:

| Method | Cost | When to use |
| --- | --- | --- |
| Normal equations | $O(nd^2 + d^3)$ | Small $d$ |
| QR decomposition | $O(nd^2)$ | Medium $d$, better numerics |
| SVD | $O(nd^2)$ | When $X^TX$ is ill-conditioned |
| Gradient descent | $O(nd)$ per iteration | Large $d$ and $n$ |

For very large datasets, iterative methods like gradient descent or SGD are preferred because they never need to form or invert a $d \times d$ matrix.
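As a preview of the next article, here is a minimal gradient descent sketch on the Example 1 data (the step size 0.02 and iteration count are hand-picked for this tiny problem; choosing them properly is covered later in the series). It reaches the same answer as the closed-form solution without ever forming an inverse:

```python
import numpy as np

X = np.array([[1.0, 1], [1, 2], [1, 3]])
y = np.array([2.0, 4, 5])

w = np.zeros(2)
lr = 0.02                          # step size, hand-tuned for this problem
for _ in range(5000):
    grad = 2 * X.T @ (X @ w - y)   # gradient of ||Xw - y||^2
    w -= lr * grad

print(np.round(w, 4))  # converges to [0.6667, 1.5], the closed-form solution
```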

Closed-form vs iterative: when to use which

```mermaid
graph TD
  A["Least squares problem"] --> B{"How large is d?"}
  B -->|"Small: d less than a few hundred"| C["Normal equations: exact, fast"]
  B -->|"Medium"| D["QR or SVD: stable, exact"]
  B -->|"Large: d in thousands or more"| E["Gradient descent or SGD: scalable"]
  C --> F["One-shot computation"]
  D --> F
  E --> G["Iterative: many small steps"]
  style C fill:#ccffcc
  style E fill:#fff3cd
```

Connection to other topics

PCA and least squares

Principal component analysis (PCA) can be viewed as finding the best low-rank approximation to the data matrix, which is itself a least squares problem. The principal components are directions that minimize the sum of squared distances from the data to the subspace.

Weighted least squares

Sometimes different data points have different importance. Weighted least squares minimizes:

$$f(\mathbf{w}) = (X\mathbf{w} - \mathbf{y})^T W (X\mathbf{w} - \mathbf{y})$$

where $W$ is a diagonal weight matrix. The solution is:

$$\mathbf{w}^* = (X^TWX)^{-1}X^TW\mathbf{y}$$
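A minimal sketch on the Example 1 data (the specific weights are made up for illustration): with all weights equal to one, weighted least squares reduces to ordinary least squares, while upweighting one point pulls the fit toward it.

```python
import numpy as np

X = np.array([[1.0, 1], [1, 2], [1, 3]])
y = np.array([2.0, 4, 5])

def wls(X, y, weights):
    """Solve the weighted normal equations (X^T W X) w = X^T W y."""
    W = np.diag(weights)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print(np.round(wls(X, y, np.array([1.0, 1, 1])), 4))   # equal weights: OLS answer
print(np.round(wls(X, y, np.array([1.0, 10, 1])), 4))  # upweight the middle point
```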

From closed form to iterative

Least squares is the last problem in this series where we can write down the answer directly. From here on, we need iterative algorithms. Gradient descent (steepest descent) is the simplest and most important, and it is the subject of the next article.

Summary

| Concept | Key point |
| --- | --- |
| Normal equations | $X^TX\mathbf{w} = X^T\mathbf{y}$ |
| Closed-form solution | $\mathbf{w}^* = (X^TX)^{-1}X^T\mathbf{y}$ |
| Geometric view | Project $\mathbf{y}$ onto the column space of $X$ |
| Failure cases | $d > n$, multicollinearity, ill-conditioning |
| Ridge fix | $(X^TX + \lambda I)^{-1}X^T\mathbf{y}$, always invertible |
| Cost | $O(nd^2 + d^3)$; use iterative methods for large $d$ |

Least squares is one of the most important results in all of applied mathematics. It connects statistics (linear regression), geometry (orthogonal projection), and optimization (convex minimization with a closed-form solution) into one elegant framework.

What comes next

The next article introduces steepest descent (gradient descent), the workhorse iterative algorithm for when a closed-form solution does not exist.
