
Least squares: the closed-form solution

In this series (18 parts)
  1. What is optimization and why ML needs it
  2. Convex sets and convex functions
  3. Optimality conditions: first order
  4. Optimality conditions: second order
  5. Line search methods
  6. Least squares: the closed-form solution
  7. Steepest descent (gradient descent)
  8. Newton's method for optimization
  9. Quasi-Newton methods: BFGS and L-BFGS
  10. Conjugate gradient methods
  11. Constrained optimization and Lagrangian duality
  12. KKT conditions
  13. Penalty and barrier methods
  14. Interior point methods
  15. The simplex method
  16. Frank-Wolfe method
  17. Optimization in dynamic programming and optimal control
  18. Stochastic gradient descent and variants

Most optimization problems require iterative algorithms. Least squares is different. You can write down the answer in one formula. This closed-form solution is why linear regression is fast, reliable, and one of the first models everyone learns. Understanding where it comes from, and when it breaks, gives you a solid foundation for everything else.

The data fitting problem

You have measured data. You want a line (or plane, or hyperplane) that fits it as closely as possible. Here are five data points:

| Observation | $x$ | $y$ (actual) |
| --- | --- | --- |
| 1 | 1 | 2.1 |
| 2 | 2 | 3.8 |
| 3 | 3 | 6.2 |
| 4 | 4 | 7.9 |
| 5 | 5 | 10.1 |

The pattern looks roughly linear. You want to find the line $\hat{y} = w_0 + w_1 x$ that best explains these points. But “best” needs a definition.

For each point, the residual is the gap between the actual $y$ and the predicted $\hat{y}$. Least squares says: find the line that makes the total squared residual as small as possible. Why squared? Squaring penalizes large errors more than small ones, and it gives a smooth, differentiable objective with a single minimum.
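To make this concrete, here is a quick numpy sketch using the five data points from the table above and a candidate line $\hat{y} = 2x$ (chosen by eye, just for illustration):

```python
import numpy as np

# The five data points from the table above
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.8, 6.2, 7.9, 10.1])

# A candidate line chosen by eye: y-hat = 2x
y_hat = 2 * x

residuals = y - y_hat          # gap between actual and predicted
sse = np.sum(residuals**2)     # sum of squared residuals

print(np.round(residuals, 2))  # [ 0.1 -0.2  0.2 -0.1  0.1]
print(f"{sse:.2f}")            # 0.11
```

Note how the one residual of size 0.2 contributes four times as much to the objective as a residual of size 0.1: squaring penalizes larger errors disproportionately.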

Residuals: the gaps between data and the fitted line

```mermaid
graph TD
  D1["Data point 1"] --- R1["Residual: actual - predicted"]
  D2["Data point 2"] --- R2["Residual: actual - predicted"]
  D3["Data point 3"] --- R3["Residual: actual - predicted"]
  R1 --- L["Fitted line: y-hat = w0 + w1 x"]
  R2 --- L
  R3 --- L
  L --- OBJ["Minimize sum of squared residuals"]
  style OBJ fill:#ccffcc
```

Why least squares and not something else?

```mermaid
graph LR
  A["Sum of absolute errors"] -->|"Not smooth at zero"| D["Harder to optimize"]
  B["Sum of squared errors"] -->|"Smooth, convex, one minimum"| E["Clean closed-form solution"]
  C["Maximum error"] -->|"Ignores most data"| F["Sensitive to outliers"]
  style B fill:#ccffcc
  style E fill:#ccffcc
```

Least squares fit: the best-fit line minimizes the total squared distance (dashed orange) from each data point to the line

Least squares has a unique advantage: the objective is a smooth quadratic in the weights, which means we can solve for the optimum exactly using calculus. No iterative algorithm needed.

Prerequisites

You should be comfortable with matrix multiplication and transposes, gradients of functions of a vector, and the first-order optimality and convexity results from the earlier articles in this series.

The least squares problem

Given a data matrix $X \in \mathbb{R}^{n \times d}$ (with $n$ samples and $d$ features) and a target vector $\mathbf{y} \in \mathbb{R}^n$, find the weight vector $\mathbf{w} \in \mathbb{R}^d$ that minimizes:

$$f(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2 = (X\mathbf{w} - \mathbf{y})^T(X\mathbf{w} - \mathbf{y})$$

This is the sum of squared residuals. Each residual $r_i = \mathbf{x}_i^T\mathbf{w} - y_i$ measures how far the prediction is from the true value, and we want to make the total squared error as small as possible.
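As a quick sanity check (a minimal sketch with randomly generated data), the norm form, the inner-product form, and the plain sum of squared residuals all agree numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # 6 samples, 3 features (arbitrary)
y = rng.normal(size=6)
w = rng.normal(size=3)

r = X @ w - y                      # residual vector
f_norm = np.linalg.norm(r) ** 2    # ||Xw - y||^2
f_inner = r @ r                    # (Xw - y)^T (Xw - y)
f_sum = np.sum(r**2)               # sum over the r_i^2

assert np.isclose(f_norm, f_inner) and np.isclose(f_inner, f_sum)
```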

Deriving the normal equations

Step 1. Expand the objective:

$$f(\mathbf{w}) = \mathbf{w}^TX^TX\mathbf{w} - 2\mathbf{y}^TX\mathbf{w} + \mathbf{y}^T\mathbf{y}$$

Step 2. Take the gradient with respect to $\mathbf{w}$:

$$\nabla f(\mathbf{w}) = 2X^TX\mathbf{w} - 2X^T\mathbf{y}$$

Step 3. Apply the first-order condition. Set the gradient to zero:

$$2X^TX\mathbf{w} - 2X^T\mathbf{y} = \mathbf{0}$$

$$X^TX\mathbf{w} = X^T\mathbf{y}$$

These are the normal equations. If $X^TX$ is invertible, the unique solution is:

$$\boxed{\mathbf{w}^* = (X^TX)^{-1}X^T\mathbf{y}}$$
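Applied to the five data points from the introduction, the closed-form formula gives the fit in a few lines (a small numpy sketch; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.8, 6.2, 7.9, 10.1])

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x, dtype=float), x])

# Normal equations: solve (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(f"w0 = {w[0]:.3f}, w1 = {w[1]:.3f}")  # w0 = -0.010, w1 = 2.010
```

The fitted slope is close to 2, matching the roughly linear pattern we saw in the table.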

How the normal equations are derived

```mermaid
flowchart TD
  A["Objective: minimize squared residuals"] --> B["Expand into matrix form"]
  B --> C["Take gradient w.r.t. w"]
  C --> D["Set gradient to zero"]
  D --> E["Normal equations: X-transpose X w = X-transpose y"]
  E --> F{"Is X-transpose X invertible?"}
  F -->|"Yes"| G["Unique solution: w = inv times X-transpose y"]
  F -->|"No"| H["Need regularization or pseudoinverse"]
```

Step 4. Verify this is a minimum (not a maximum or saddle point). The Hessian is:

$$\nabla^2 f(\mathbf{w}) = 2X^TX$$

Since $X^TX$ is positive semidefinite (as we showed in the convexity article), the objective is convex. When $X^TX$ is invertible, it is positive definite, so the solution is the unique global minimum.
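You can check the positive-semidefinite claim numerically: for any real matrix $X$, the eigenvalues of $X^TX$ are nonnegative (a quick sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))   # arbitrary data matrix

# eigvalsh is the eigenvalue routine for symmetric matrices
eigvals = np.linalg.eigvalsh(X.T @ X)
print(np.all(eigvals >= -1e-12))   # True: X^T X is positive semidefinite
```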

Example 1: Solving a 3-point linear regression by hand

Data points: $(1, 2)$, $(2, 4)$, $(3, 5)$. We fit $\hat{y} = w_0 + w_1 x$.

Step 1. Build the design matrix with a column of ones for the intercept:

$$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 2 \\ 4 \\ 5 \end{bmatrix}$$

Step 2. Compute $X^TX$:

$$X^TX = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix}$$

Step 3. Compute $X^T\mathbf{y}$:

$$X^T\mathbf{y} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 2 \\ 4 \\ 5 \end{bmatrix} = \begin{bmatrix} 11 \\ 25 \end{bmatrix}$$

Step 4. Solve $X^TX\mathbf{w} = X^T\mathbf{y}$:

$$\begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = \begin{bmatrix} 11 \\ 25 \end{bmatrix}$$

The inverse of a $2 \times 2$ matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$ is $\frac{1}{ad-bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$.

$$\det = 3 \cdot 14 - 6 \cdot 6 = 42 - 36 = 6$$

$$(X^TX)^{-1} = \frac{1}{6}\begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix} = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \end{bmatrix}$$

Step 5. Multiply:

$$\mathbf{w}^* = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \end{bmatrix} \begin{bmatrix} 11 \\ 25 \end{bmatrix} = \begin{bmatrix} 77/3 - 25 \\ -11 + 25/2 \end{bmatrix} = \begin{bmatrix} 2/3 \\ 3/2 \end{bmatrix}$$

The best-fit line is $\hat{y} = \frac{2}{3} + \frac{3}{2}x \approx 0.667 + 1.5x$.

Step 6. Check predictions:

$$\hat{y}(1) = 0.667 + 1.5 = 2.167, \quad \hat{y}(2) = 0.667 + 3.0 = 3.667, \quad \hat{y}(3) = 0.667 + 4.5 = 5.167$$

Residuals: $2 - 2.167 = -0.167$, $4 - 3.667 = 0.333$, $5 - 5.167 = -0.167$.

Sum of squared residuals: $0.028 + 0.111 + 0.028 = 0.167$.

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 4, 5])

w = np.linalg.solve(X.T @ X, X.T @ y)
print(f"w0 = {w[0]:.4f}, w1 = {w[1]:.4f}")
# w0 = 0.6667, w1 = 1.5000

residuals = X @ w - y
sse = residuals @ residuals
print(f"Sum of squared residuals: {sse:.4f}")
# Sum of squared residuals: 0.1667
```

Example 2: Comparing with scikit-learn

Let us verify our hand calculation using scikit-learn:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

x = np.array([1, 2, 3]).reshape(-1, 1)
y = np.array([2, 4, 5])

model = LinearRegression().fit(x, y)
print(f"Intercept: {model.intercept_:.4f}")   # 0.6667
print(f"Slope: {model.coef_[0]:.4f}")         # 1.5000
print(f"Predictions: {model.predict(x)}")
# approximately [2.1667, 3.6667, 5.1667]
```

Under the hood, scikit-learn solves the same least squares problem, but via a numerically stable SVD rather than by explicitly forming and inverting $X^TX$.

Geometric interpretation

The normal equations have a beautiful geometric meaning. The prediction vector $X\mathbf{w}$ lives in the column space of $X$, which is a subspace of $\mathbb{R}^n$. The normal equations find the $\mathbf{w}$ such that $X\mathbf{w}$ is the closest point in the column space to $\mathbf{y}$.

The residual vector $\mathbf{r} = \mathbf{y} - X\mathbf{w}^*$ is orthogonal to the column space of $X$:

$$X^T(\mathbf{y} - X\mathbf{w}^*) = X^T\mathbf{y} - X^TX\mathbf{w}^* = \mathbf{0}$$

This is a projection. The fitted values $\hat{\mathbf{y}} = X\mathbf{w}^*$ are the orthogonal projection of $\mathbf{y}$ onto the column space of $X$. The dot product of the residual with any column of $X$ is zero.

The projection matrix is:

$$P = X(X^TX)^{-1}X^T$$

and $\hat{\mathbf{y}} = P\mathbf{y}$.
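Using the design matrix from Example 1, we can verify numerically that $P$ behaves like a projection (idempotent: $P^2 = P$) and that the residual is orthogonal to every column of $X$ (a small sketch; in practice you would not form $P$ explicitly for large $n$):

```python
import numpy as np

X = np.array([[1.0, 1], [1, 2], [1, 3]])
y = np.array([2.0, 4, 5])

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto col(X)
y_hat = P @ y

print(np.allclose(P @ P, P))              # True: P is idempotent
print(np.allclose(X.T @ (y - y_hat), 0))  # True: residual is orthogonal to col(X)
print(np.round(y_hat, 3))                 # [2.167 3.667 5.167], as in Example 1
```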

```mermaid
graph TD
  Y["y (target vector)"] -->|"Project onto column space"| YHat["ŷ = Xw* (prediction)"]
  Y -->|"Residual r = y - ŷ"| R["r ⊥ column space of X"]
```

When the normal equations fail

The formula $\mathbf{w}^* = (X^TX)^{-1}X^T\mathbf{y}$ requires $X^TX$ to be invertible. This fails when:

1. More features than samples ($d > n$)

If you have 100 features but only 50 data points, $X^TX$ is at most rank 50 (the rank of $X$). A $100 \times 100$ matrix with rank 50 is singular. There are infinitely many solutions, and the system is underdetermined.

2. Multicollinearity

If some columns of $X$ are linearly dependent (e.g., you include both temperature in Celsius and Fahrenheit), $X^TX$ is singular. Even near-collinearity (columns that are almost but not quite dependent) makes $X^TX$ nearly singular, leading to huge and unstable parameter estimates.
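The Celsius/Fahrenheit case can be demonstrated directly (a small sketch with made-up temperatures): the Fahrenheit column is an exact linear combination of the intercept and Celsius columns, so $X$ loses rank and $X^TX$ is singular:

```python
import numpy as np

celsius = np.array([0.0, 10, 20, 30])
fahrenheit = 1.8 * celsius + 32   # exact linear function of celsius

# Intercept, Celsius, and Fahrenheit columns
X = np.column_stack([np.ones(4), celsius, fahrenheit])

print(np.linalg.matrix_rank(X))        # 2, not 3: the columns are dependent
print(np.linalg.matrix_rank(X.T @ X))  # 2: X^T X is singular, not invertible
```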

3. Numerical issues

Even when $X^TX$ is technically invertible, a large condition number means small errors in the data are amplified into large errors in $\mathbf{w}^*$. This is why, in practice, software uses the SVD of $X$ rather than explicitly forming and inverting $X^TX$.
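One reason forming $X^TX$ is risky: it squares the condition number, $\kappa(X^TX) = \kappa(X)^2$, since the singular values of $X$ get squared. A quick sketch with two nearly collinear columns:

```python
import numpy as np

x = np.linspace(1, 2, 50)
X = np.column_stack([x, x + 1e-4])   # two nearly identical columns

kappa_X = np.linalg.cond(X)
kappa_XtX = np.linalg.cond(X.T @ X)

print(f"cond(X)     = {kappa_X:.2e}")
print(f"cond(X^T X) = {kappa_XtX:.2e}")   # roughly cond(X)^2: much worse
```

Working with $X$ directly (QR or SVD) avoids this squaring entirely.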

Example 3: Ridge regression fixes the problem

Ridge regression (also called Tikhonov regularization) adds an L2 penalty:

$$f(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|^2$$

where $\lambda > 0$ is the regularization parameter.

The gradient is:

$$\nabla f = 2X^TX\mathbf{w} - 2X^T\mathbf{y} + 2\lambda\mathbf{w}$$

Setting to zero:

$$(X^TX + \lambda I)\mathbf{w} = X^T\mathbf{y}$$

$$\boxed{\mathbf{w}^*_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^T\mathbf{y}}$$

The matrix $X^TX + \lambda I$ is always invertible for $\lambda > 0$, because adding $\lambda$ to every eigenvalue of $X^TX$ makes them all strictly positive.

Numerical example. Using the same data as Example 1, with $\lambda = 2$:

$$X^TX + 2I = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} + \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 5 & 6 \\ 6 & 16 \end{bmatrix}$$

$$\det = 5 \cdot 16 - 36 = 44$$

$$(X^TX + 2I)^{-1} = \frac{1}{44}\begin{bmatrix} 16 & -6 \\ -6 & 5 \end{bmatrix}$$

$$\mathbf{w}^*_{\text{ridge}} = \frac{1}{44}\begin{bmatrix} 16 & -6 \\ -6 & 5 \end{bmatrix}\begin{bmatrix} 11 \\ 25 \end{bmatrix} = \frac{1}{44}\begin{bmatrix} 176 - 150 \\ -66 + 125 \end{bmatrix} = \frac{1}{44}\begin{bmatrix} 26 \\ 59 \end{bmatrix} \approx \begin{bmatrix} 0.591 \\ 1.341 \end{bmatrix}$$

Compare with the unregularized solution: $w_0 = 0.667$, $w_1 = 1.500$.

Ridge regression shrinks the coefficients toward zero. The slope dropped from $1.5$ to $1.341$, and the intercept changed slightly. The bias-variance tradeoff tells us that this small increase in bias can lead to a larger decrease in variance, giving better predictions on new data.

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 4, 5])
lam = 2

w_ols = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(f"OLS:   w0={w_ols[0]:.3f}, w1={w_ols[1]:.3f}")
print(f"Ridge: w0={w_ridge[0]:.3f}, w1={w_ridge[1]:.3f}")
# OLS:   w0=0.667, w1=1.500
# Ridge: w0=0.591, w1=1.341
```

We can also compare using scikit-learn’s Ridge:

```python
from sklearn.linear_model import Ridge
import numpy as np

x = np.array([1, 2, 3]).reshape(-1, 1)
y = np.array([2, 4, 5])

# sklearn's Ridge minimizes ||y - Xw||^2 + alpha * ||w||^2,
# so alpha corresponds directly to our lambda. With fit_intercept=False
# and an explicit column of ones, the intercept is penalized too,
# matching our formulation above.
model = Ridge(alpha=2, fit_intercept=False).fit(
    np.column_stack([np.ones(3), x]), y
)
print(f"Coefficients: {model.coef_}")
# Coefficients: [0.59090909 1.34090909]
```

Computational cost

Solving the normal equations directly requires:

  • Computing $X^TX$: $O(nd^2)$
  • Inverting or factoring $X^TX$: $O(d^3)$

Total: $O(nd^2 + d^3)$.

This is fast when $d$ is small (a few hundred features). When $d$ is large, the $d^3$ term dominates and becomes impractical. Alternatives include:

| Method | Cost | When to use |
| --- | --- | --- |
| Normal equations | $O(nd^2 + d^3)$ | Small $d$ |
| QR decomposition | $O(nd^2)$ | Medium $d$, better numerics |
| SVD | $O(nd^2)$ | When $X^TX$ is ill-conditioned |
| Gradient descent | $O(nd)$ per iteration | Large $d$ and $n$ |

For very large datasets, iterative methods like gradient descent or SGD are preferred because they never need to form or invert a $d \times d$ matrix.
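As a preview of the next article, here is a minimal gradient descent sketch on the Example 1 data (the step size 0.02 and iteration count are hand-picked for this tiny problem; choosing them properly is covered later in the series). It reaches the same answer as the closed-form solution without ever forming an inverse:

```python
import numpy as np

X = np.array([[1.0, 1], [1, 2], [1, 3]])
y = np.array([2.0, 4, 5])

w = np.zeros(2)
lr = 0.02                          # step size, hand-tuned for this problem
for _ in range(5000):
    grad = 2 * X.T @ (X @ w - y)   # gradient of ||Xw - y||^2
    w -= lr * grad

print(np.round(w, 4))  # converges to [0.6667, 1.5], the closed-form solution
```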

Closed-form vs iterative: when to use which

```mermaid
graph TD
  A["Least squares problem"] --> B{"How large is d?"}
  B -->|"Small: d less than a few hundred"| C["Normal equations: exact, fast"]
  B -->|"Medium"| D["QR or SVD: stable, exact"]
  B -->|"Large: d in thousands or more"| E["Gradient descent or SGD: scalable"]
  C --> F["One-shot computation"]
  D --> F
  E --> G["Iterative: many small steps"]
  style C fill:#ccffcc
  style E fill:#fff3cd
```

Connection to other topics

PCA and least squares

Principal component analysis (PCA) can be viewed as finding the best low-rank approximation to the data matrix, which is itself a least squares problem. The principal components are directions that minimize the sum of squared distances from the data to the subspace.

Weighted least squares

Sometimes different data points have different importance. Weighted least squares minimizes:

$$f(\mathbf{w}) = (X\mathbf{w} - \mathbf{y})^T W (X\mathbf{w} - \mathbf{y})$$

where $W$ is a diagonal weight matrix. The solution is:

$$\mathbf{w}^* = (X^TWX)^{-1}X^TW\mathbf{y}$$
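A minimal sketch on the Example 1 data (the specific weights are made up for illustration): with all weights equal to one, weighted least squares reduces to ordinary least squares, while upweighting one point pulls the fit toward it.

```python
import numpy as np

X = np.array([[1.0, 1], [1, 2], [1, 3]])
y = np.array([2.0, 4, 5])

def wls(X, y, weights):
    """Solve the weighted normal equations (X^T W X) w = X^T W y."""
    W = np.diag(weights)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print(np.round(wls(X, y, np.array([1.0, 1, 1])), 4))   # equal weights: OLS answer
print(np.round(wls(X, y, np.array([1.0, 10, 1])), 4))  # upweight the middle point
```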

From closed form to iterative

Least squares is the last problem in this series where we can write down the answer directly. From here on, we need iterative algorithms. Gradient descent (steepest descent) is the simplest and most important, and it is the subject of the next article.

Summary

| Concept | Key point |
| --- | --- |
| Normal equations | $X^TX\mathbf{w} = X^T\mathbf{y}$ |
| Closed-form solution | $\mathbf{w}^* = (X^TX)^{-1}X^T\mathbf{y}$ |
| Geometric view | Project $\mathbf{y}$ onto the column space of $X$ |
| Failure cases | $d > n$, multicollinearity, ill-conditioning |
| Ridge fix | $(X^TX + \lambda I)^{-1}X^T\mathbf{y}$, always invertible |
| Cost | $O(nd^2 + d^3)$; use iterative methods for large $d$ |

Least squares is one of the most important results in all of applied mathematics. It connects statistics (linear regression), geometry (orthogonal projection), and optimization (convex minimization with a closed-form solution) into one elegant framework.

What comes next

The next article introduces steepest descent (gradient descent), the workhorse iterative algorithm for when a closed-form solution does not exist.
