Maths for ML · Part 6

Matrix Decompositions: LU, QR, SVD

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor Series and Local Approximations
  12. Probability Fundamentals
  13. Random Variables and Distributions
  14. Bayes' Theorem and Its Role in ML
  15. Information Theory: Entropy, KL Divergence, Cross-Entropy

Every matrix is hiding simpler structure inside it. Decompositions are how we reveal that structure. Instead of working with one complicated matrix, we split it into two or three factors that are easier to reason about, easier to compute with, and often more numerically stable.

This article covers three decompositions: LU, QR, and SVD. Each solves a different class of problem, and each shows up constantly in machine learning.

Prerequisites: You should be comfortable with eigenvalues and eigenvectors, matrix multiplication, and dot products.


Why decompose matrices?

Solving Ax = b directly by computing A^{-1} is expensive and numerically fragile. Decompositions give us better alternatives: factor A into pieces with known structure (triangular, orthogonal, diagonal), then solve each piece cheaply.

Beyond solving linear systems, decompositions let us:

  • Understand the geometry a matrix encodes (SVD)
  • Solve least squares problems efficiently (QR)
  • Speed up repeated solves with different right-hand sides (LU)
  • Reduce dimensionality in data (PCA relies directly on SVD)

LU Decomposition

LU factors a square matrix A into a lower triangular matrix L and an upper triangular matrix U:

A = LU

L has ones on its diagonal and multipliers below. U is what you get after Gaussian elimination. If you have ever row-reduced a matrix by hand, you already know how to compute LU. The elimination steps give you U, and the multipliers you used become the entries of L.

When to use LU

LU is the workhorse for solving Ax = b when A is square. Once you have L and U, solving becomes two fast triangular solves:

  1. Solve Ly = b (forward substitution)
  2. Solve Ux = y (back substitution)

Both steps are O(n^2), while the factorization itself is O(n^3). If you need to solve Ax = b for many different b vectors, you factor once and reuse.
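This factor-once, solve-many pattern is exactly what scipy.linalg.lu_factor and lu_solve expose. A minimal sketch (the right-hand sides b1 and b2 are made up for illustration):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Factor once (O(n^3)); each subsequent solve is just two
# triangular substitutions (O(n^2)).
A = np.array([[2, 1, 1], [4, 3, 3], [8, 7, 9]], dtype=float)
lu_piv = lu_factor(A)  # combined L and U factors plus pivot indices

# Reuse the same factorization for several right-hand sides.
b1 = np.array([4.0, 10.0, 24.0])
b2 = np.array([1.0, 0.0, 0.0])
x1 = lu_solve(lu_piv, b1)  # x1 = [1, 1, 1]
x2 = lu_solve(lu_piv, b2)

print(np.allclose(A @ x1, b1))  # True
print(np.allclose(A @ x2, b2))  # True
```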

Example 1: LU decomposition of a 3x3 matrix

Decompose A into LU:

A = \begin{bmatrix} 2 & 1 & 1 \\ 4 & 3 & 3 \\ 8 & 7 & 9 \end{bmatrix}

Step 1: Eliminate below the first pivot (2).

Row 2 multiplier: \ell_{21} = 4 / 2 = 2

R_2 \leftarrow R_2 - 2 R_1 = [4 - 4,\; 3 - 2,\; 3 - 2] = [0, 1, 1]

Row 3 multiplier: \ell_{31} = 8 / 2 = 4

R_3 \leftarrow R_3 - 4 R_1 = [8 - 8,\; 7 - 4,\; 9 - 4] = [0, 3, 5]

Step 2: Eliminate below the second pivot (1).

Row 3 multiplier: \ell_{32} = 3 / 1 = 3

R_3 \leftarrow R_3 - 3 R_2 = [0 - 0,\; 3 - 3,\; 5 - 3] = [0, 0, 2]

Step 3: Assemble L and U.

L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 4 & 3 & 1 \end{bmatrix}, \quad U = \begin{bmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 2 \end{bmatrix}

Verify: Multiply LU:

LU = \begin{bmatrix} 1 \cdot 2 & 1 \cdot 1 & 1 \cdot 1 \\ 2 \cdot 2 + 1 \cdot 0 & 2 \cdot 1 + 1 \cdot 1 & 2 \cdot 1 + 1 \cdot 1 \\ 4 \cdot 2 + 3 \cdot 0 & 4 \cdot 1 + 3 \cdot 1 & 4 \cdot 1 + 3 \cdot 1 + 1 \cdot 2 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 1 \\ 4 & 3 & 3 \\ 8 & 7 & 9 \end{bmatrix} \; \checkmark

In practice, most implementations use partial pivoting (PA = LU) to avoid numerical issues with small pivots. SciPy's scipy.linalg.lu does this automatically (note that SciPy returns P, L, U such that A = PLU).

import numpy as np
from scipy.linalg import lu

A = np.array([[2, 1, 1], [4, 3, 3], [8, 7, 9]], dtype=float)
P, L, U = lu(A)

print("L =\n", L)
print("U =\n", U)

QR Decomposition

QR factors a matrix A (not necessarily square) into an orthogonal matrix Q and an upper triangular matrix R:

A = QR

Q has orthonormal columns, meaning Q^T Q = I. This orthogonality is what makes QR numerically stable.

When to use QR

QR is the standard approach for solving least squares problems. When you want to minimize \|Ax - b\|^2 and A is tall (more rows than columns), QR gives you the answer without forming A^T A (which can lose precision). It is also the backbone of the QR algorithm for computing eigenvalues.
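As a sketch of that approach: once you have A = QR, the least squares solution comes from the small triangular system Rx = Q^T b (the matrix and right-hand side below are made up for illustration):

```python
import numpy as np

# Least squares via QR: minimize ||Ax - b||^2 without forming A^T A.
A = np.array([[1, 1], [1, 0], [0, 1]], dtype=float)  # tall: 3 rows, 2 cols
b = np.array([1.0, 2.0, 3.0])

Q, R = np.linalg.qr(A)           # reduced QR: Q is 3x2, R is 2x2
x = np.linalg.solve(R, Q.T @ b)  # one small triangular solve

# Matches NumPy's reference least squares solver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_ref))  # True
```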

How it works: Gram-Schmidt

The classical way to build Q is the Gram-Schmidt process. You take the columns of A and orthogonalize them one by one: subtract the projection onto all previous basis vectors, then normalize.
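A toy implementation makes the process concrete. This is classical Gram-Schmidt for intuition only; library routines like np.linalg.qr use Householder reflections, which are more stable (the function name here is made up):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR for a matrix with independent columns.
    A sketch for intuition, not production use."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].astype(float)
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]  # projection coefficient
            v = v - R[i, j] * Q[:, i]    # subtract projection onto e_i
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]            # normalize
    return Q, R

A = np.array([[1, 1], [1, 0], [0, 1]], dtype=float)
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A))             # True: QR reproduces A
print(np.allclose(Q.T @ Q, np.eye(2)))   # True: columns are orthonormal
```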

Example 2: QR decomposition of a 3x2 matrix

Decompose:

A = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}

Step 1: Take the first column \mathbf{a}_1 = [1, 1, 0]^T.

\|\mathbf{a}_1\| = \sqrt{1^2 + 1^2 + 0^2} = \sqrt{2}

\mathbf{e}_1 = \frac{\mathbf{a}_1}{\|\mathbf{a}_1\|} = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \\ 0 \end{bmatrix}

Step 2: Take the second column \mathbf{a}_2 = [1, 0, 1]^T. Subtract its projection onto \mathbf{e}_1.

\text{proj} = (\mathbf{a}_2 \cdot \mathbf{e}_1)\,\mathbf{e}_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \\ 0 \end{bmatrix} = \begin{bmatrix} 1/2 \\ 1/2 \\ 0 \end{bmatrix}

\mathbf{a}_2' = \mathbf{a}_2 - \text{proj} = \begin{bmatrix} 1/2 \\ -1/2 \\ 1 \end{bmatrix}

\|\mathbf{a}_2'\| = \sqrt{1/4 + 1/4 + 1} = \sqrt{3/2} = \frac{\sqrt{6}}{2}

\mathbf{e}_2 = \frac{\mathbf{a}_2'}{\|\mathbf{a}_2'\|} = \begin{bmatrix} 1/\sqrt{6} \\ -1/\sqrt{6} \\ 2/\sqrt{6} \end{bmatrix}

Step 3: Assemble Q and R.

Q = \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{6} \\ 1/\sqrt{2} & -1/\sqrt{6} \\ 0 & 2/\sqrt{6} \end{bmatrix}

R comes from the projections we computed:

R = \begin{bmatrix} \sqrt{2} & 1/\sqrt{2} \\ 0 & \sqrt{6}/2 \end{bmatrix}

Verify: QR reproduces A:

QR = \begin{bmatrix} (1/\sqrt{2})(\sqrt{2}) + 0 & (1/\sqrt{2})(1/\sqrt{2}) + (1/\sqrt{6})(\sqrt{6}/2) \\ (1/\sqrt{2})(\sqrt{2}) + 0 & (1/\sqrt{2})(1/\sqrt{2}) + (-1/\sqrt{6})(\sqrt{6}/2) \\ 0 & (2/\sqrt{6})(\sqrt{6}/2) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \; \checkmark

import numpy as np

A = np.array([[1, 1], [1, 0], [0, 1]], dtype=float)
Q, R = np.linalg.qr(A)

print("Q =\n", Q)
print("R =\n", R)
print("QR =\n", Q @ R)  # Should match A

Singular Value Decomposition (SVD)

SVD is the most important decomposition in data science. Every matrix, regardless of shape, has an SVD. It factors A (an m \times n matrix) into:

A = U \Sigma V^T

  • U is m \times m orthogonal. Its columns are the left singular vectors.
  • \Sigma is m \times n diagonal. The diagonal entries \sigma_1 \geq \sigma_2 \geq \cdots \geq 0 are the singular values.
  • V is n \times n orthogonal. Its columns are the right singular vectors.

SVD structure with matrix shapes:

graph LR
  A["A<br/>m x n"] === U["U<br/>m x m<br/>orthogonal"]
  A === S["Sigma<br/>m x n<br/>diagonal"]
  A === VT["V^T<br/>n x n<br/>orthogonal"]
  style A fill:#fff9c4
  style U fill:#e1f5fe
  style S fill:#c8e6c9
  style VT fill:#e1f5fe

What SVD tells you

Think of any linear transformation A as three steps: rotate (V^T), scale (\Sigma), rotate again (U). SVD separates these steps. The singular values tell you how much stretching happens along each direction. Large singular values correspond to directions where the data has high variance.

SVD as three geometric steps:

graph LR
  X["Input vector x"] --> VT["V^T: Rotate<br/>Align to singular<br/>vector basis"]
  VT --> SIG["Sigma: Scale<br/>Stretch or shrink<br/>along each axis"]
  SIG --> U["U: Rotate<br/>Map to output<br/>basis"]
  U --> Y["Output Ax"]
  style VT fill:#e1f5fe
  style SIG fill:#c8e6c9
  style U fill:#e1f5fe

Singular value scree plot: the rapid drop-off shows most information is in the first few components

When to use SVD

  • Dimensionality reduction: Keep the top k singular values and you get the best rank-k approximation to A.
  • PCA: Principal Component Analysis is just SVD applied to the centered data matrix. The right singular vectors V are the principal components.
  • Pseudoinverse: The Moore-Penrose pseudoinverse A^+ is computed via SVD.
  • Noise reduction: Small singular values often correspond to noise. Truncating them denoises the data.
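To make the pseudoinverse point concrete, here is a sketch of the SVD route: invert only the nonzero singular values, then reassemble (the helper name and tolerance are made up; np.linalg.pinv does essentially this internally):

```python
import numpy as np

def pinv_via_svd(A, tol=1e-10):
    """Moore-Penrose pseudoinverse: A^+ = V Sigma^+ U^T,
    where Sigma^+ inverts only the nonzero singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.array([1.0 / x if x > tol else 0.0 for x in s])
    return Vt.T @ np.diag(s_inv) @ U.T

# A rank-deficient matrix: ordinary inversion would fail here.
A = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]], dtype=float)
print(np.allclose(pinv_via_svd(A), np.linalg.pinv(A)))  # True
```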

SVD and PCA: the connection

If X is your n \times d data matrix (centered, so each column has mean zero), then:

X = U \Sigma V^T

The principal components are the columns of V. The variance explained by the i-th component is \sigma_i^2 / (n - 1). You never need to form the covariance matrix X^T X explicitly. SVD gives you the same information, more stably.
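A quick numerical check of this equivalence, on made-up random data:

```python
import numpy as np

# PCA via SVD of the centered data matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.1])  # anisotropic data
Xc = X - X.mean(axis=0)  # center each column
n = Xc.shape[0]

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                  # rows are the principal components
explained_var = s**2 / (n - 1)   # variance along each component

# The covariance-matrix route gives the same variances (descending order).
cov_eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (n - 1))[::-1]
print(np.allclose(explained_var, cov_eigvals))  # True
```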

Example 3: SVD of a 2x2 matrix

Compute the SVD of:

A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}

Step 1: Compute A^T A.

A^T A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 5 & 4 \\ 4 & 5 \end{bmatrix}

Step 2: Find eigenvalues of A^T A.

\det(A^T A - \lambda I) = (5 - \lambda)^2 - 16 = \lambda^2 - 10\lambda + 9 = 0

\lambda_1 = 9, \quad \lambda_2 = 1

The singular values are \sigma_1 = \sqrt{9} = 3 and \sigma_2 = \sqrt{1} = 1.

Step 3: Find the right singular vectors (eigenvectors of A^T A).

For \lambda_1 = 9: solve (A^T A - 9I)\mathbf{v} = 0:

\begin{bmatrix} -4 & 4 \\ 4 & -4 \end{bmatrix}\mathbf{v} = 0 \implies v_1 = v_2 \implies \mathbf{v}_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}

For \lambda_2 = 1: solve (A^T A - I)\mathbf{v} = 0:

\begin{bmatrix} 4 & 4 \\ 4 & 4 \end{bmatrix}\mathbf{v} = 0 \implies v_1 = -v_2 \implies \mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}

Step 4: Find the left singular vectors: \mathbf{u}_i = \frac{1}{\sigma_i} A \mathbf{v}_i.

\mathbf{u}_1 = \frac{1}{3}\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \frac{1}{3\sqrt{2}}\begin{bmatrix} 3 \\ 3 \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}

\mathbf{u}_2 = \frac{1}{1}\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}

Step 5: Assemble the SVD.

U = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}, \quad V^T = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}

Interpretation: The matrix A stretches by a factor of 3 along the [1, 1]^T / \sqrt{2} direction and by a factor of 1 along [1, -1]^T / \sqrt{2}. If this were a data covariance matrix, the first direction captures 9 times more variance than the second.

import numpy as np

A = np.array([[2, 1], [1, 2]], dtype=float)
U, s, Vt = np.linalg.svd(A)

print("U =\n", U)
print("Singular values:", s)
print("V^T =\n", Vt)
print("Reconstructed A =\n", U @ np.diag(s) @ Vt)

Example 4: Low-rank approximation with SVD

SVD gives you the best rank-k approximation to any matrix (in terms of Frobenius norm). This is the Eckart-Young theorem.

Take a 3 \times 3 matrix:

A = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}

Computing SVD (you can verify with NumPy), the singular values are \sigma_1 = 2, \sigma_2 = 1, \sigma_3 = 0.

The fact that \sigma_3 = 0 tells us A is already rank 2. To get the best rank-1 approximation, keep only \sigma_1:

A_1 = \sigma_1 \mathbf{u}_1 \mathbf{v}_1^T = 2 \cdot \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \cdot \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 1 \end{bmatrix}

This rank-1 matrix captures the dominant pattern in A. The second singular value (\sigma_2 = 1) captures the remaining structure in the second row/column.
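The same numbers fall out of NumPy, as a check on Example 4:

```python
import numpy as np

A = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A)
# Singular values are approximately [2, 1, 0]: A has rank 2.

# Best rank-1 approximation: keep only the top singular triplet.
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.allclose(A1, [[1, 0, 1], [0, 0, 0], [1, 0, 1]]))  # True
```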


Choosing the right decomposition

Decomposition | Requirements  | Best for                                      | Cost
LU            | Square matrix | Solving Ax = b repeatedly                     | O(n^3) factor, O(n^2) per solve
QR            | Any shape     | Least squares, eigenvalue algorithms          | O(mn^2) for m \times n
SVD           | Any shape     | Dimensionality reduction, pseudoinverse, rank | O(mn^2) for m \times n

Rules of thumb:

  • Need to solve Ax = b for a square A? Start with LU.
  • Need least squares (overdetermined system)? Use QR. It is more stable than forming the normal equations A^T A x = A^T b.
  • Need to understand the structure of your data, reduce dimensions, or handle rank-deficient matrices? Use SVD.

Choosing the right decomposition:

graph TD
  Q["What is your problem?"] --> SQ{"Square matrix<br/>solving Ax = b?"}
  SQ -->|"Yes"| LU["Use LU<br/>Fast repeated solves"]
  SQ -->|"No"| LS{"Least squares<br/>overdetermined system?"}
  LS -->|"Yes"| QR["Use QR<br/>Numerically stable"]
  LS -->|"No"| SVD["Use SVD<br/>Dimensionality reduction,<br/>rank analysis,<br/>pseudoinverse"]
  style LU fill:#e1f5fe
  style QR fill:#c8e6c9
  style SVD fill:#fff9c4

Key properties to remember

LU:

  • \det(A) = \det(L) \cdot \det(U) = \prod u_{ii} (product of the diagonal entries of U, since L has ones on the diagonal)
  • Not every matrix has an LU factorization without pivoting
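A quick check of the determinant identity, using the L and U from Example 1:

```python
import numpy as np

# det(A) = product of U's diagonal when L has a unit diagonal.
L = np.array([[1, 0, 0], [2, 1, 0], [4, 3, 1]], dtype=float)
U = np.array([[2, 1, 1], [0, 1, 1], [0, 0, 2]], dtype=float)
A = L @ U

print(np.prod(np.diag(U)))                # 4.0
print(np.isclose(np.linalg.det(A), 4.0))  # True
```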

QR:

  • Q^T Q = I, so Q^{-1} = Q^T (cheap to invert)
  • \|Qx\| = \|x\| for any x (orthogonal matrices preserve lengths)
  • Numerically stable for least squares

SVD:

  • Always exists for any matrix
  • Singular values are non-negative and conventionally sorted in decreasing order
  • The rank of AA equals the number of nonzero singular values
  • \|A\|_2 = \sigma_1 (the largest singular value is the spectral norm)
  • \|A\|_F = \sqrt{\sigma_1^2 + \sigma_2^2 + \cdots} (the Frobenius norm relates to all singular values)
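Both norm identities are easy to confirm numerically, here with the matrix from Example 3:

```python
import numpy as np

A = np.array([[2, 1], [1, 2]], dtype=float)
s = np.linalg.svd(A, compute_uv=False)  # singular values [3, 1]

# Spectral norm equals the largest singular value.
print(np.isclose(np.linalg.norm(A, 2), s[0]))                        # True
# Frobenius norm equals the root sum of squared singular values.
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2))))  # True
```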

What comes next

Now that you know how to decompose matrices, the next step is understanding how to measure vectors and matrices. Head to Norms, Distances, and Similarity to learn about L_1, L_2, and cosine similarity, all of which build on the ideas from SVD and orthogonality we covered here.
