Maths for ML · Part 6

Matrix Decompositions: LU, QR, SVD

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor Series and Local Approximations
  12. Probability Fundamentals
  13. Random Variables and Distributions
  14. Bayes' Theorem and Its Role in ML
  15. Information Theory: Entropy, KL Divergence, Cross-Entropy

Every matrix is hiding simpler structure inside it. Decompositions are how we reveal that structure. Instead of working with one complicated matrix, we split it into two or three factors that are easier to reason about, easier to compute with, and often more numerically stable.

This article covers three decompositions: LU, QR, and SVD. Each solves a different class of problem, and each shows up constantly in machine learning.

Prerequisites: You should be comfortable with eigenvalues and eigenvectors, matrix multiplication, and dot products.


Why decompose matrices?

Solving Ax = b directly by computing A^{-1} is expensive and numerically fragile. Decompositions give us better alternatives: factor A into pieces with known structure (triangular, orthogonal, diagonal), then solve each piece cheaply.

Beyond solving linear systems, decompositions let us:

  • Understand the geometry a matrix encodes (SVD)
  • Solve least squares problems efficiently (QR)
  • Speed up repeated solves with different right-hand sides (LU)
  • Reduce dimensionality in data (PCA relies directly on SVD)

LU Decomposition

LU factors a square matrix A into a lower triangular matrix L and an upper triangular matrix U:

A = LU

L has ones on its diagonal and multipliers below. U is what you get after Gaussian elimination. If you have ever row-reduced a matrix by hand, you already know how to compute LU. The elimination steps give you U, and the multipliers you used become the entries of L.

When to use LU

LU is the workhorse for solving Ax = b when A is square. Once you have L and U, solving becomes two fast triangular solves:

  1. Solve Ly = b (forward substitution)
  2. Solve Ux = y (back substitution)

Both steps are O(n^2), while the factorization itself is O(n^3). If you need to solve Ax = b for many different b vectors, you factor once and reuse.
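This factor-once, solve-many pattern is exactly what scipy.linalg.lu_factor and lu_solve expose. A minimal sketch (the right-hand sides b1 and b2 are made up for illustration):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Factor once (O(n^3)); each subsequent solve is just two
# triangular substitutions (O(n^2)).
A = np.array([[2, 1, 1], [4, 3, 3], [8, 7, 9]], dtype=float)
lu_piv = lu_factor(A)  # combined L and U factors plus pivot indices

# Reuse the same factorization for several right-hand sides.
b1 = np.array([4.0, 10.0, 24.0])
b2 = np.array([1.0, 0.0, 0.0])
x1 = lu_solve(lu_piv, b1)  # x1 = [1, 1, 1]
x2 = lu_solve(lu_piv, b2)

print(np.allclose(A @ x1, b1))  # True
print(np.allclose(A @ x2, b2))  # True
```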

Example 1: LU decomposition of a 3x3 matrix

Decompose A into LU:

A = \begin{bmatrix} 2 & 1 & 1 \\ 4 & 3 & 3 \\ 8 & 7 & 9 \end{bmatrix}

Step 1: Eliminate below the first pivot (2).

Row 2 multiplier: \ell_{21} = 4 / 2 = 2

R_2 \leftarrow R_2 - 2 R_1 = [4 - 4,\; 3 - 2,\; 3 - 2] = [0, 1, 1]

Row 3 multiplier: \ell_{31} = 8 / 2 = 4

R_3 \leftarrow R_3 - 4 R_1 = [8 - 8,\; 7 - 4,\; 9 - 4] = [0, 3, 5]

Step 2: Eliminate below the second pivot (1).

Row 3 multiplier: \ell_{32} = 3 / 1 = 3

R_3 \leftarrow R_3 - 3 R_2 = [0 - 0,\; 3 - 3,\; 5 - 3] = [0, 0, 2]

Step 3: Assemble L and U.

L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 4 & 3 & 1 \end{bmatrix}, \quad U = \begin{bmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 2 \end{bmatrix}

Verify: Multiply LU:

LU = \begin{bmatrix} 1 \cdot 2 & 1 \cdot 1 & 1 \cdot 1 \\ 2 \cdot 2 + 1 \cdot 0 & 2 \cdot 1 + 1 \cdot 1 & 2 \cdot 1 + 1 \cdot 1 \\ 4 \cdot 2 + 3 \cdot 0 & 4 \cdot 1 + 3 \cdot 1 & 4 \cdot 1 + 3 \cdot 1 + 1 \cdot 2 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 1 \\ 4 & 3 & 3 \\ 8 & 7 & 9 \end{bmatrix} \; \checkmark

In practice, most implementations use partial pivoting (PA = LU) to avoid numerical issues with small pivots. SciPy's scipy.linalg.lu does this automatically (note that SciPy returns P, L, U such that A = PLU).

import numpy as np
from scipy.linalg import lu

A = np.array([[2, 1, 1], [4, 3, 3], [8, 7, 9]], dtype=float)
P, L, U = lu(A)

print("L =\n", L)
print("U =\n", U)

QR Decomposition

QR factors a matrix A (not necessarily square) into an orthogonal matrix Q and an upper triangular matrix R:

A = QR

Q has orthonormal columns, meaning Q^T Q = I. This orthogonality is what makes QR numerically stable.

When to use QR

QR is the standard approach for solving least squares problems. When you want to minimize \|Ax - b\|^2 and A is tall (more rows than columns), QR gives you the answer without forming A^T A (which can lose precision). It is also the backbone of the QR algorithm for computing eigenvalues.
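As a sketch of that approach: once you have A = QR, the least squares solution comes from the small triangular system Rx = Q^T b (the matrix and right-hand side below are made up for illustration):

```python
import numpy as np

# Least squares via QR: minimize ||Ax - b||^2 without forming A^T A.
A = np.array([[1, 1], [1, 0], [0, 1]], dtype=float)  # tall: 3 rows, 2 cols
b = np.array([1.0, 2.0, 3.0])

Q, R = np.linalg.qr(A)           # reduced QR: Q is 3x2, R is 2x2
x = np.linalg.solve(R, Q.T @ b)  # one small triangular solve

# Matches NumPy's reference least squares solver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_ref))  # True
```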

How it works: Gram-Schmidt

The classical way to build Q is the Gram-Schmidt process. You take the columns of A and orthogonalize them one by one: subtract the projection onto all previous basis vectors, then normalize.
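A toy implementation makes the process concrete. This is classical Gram-Schmidt for intuition only; library routines like np.linalg.qr use Householder reflections, which are more stable (the function name here is made up):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR for a matrix with independent columns.
    A sketch for intuition, not production use."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].astype(float)
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]  # projection coefficient
            v = v - R[i, j] * Q[:, i]    # subtract projection onto e_i
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]            # normalize
    return Q, R

A = np.array([[1, 1], [1, 0], [0, 1]], dtype=float)
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A))             # True: QR reproduces A
print(np.allclose(Q.T @ Q, np.eye(2)))   # True: columns are orthonormal
```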

Example 2: QR decomposition of a 3x2 matrix

Decompose:

A = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}

Step 1: Take the first column \mathbf{a}_1 = [1, 1, 0]^T.

\|\mathbf{a}_1\| = \sqrt{1^2 + 1^2 + 0^2} = \sqrt{2}

\mathbf{e}_1 = \frac{\mathbf{a}_1}{\|\mathbf{a}_1\|} = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \\ 0 \end{bmatrix}

Step 2: Take the second column \mathbf{a}_2 = [1, 0, 1]^T. Subtract its projection onto \mathbf{e}_1.

\text{proj} = (\mathbf{a}_2 \cdot \mathbf{e}_1)\,\mathbf{e}_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \\ 0 \end{bmatrix} = \begin{bmatrix} 1/2 \\ 1/2 \\ 0 \end{bmatrix}

\mathbf{a}_2' = \mathbf{a}_2 - \text{proj} = \begin{bmatrix} 1/2 \\ -1/2 \\ 1 \end{bmatrix}

\|\mathbf{a}_2'\| = \sqrt{1/4 + 1/4 + 1} = \sqrt{3/2} = \frac{\sqrt{6}}{2}

\mathbf{e}_2 = \frac{\mathbf{a}_2'}{\|\mathbf{a}_2'\|} = \begin{bmatrix} 1/\sqrt{6} \\ -1/\sqrt{6} \\ 2/\sqrt{6} \end{bmatrix}

Step 3: Assemble Q and R.

Q = \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{6} \\ 1/\sqrt{2} & -1/\sqrt{6} \\ 0 & 2/\sqrt{6} \end{bmatrix}

R comes from the projections we computed:

R = \begin{bmatrix} \sqrt{2} & 1/\sqrt{2} \\ 0 & \sqrt{6}/2 \end{bmatrix}

Verify: QR reproduces A:

QR = \begin{bmatrix} (1/\sqrt{2})(\sqrt{2}) + 0 & (1/\sqrt{2})(1/\sqrt{2}) + (1/\sqrt{6})(\sqrt{6}/2) \\ (1/\sqrt{2})(\sqrt{2}) + 0 & (1/\sqrt{2})(1/\sqrt{2}) + (-1/\sqrt{6})(\sqrt{6}/2) \\ 0 & (2/\sqrt{6})(\sqrt{6}/2) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \; \checkmark

import numpy as np

A = np.array([[1, 1], [1, 0], [0, 1]], dtype=float)
Q, R = np.linalg.qr(A)

print("Q =\n", Q)
print("R =\n", R)
print("QR =\n", Q @ R)  # Should match A

Singular Value Decomposition (SVD)

SVD is the most important decomposition in data science. Every matrix, regardless of shape, has an SVD. It factors A (an m \times n matrix) into:

A = U \Sigma V^T

  • U is m \times m orthogonal. Its columns are the left singular vectors.
  • \Sigma is m \times n diagonal. The diagonal entries \sigma_1 \geq \sigma_2 \geq \cdots \geq 0 are the singular values.
  • V is n \times n orthogonal. Its columns are the right singular vectors.

SVD structure with matrix shapes:

graph LR
  A["A<br/>m x n"] === U["U<br/>m x m<br/>orthogonal"]
  A === S["Sigma<br/>m x n<br/>diagonal"]
  A === VT["V^T<br/>n x n<br/>orthogonal"]
  style A fill:#fff9c4
  style U fill:#e1f5fe
  style S fill:#c8e6c9
  style VT fill:#e1f5fe

What SVD tells you

Think of any linear transformation A as three steps: rotate (V^T), scale (\Sigma), rotate again (U). SVD separates these steps. The singular values tell you how much stretching happens along each direction. Large singular values correspond to directions where the data has high variance.

SVD as three geometric steps:

graph LR
  X["Input vector x"] --> VT["V^T: Rotate<br/>Align to singular<br/>vector basis"]
  VT --> SIG["Sigma: Scale<br/>Stretch or shrink<br/>along each axis"]
  SIG --> U["U: Rotate<br/>Map to output<br/>basis"]
  U --> Y["Output Ax"]
  style VT fill:#e1f5fe
  style SIG fill:#c8e6c9
  style U fill:#e1f5fe

Singular value scree plot: the rapid drop-off shows most information is in the first few components

When to use SVD

  • Dimensionality reduction: Keep the top k singular values and you get the best rank-k approximation to A.
  • PCA: Principal Component Analysis is just SVD applied to the centered data matrix. The right singular vectors V are the principal components.
  • Pseudoinverse: The Moore-Penrose pseudoinverse A^+ is computed via SVD.
  • Noise reduction: Small singular values often correspond to noise. Truncating them denoises the data.
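To make the pseudoinverse point concrete, here is a sketch of the SVD route: invert only the nonzero singular values, then reassemble (the helper name and tolerance are made up; np.linalg.pinv does essentially this internally):

```python
import numpy as np

def pinv_via_svd(A, tol=1e-10):
    """Moore-Penrose pseudoinverse: A^+ = V Sigma^+ U^T,
    where Sigma^+ inverts only the nonzero singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.array([1.0 / x if x > tol else 0.0 for x in s])
    return Vt.T @ np.diag(s_inv) @ U.T

# A rank-deficient matrix: ordinary inversion would fail here.
A = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]], dtype=float)
print(np.allclose(pinv_via_svd(A), np.linalg.pinv(A)))  # True
```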

SVD and PCA: the connection

If X is your n \times d data matrix (centered, so each column has mean zero), then:

X = U \Sigma V^T

The principal components are the columns of V. The variance explained by the i-th component is \sigma_i^2 / (n - 1). You never need to form the covariance matrix X^T X explicitly. SVD gives you the same information, more stably.
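A quick numerical check of this equivalence, on made-up random data:

```python
import numpy as np

# PCA via SVD of the centered data matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.1])  # anisotropic data
Xc = X - X.mean(axis=0)  # center each column
n = Xc.shape[0]

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                  # rows are the principal components
explained_var = s**2 / (n - 1)   # variance along each component

# The covariance-matrix route gives the same variances (descending order).
cov_eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (n - 1))[::-1]
print(np.allclose(explained_var, cov_eigvals))  # True
```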

Example 3: SVD of a 2x2 matrix

Compute the SVD of:

A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}

Step 1: Compute A^T A.

A^T A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 5 & 4 \\ 4 & 5 \end{bmatrix}

Step 2: Find eigenvalues of A^T A.

\det(A^T A - \lambda I) = (5 - \lambda)^2 - 16 = \lambda^2 - 10\lambda + 9 = 0

\lambda_1 = 9, \quad \lambda_2 = 1

The singular values are \sigma_1 = \sqrt{9} = 3 and \sigma_2 = \sqrt{1} = 1.

Step 3: Find the right singular vectors (eigenvectors of A^T A).

For \lambda_1 = 9: solve (A^T A - 9I)\mathbf{v} = 0:

\begin{bmatrix} -4 & 4 \\ 4 & -4 \end{bmatrix}\mathbf{v} = 0 \implies v_1 = v_2 \implies \mathbf{v}_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}

For \lambda_2 = 1: solve (A^T A - I)\mathbf{v} = 0:

\begin{bmatrix} 4 & 4 \\ 4 & 4 \end{bmatrix}\mathbf{v} = 0 \implies v_1 = -v_2 \implies \mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}

Step 4: Find the left singular vectors: \mathbf{u}_i = \frac{1}{\sigma_i} A \mathbf{v}_i.

\mathbf{u}_1 = \frac{1}{3}\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \frac{1}{3\sqrt{2}}\begin{bmatrix} 3 \\ 3 \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}

\mathbf{u}_2 = \frac{1}{1}\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}

Step 5: Assemble the SVD.

U = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}, \quad V^T = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}

Interpretation: The matrix A stretches by a factor of 3 along the [1, 1]^T / \sqrt{2} direction and by a factor of 1 along [1, -1]^T / \sqrt{2}. If this were a data covariance matrix, the first direction captures 9 times more variance than the second.

import numpy as np

A = np.array([[2, 1], [1, 2]], dtype=float)
U, s, Vt = np.linalg.svd(A)

print("U =\n", U)
print("Singular values:", s)
print("V^T =\n", Vt)
print("Reconstructed A =\n", U @ np.diag(s) @ Vt)

Example 4: Low-rank approximation with SVD

SVD gives you the best rank-k approximation to any matrix (in terms of Frobenius norm). This is the Eckart-Young theorem.

Take a 3 \times 3 matrix:

A = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}

Computing SVD (you can verify with NumPy), the singular values are \sigma_1 = 2, \sigma_2 = 1, \sigma_3 = 0.

The fact that \sigma_3 = 0 tells us A is already rank 2. To get the best rank-1 approximation, keep only \sigma_1:

A_1 = \sigma_1 \mathbf{u}_1 \mathbf{v}_1^T = 2 \cdot \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \cdot \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 1 \end{bmatrix}

This rank-1 matrix captures the dominant pattern in A. The second singular value (\sigma_2 = 1) captures the remaining structure in the second row/column.
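The same numbers fall out of NumPy, as a check on Example 4:

```python
import numpy as np

A = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A)
# Singular values are approximately [2, 1, 0]: A has rank 2.

# Best rank-1 approximation: keep only the top singular triplet.
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.allclose(A1, [[1, 0, 1], [0, 0, 0], [1, 0, 1]]))  # True
```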


Choosing the right decomposition

Decomposition | Requirements  | Best for                                      | Cost
LU            | Square matrix | Solving Ax = b repeatedly                     | O(n^3) factor, O(n^2) per solve
QR            | Any shape     | Least squares, eigenvalue algorithms          | O(mn^2) for m \times n
SVD           | Any shape     | Dimensionality reduction, pseudoinverse, rank | O(mn^2) for m \times n

Rules of thumb:

  • Need to solve Ax = b for a square A? Start with LU.
  • Need least squares (overdetermined system)? Use QR. It is more stable than forming the normal equations A^T A x = A^T b.
  • Need to understand the structure of your data, reduce dimensions, or handle rank-deficient matrices? Use SVD.

Choosing the right decomposition:

graph TD
  Q["What is your problem?"] --> SQ{"Square matrix<br/>solving Ax = b?"}
  SQ -->|"Yes"| LU["Use LU<br/>Fast repeated solves"]
  SQ -->|"No"| LS{"Least squares<br/>overdetermined system?"}
  LS -->|"Yes"| QR["Use QR<br/>Numerically stable"]
  LS -->|"No"| SVD["Use SVD<br/>Dimensionality reduction,<br/>rank analysis,<br/>pseudoinverse"]
  style LU fill:#e1f5fe
  style QR fill:#c8e6c9
  style SVD fill:#fff9c4

Key properties to remember

LU:

  • \det(A) = \det(L) \cdot \det(U) = \prod u_{ii} (product of the diagonal entries of U, since L has ones on the diagonal)
  • Not every matrix has an LU factorization without pivoting
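A quick check of the determinant identity, using the L and U from Example 1:

```python
import numpy as np

# det(A) = product of U's diagonal when L has a unit diagonal.
L = np.array([[1, 0, 0], [2, 1, 0], [4, 3, 1]], dtype=float)
U = np.array([[2, 1, 1], [0, 1, 1], [0, 0, 2]], dtype=float)
A = L @ U

print(np.prod(np.diag(U)))                # 4.0
print(np.isclose(np.linalg.det(A), 4.0))  # True
```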

QR:

  • Q^T Q = I, so Q^{-1} = Q^T (cheap to invert)
  • \|Qx\| = \|x\| for any x (orthogonal matrices preserve lengths)
  • Numerically stable for least squares

SVD:

  • Always exists for any matrix
  • Singular values are non-negative and conventionally sorted in decreasing order
  • The rank of AA equals the number of nonzero singular values
  • \|A\|_2 = \sigma_1 (the largest singular value is the spectral norm)
  • \|A\|_F = \sqrt{\sigma_1^2 + \sigma_2^2 + \cdots} (the Frobenius norm relates to all singular values)
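Both norm identities are easy to confirm numerically, here with the matrix from Example 3:

```python
import numpy as np

A = np.array([[2, 1], [1, 2]], dtype=float)
s = np.linalg.svd(A, compute_uv=False)  # singular values [3, 1]

# Spectral norm equals the largest singular value.
print(np.isclose(np.linalg.norm(A, 2), s[0]))                        # True
# Frobenius norm equals the root sum of squared singular values.
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2))))  # True
```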

What comes next

Now that you know how to decompose matrices, the next step is understanding how to measure vectors and matrices. Head to Norms, Distances, and Similarity to learn about L_1, L_2, and cosine similarity, all of which build on the ideas from SVD and orthogonality we covered here.
