
Dimensionality Reduction: PCA

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

Principal Component Analysis (PCA) takes high-dimensional data and finds a smaller set of axes that capture most of the variation. If your data has 100 features but only 5 underlying patterns, PCA can find those 5 directions. You get simpler models, faster training, and sometimes better generalization.

Prerequisites: You should be comfortable with eigenvalues and eigenvectors and matrix decompositions (especially SVD).

Too many columns: the real-world problem

Imagine you collect data on houses. Each house has ten features.

| House | Sq ft | Beds | Baths | Lot size | Year built | Garage spots | Floors | Basement | Pool | Dist to school |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1200 | 2 | 1 | 5000 | 1985 | 1 | 1 | No | No | 0.5 mi |
| 2 | 2400 | 4 | 3 | 8000 | 2010 | 2 | 2 | Yes | Yes | 1.2 mi |
| 3 | 1800 | 3 | 2 | 6000 | 1995 | 2 | 1 | Yes | No | 0.8 mi |
| 4 | 3200 | 5 | 4 | 12000 | 2020 | 3 | 2 | Yes | Yes | 2.0 mi |
| 5 | 1000 | 1 | 1 | 3000 | 1970 | 0 | 1 | No | No | 0.3 mi |

Ten features is already hard to visualize. Real datasets can have hundreds or thousands. Many of these features overlap: bigger houses tend to have more bedrooms, more bathrooms, bigger lots, and more garage spots. The information is redundant.

PCA finds which directions in your data carry the most information and keeps only those. It combines correlated features into a smaller set of new features (called principal components) that capture most of the variation in the data.

PCA projects data from many dimensions down to a few

graph LR
  A["10 original features"] --> B["PCA finds key directions"]
  B --> C["Keep top 2-3 components"]
  C --> D["95% of information retained"]
  D --> E["Simpler, faster models"]

Now let’s see the math behind this idea, starting with what “maximum variance” really means.

The core idea: maximize variance

Given $n$ data points $x_1, \ldots, x_n \in \mathbb{R}^d$, we want to find a direction $u \in \mathbb{R}^d$ (with $\|u\| = 1$) such that the projections of the data onto $u$ have maximum variance.

The projection of $x_i$ onto $u$ is $z_i = u^T x_i$. The variance of these projections (assuming centered data with mean zero) is:

$$\text{Var}(z) = \frac{1}{n}\sum_{i=1}^n (u^T x_i)^2 = u^T \left(\frac{1}{n}\sum_{i=1}^n x_i x_i^T\right) u = u^T \Sigma\, u$$

where $\Sigma = \frac{1}{n} X^T X$ is the covariance matrix of the centered data.

We want to maximize $u^T \Sigma\, u$ subject to $\|u\| = 1$. Using a Lagrangian:

$$L(u, \lambda) = u^T \Sigma\, u - \lambda(u^T u - 1)$$

Setting the gradient to zero:

$$\Sigma\, u = \lambda u$$

This is an eigenvalue equation. The direction that maximizes variance is the eigenvector of $\Sigma$ with the largest eigenvalue. The second principal component is the eigenvector with the second largest eigenvalue, and so on.
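This claim is easy to sanity-check numerically. The sketch below (NumPy, on a small synthetic dataset) confirms that the variance of the projections equals $u^T \Sigma\, u$, and that no random unit vector beats the top eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: three features with very different spreads
X = rng.normal(size=(200, 3)) @ np.array([[3, 0, 0], [0, 1, 0], [0, 0, 0.2]])
X = X - X.mean(axis=0)                    # center
Sigma = X.T @ X / len(X)                  # covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh returns eigenvalues in ascending order
u = eigvecs[:, -1]                        # top eigenvector

# Variance of the projections equals u^T Sigma u (and the top eigenvalue)
z = X @ u
assert np.isclose(z.var(), u @ Sigma @ u)
assert np.isclose(z.var(), eigvals[-1])

# No random unit vector achieves higher projection variance
for _ in range(100):
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    assert (X @ v).var() <= z.var() + 1e-12
```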

Correlated data with first principal component direction

The PCA recipe

  1. Center the data: subtract the mean $\bar{x} = \frac{1}{n}\sum_i x_i$ from each point
  2. Compute the covariance matrix $\Sigma = \frac{1}{n}X^T X$ (where $X$ is the centered data matrix, each row a data point)
  3. Find the eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d$ and eigenvectors $u_1, u_2, \ldots, u_d$
  4. Choose $k$ components (more on this below)
  5. Project data onto the top $k$ eigenvectors: $z_i = U_k^T(x_i - \bar{x})$, where $U_k = [u_1 \; u_2 \; \ldots \; u_k]$
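Translated line for line into NumPy, the recipe looks like this (a sketch using the four-point dataset from Example 1 below; `np.linalg.eigh` is NumPy's eigensolver for symmetric matrices):

```python
import numpy as np

X = np.array([[0, 0], [4, 2], [2, 4], [2, 2]], dtype=float)

# 1. Center the data
mean = X.mean(axis=0)
Xc = X - mean

# 2. Covariance matrix
Sigma = Xc.T @ Xc / len(X)

# 3. Eigenvalues/eigenvectors, sorted largest first
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Choose k
k = 1

# 5. Project onto the top k eigenvectors
U_k = eigvecs[:, :k]
Z = Xc @ U_k

print(eigvals)     # [3. 1.]
print(Z.ravel())   # magnitudes 2.83, 1.41, 1.41, 0 (sign of the eigenvector may flip)
```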

PCA step by step

graph TD
  A["Raw data matrix X"] --> B["Center: subtract mean from each feature"]
  B --> C["Compute covariance matrix"]
  C --> D["Find eigenvalues and eigenvectors"]
  D --> E["Sort by eigenvalue, largest first"]
  E --> F["Pick top k eigenvectors"]
  F --> G["Project data onto k directions"]

Example 1: PCA on a 4-point 2D dataset

Consider four points in 2D:

| Point | $x$ |
|---|---|
| $p_1$ | $(0, 0)$ |
| $p_2$ | $(4, 2)$ |
| $p_3$ | $(2, 4)$ |
| $p_4$ | $(2, 2)$ |

Step 1: Center the data.

$$\bar{x} = \frac{(0,0) + (4,2) + (2,4) + (2,2)}{4} = \frac{(8, 8)}{4} = (2, 2)$$

Centered data:

| Point | Centered $\tilde{x}$ |
|---|---|
| $p_1$ | $(-2, -2)$ |
| $p_2$ | $(2, 0)$ |
| $p_3$ | $(0, 2)$ |
| $p_4$ | $(0, 0)$ |

Step 2: Compute the covariance matrix.

$$X = \begin{bmatrix} -2 & -2 \\ 2 & 0 \\ 0 & 2 \\ 0 & 0 \end{bmatrix}$$

$$X^T X = \begin{bmatrix} (-2)^2 + 2^2 + 0^2 + 0^2 & (-2)(-2) + (2)(0) + (0)(2) + (0)(0) \\ (-2)(-2) + (2)(0) + (0)(2) + (0)(0) & (-2)^2 + 0^2 + 2^2 + 0^2 \end{bmatrix} = \begin{bmatrix} 8 & 4 \\ 4 & 8 \end{bmatrix}$$

$$\Sigma = \frac{1}{4}\begin{bmatrix} 8 & 4 \\ 4 & 8 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$$

Step 3: Find eigenvalues.

$$\det(\Sigma - \lambda I) = (2 - \lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = (\lambda - 3)(\lambda - 1) = 0$$

$$\lambda_1 = 3, \quad \lambda_2 = 1$$

Step 4: Find eigenvectors.

For $\lambda_1 = 3$:

$$(2 - 3)v_1 + 1 \cdot v_2 = 0 \implies v_2 = v_1$$

$$u_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}$$

For $\lambda_2 = 1$:

$$(2 - 1)v_1 + 1 \cdot v_2 = 0 \implies v_2 = -v_1$$

$$u_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}$$

Step 5: Variance explained.

$$\text{PC1 explains:} \quad \frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{3}{4} = 75\%$$

$$\text{PC2 explains:} \quad \frac{\lambda_2}{\lambda_1 + \lambda_2} = \frac{1}{4} = 25\%$$

The first principal component, the direction $(1,1)/\sqrt{2}$, captures 75% of the total variance. This makes sense: the data spreads more along the diagonal $x_1 = x_2$ than perpendicular to it.

Variance captured by each principal component

graph LR
  A["PC1: 75% of variance"] --> C["Total: 100%"]
  B["PC2: 25% of variance"] --> C

Example 2: projecting onto the first principal component

Using the results from Example 1, project each centered point onto $u_1 = \frac{1}{\sqrt{2}}(1, 1)$:

$$z_i = u_1^T \tilde{x}_i$$

$$z_1 = \frac{1}{\sqrt{2}}(1 \cdot (-2) + 1 \cdot (-2)) = \frac{-4}{\sqrt{2}} = -2\sqrt{2} \approx -2.83$$

$$z_2 = \frac{1}{\sqrt{2}}(1 \cdot 2 + 1 \cdot 0) = \frac{2}{\sqrt{2}} = \sqrt{2} \approx 1.41$$

$$z_3 = \frac{1}{\sqrt{2}}(1 \cdot 0 + 1 \cdot 2) = \frac{2}{\sqrt{2}} = \sqrt{2} \approx 1.41$$

$$z_4 = \frac{1}{\sqrt{2}}(1 \cdot 0 + 1 \cdot 0) = 0$$

We reduced each 2D point to a single number. The 2D dataset is now a 1D dataset: $\{-2.83, 1.41, 1.41, 0\}$.

Reconstruction from 1D back to 2D

To reconstruct, multiply the 1D coordinate by $u_1$ and add back the mean:

$$\hat{x}_i = z_i \cdot u_1 + \bar{x}$$

| Point | Original | 1D projection $z_i$ | Reconstructed $\hat{x}_i$ | Error |
|---|---|---|---|---|
| $p_1$ | $(0, 0)$ | $-2\sqrt{2}$ | $(-2, -2) + (2, 2) = (0, 0)$ | $0$ |
| $p_2$ | $(4, 2)$ | $\sqrt{2}$ | $(1, 1) + (2, 2) = (3, 3)$ | $\sqrt{2} \approx 1.41$ |
| $p_3$ | $(2, 4)$ | $\sqrt{2}$ | $(1, 1) + (2, 2) = (3, 3)$ | $\sqrt{2} \approx 1.41$ |
| $p_4$ | $(2, 2)$ | $0$ | $(0, 0) + (2, 2) = (2, 2)$ | $0$ |

Points $p_1$ and $p_4$ are reconstructed exactly because they lie on the first principal component direction. Points $p_2$ and $p_3$ lose information in the perpendicular direction.

Total reconstruction error:

$$\sum_i \|\hat{x}_i - x_i\|^2 = 0 + 2 + 2 + 0 = 4$$

This equals $n \cdot \lambda_2 = 4 \times 1 = 4$. The reconstruction error from dropping component $k$ is always $n$ times the eigenvalue $\lambda_k$. That is exactly why we keep the components with the largest eigenvalues and drop the ones with the smallest.
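The identity is not specific to this example. A sketch that checks it on random data (the dataset and dimensions here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
# Random correlated data: 50 points in 5 dimensions
X = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)
n = len(X)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals = S**2 / n
k = 2

# Reconstruct from the top k components only
X_hat = (Xc @ Vt[:k].T) @ Vt[:k]
error = np.sum((Xc - X_hat)**2)

# Identity: total squared error equals n times the sum of dropped eigenvalues
assert np.isclose(error, n * eigvals[k:].sum())
```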

Connection to SVD

You do not actually need to compute the covariance matrix to do PCA. The singular value decomposition (SVD) of the centered data matrix gives you everything directly.

If $X = U S V^T$ is the SVD of the $n \times d$ centered data matrix, then:

$$X^T X = V S^T U^T U S V^T = V S^2 V^T$$

Since $\Sigma = \frac{1}{n} X^T X$, the columns of $V$ are the eigenvectors of $\Sigma$ and the eigenvalues are $\lambda_i = s_i^2 / n$, where $s_i$ are the singular values.

Why use SVD instead of eigendecomposition?

  • Numerical stability: Computing $X^T X$ squares the condition number, which amplifies floating-point errors. SVD avoids this.
  • Efficiency for wide data: When $d \gg n$, the covariance matrix $\Sigma$ is $d \times d$. SVD works directly with the $n \times d$ data matrix.
  • Already implemented: Every numerical library has a fast, robust SVD.

In practice, always use SVD for PCA. Never form the covariance matrix explicitly unless you are teaching someone how PCA works (like right now).
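A quick check of the relation $\lambda_i = s_i^2 / n$ on random data (a sketch; the two routes should agree to machine precision):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
Xc = X - X.mean(axis=0)
n = len(X)

# Route 1: eigendecomposition of the covariance matrix
Sigma = Xc.T @ Xc / n
eigvals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]  # sort largest first

# Route 2: singular values of the centered data matrix
S = np.linalg.svd(Xc, compute_uv=False)             # returned largest first

assert np.allclose(eigvals, S**2 / n)
```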

Choosing the number of components

The key question: how many principal components should you keep?

Variance explained threshold. Pick the smallest $k$ such that:

$$\frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^d \lambda_i} \geq \text{threshold}$$

A common threshold is 95%. In our example, keeping just PC1 gives 75% of the variance. You would need both components for 100%.

Scree plot. Plot eigenvalues in decreasing order and look for an “elbow” where they drop off. Components after the elbow add little.

Downstream task. Sometimes you choose $k$ based on what gives the best performance on your actual task (classification accuracy, regression error). Cross-validation helps here.
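The variance-threshold rule takes only a few lines. A sketch, assuming `eigvals` holds the eigenvalues sorted largest first:

```python
import numpy as np

def choose_k(eigvals, threshold=0.95):
    """Smallest k whose top-k eigenvalues reach the variance threshold."""
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    ratios[-1] = 1.0  # guard against floating-point undershoot at 100%
    return int(np.searchsorted(ratios, threshold) + 1)

# On the worked example (eigenvalues 3 and 1): PC1 alone gives 75%,
# so the default 95% threshold requires both components
print(choose_k(np.array([3.0, 1.0])))        # 2
print(choose_k(np.array([3.0, 1.0]), 0.75))  # 1
```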

Before vs after dimensionality reduction

graph LR
  subgraph Before["Before PCA"]
      B1["Feature 1"]
      B2["Feature 2"]
      B3["Feature 3"]
      B4["..."]
      B5["Feature 100"]
  end
  subgraph After["After PCA"]
      A1["PC 1"]
      A2["PC 2"]
      A3["PC 3"]
  end
  Before --> PCA["PCA transform"]
  PCA --> After

PCA in Python

import numpy as np

X = np.array([[0, 0], [4, 2], [2, 4], [2, 2]], dtype=float)

# Center
mean = X.mean(axis=0)
X_centered = X - mean

# SVD
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Eigenvalues (variance explained)
eigenvalues = S**2 / len(X)
print("Eigenvalues:", eigenvalues)      # [3. 1.]
print("PC1 direction:", Vt[0])          # [0.707 0.707], up to a sign flip

# Project onto first component
Z = X_centered @ Vt[0]
print("Projections:", Z)                # [-2.83  1.41  1.41  0.], signs follow PC1

# Reconstruct
X_reconstructed = np.outer(Z, Vt[0]) + mean
print("Reconstruction error:", np.sum((X - X_reconstructed)**2))  # 4.0

When PCA helps (and when it doesn’t)

PCA works well when:

  • Features are correlated. PCA exploits correlations to find a compact representation.
  • You want to visualize high-dimensional data in 2D or 3D.
  • You need to reduce computation for a downstream model.
  • You want a preprocessing step before regularized regression or classification.

PCA does not help when:

  • Features are already independent and equally important. PCA will find no useful reduction.
  • The structure in the data is nonlinear. PCA finds linear projections. For nonlinear dimensionality reduction, look at t-SNE or UMAP.
  • You need interpretable features. Principal components are linear combinations of all original features, which can be hard to interpret.

Standardization matters

If your features have different scales (height in cm vs. weight in kg), the feature with the largest variance will dominate the first principal component. This is usually not what you want.

Solution: standardize each feature to have mean 0 and standard deviation 1 before applying PCA. This is equivalent to doing PCA on the correlation matrix instead of the covariance matrix.
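A sketch of the effect, using synthetic heights and weights (the specific means and spreads are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# Heights in cm (std ~10) and correlated weights in kg (std ~5)
height_cm = rng.normal(170, 10, size=500)
weight_kg = 0.4 * (height_cm - 170) + 70 + rng.normal(0, 3, size=500)
X = np.column_stack([height_cm, weight_kg])

def first_pc(X):
    """Top right singular vector of the centered data = first PC direction."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

# Raw features: height, with the largest variance, dominates PC1
print(first_pc(X))      # roughly [0.92 0.40] (up to sign)

# Standardized: both features contribute equally to PC1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(first_pc(X_std))  # roughly [0.71 0.71] (up to sign)
```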

Summary

| Concept | Key idea |
|---|---|
| Goal | Find directions of maximum variance |
| Math | Eigenvectors of covariance matrix |
| Eigenvalue $\lambda_k$ | Variance captured by the $k$-th component |
| SVD connection | More stable, avoids forming covariance matrix |
| Reconstruction error | $n \times$ sum of dropped eigenvalues |
| Choosing $k$ | Variance threshold, scree plot, or cross-validation |

What comes next

PCA gives you a deterministic way to reduce dimensions, but it assumes the data lies near a single linear subspace. What if your data comes from a mixture of several groups? In the next post on Gaussian mixture models and the EM algorithm, you will learn how to model data as a soft combination of multiple distributions.
