
Dimensionality Reduction: PCA

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

Principal Component Analysis (PCA) takes high-dimensional data and finds a smaller set of axes that capture most of the variation. If your data has 100 features but only 5 underlying patterns, PCA can find those 5 directions. You get simpler models, faster training, and sometimes better generalization.

Prerequisites: You should be comfortable with eigenvalues and eigenvectors and matrix decompositions (especially SVD).

Too many columns: the real-world problem

Imagine you collect data on houses. Each house has ten features.

| House | Sq ft | Beds | Baths | Lot size | Year built | Garage spots | Floors | Basement | Pool | Dist to school |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1200 | 2 | 1 | 5000 | 1985 | 1 | 1 | No | No | 0.5 mi |
| 2 | 2400 | 4 | 3 | 8000 | 2010 | 2 | 2 | Yes | Yes | 1.2 mi |
| 3 | 1800 | 3 | 2 | 6000 | 1995 | 2 | 1 | Yes | No | 0.8 mi |
| 4 | 3200 | 5 | 4 | 12000 | 2020 | 3 | 2 | Yes | Yes | 2.0 mi |
| 5 | 1000 | 1 | 1 | 3000 | 1970 | 0 | 1 | No | No | 0.3 mi |

Ten features is already hard to visualize. Real datasets can have hundreds or thousands. Many of these features overlap: bigger houses tend to have more bedrooms, more bathrooms, bigger lots, and more garage spots. The information is redundant.

PCA finds which directions in your data carry the most information and keeps only those. It combines correlated features into a smaller set of new features (called principal components) that capture most of the variation in the data.

PCA projects data from many dimensions down to a few

graph LR
  A["10 original features"] --> B["PCA finds key directions"]
  B --> C["Keep top 2-3 components"]
  C --> D["95% of information retained"]
  D --> E["Simpler, faster models"]

Now let’s see the math behind this idea, starting with what “maximum variance” really means.

The core idea: maximize variance

Given $n$ data points $x_1, \ldots, x_n \in \mathbb{R}^d$, we want to find a direction $u \in \mathbb{R}^d$ (with $\|u\| = 1$) such that the projections of the data onto $u$ have maximum variance.

The projection of $x_i$ onto $u$ is $z_i = u^T x_i$. The variance of these projections (assuming centered data with mean zero) is:

$$\text{Var}(z) = \frac{1}{n}\sum_{i=1}^n (u^T x_i)^2 = u^T \left(\frac{1}{n}\sum_{i=1}^n x_i x_i^T\right) u = u^T \Sigma\, u$$

where $\Sigma = \frac{1}{n} X^T X$ is the covariance matrix of the centered data.

We want to maximize $u^T \Sigma\, u$ subject to $\|u\| = 1$. Using a Lagrangian:

$$L(u, \lambda) = u^T \Sigma\, u - \lambda(u^T u - 1)$$

Setting the gradient to zero:

$$\Sigma\, u = \lambda u$$

This is an eigenvalue equation. The direction that maximizes variance is the eigenvector of $\Sigma$ with the largest eigenvalue. The second principal component is the eigenvector with the second largest eigenvalue, and so on.
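This claim is easy to sanity-check numerically. The sketch below (NumPy, on a small synthetic dataset) confirms that the variance of the projections equals $u^T \Sigma\, u$, and that no random unit vector beats the top eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: three features with very different spreads
X = rng.normal(size=(200, 3)) @ np.array([[3, 0, 0], [0, 1, 0], [0, 0, 0.2]])
X = X - X.mean(axis=0)                    # center
Sigma = X.T @ X / len(X)                  # covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh returns eigenvalues in ascending order
u = eigvecs[:, -1]                        # top eigenvector

# Variance of the projections equals u^T Sigma u (and the top eigenvalue)
z = X @ u
assert np.isclose(z.var(), u @ Sigma @ u)
assert np.isclose(z.var(), eigvals[-1])

# No random unit vector achieves higher projection variance
for _ in range(100):
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    assert (X @ v).var() <= z.var() + 1e-12
```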

Correlated data with first principal component direction

The PCA recipe

  1. Center the data: subtract the mean $\bar{x} = \frac{1}{n}\sum_i x_i$ from each point
  2. Compute the covariance matrix $\Sigma = \frac{1}{n}X^T X$ (where $X$ is the centered data matrix, each row a data point)
  3. Find the eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d$ and eigenvectors $u_1, u_2, \ldots, u_d$
  4. Choose $k$ components (more on this below)
  5. Project data onto the top $k$ eigenvectors: $z_i = U_k^T(x_i - \bar{x})$, where $U_k = [u_1 \; u_2 \; \ldots \; u_k]$
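Translated line for line into NumPy, the recipe looks like this (a sketch using the four-point dataset from Example 1 below; `np.linalg.eigh` is NumPy's eigensolver for symmetric matrices):

```python
import numpy as np

X = np.array([[0, 0], [4, 2], [2, 4], [2, 2]], dtype=float)

# 1. Center the data
mean = X.mean(axis=0)
Xc = X - mean

# 2. Covariance matrix
Sigma = Xc.T @ Xc / len(X)

# 3. Eigenvalues/eigenvectors, sorted largest first
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Choose k
k = 1

# 5. Project onto the top k eigenvectors
U_k = eigvecs[:, :k]
Z = Xc @ U_k

print(eigvals)     # [3. 1.]
print(Z.ravel())   # magnitudes 2.83, 1.41, 1.41, 0 (sign of the eigenvector may flip)
```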

PCA step by step

graph TD
  A["Raw data matrix X"] --> B["Center: subtract mean from each feature"]
  B --> C["Compute covariance matrix"]
  C --> D["Find eigenvalues and eigenvectors"]
  D --> E["Sort by eigenvalue, largest first"]
  E --> F["Pick top k eigenvectors"]
  F --> G["Project data onto k directions"]

Example 1: PCA on a 4-point 2D dataset

Consider four points in 2D:

| Point | $x$ |
|---|---|
| $p_1$ | $(0, 0)$ |
| $p_2$ | $(4, 2)$ |
| $p_3$ | $(2, 4)$ |
| $p_4$ | $(2, 2)$ |

Step 1: Center the data.

$$\bar{x} = \frac{(0,0) + (4,2) + (2,4) + (2,2)}{4} = \frac{(8, 8)}{4} = (2, 2)$$

Centered data:

| Point | Centered $\tilde{x}$ |
|---|---|
| $p_1$ | $(-2, -2)$ |
| $p_2$ | $(2, 0)$ |
| $p_3$ | $(0, 2)$ |
| $p_4$ | $(0, 0)$ |

Step 2: Compute the covariance matrix.

$$X = \begin{bmatrix} -2 & -2 \\ 2 & 0 \\ 0 & 2 \\ 0 & 0 \end{bmatrix}$$

$$X^T X = \begin{bmatrix} (-2)^2 + 2^2 + 0^2 + 0^2 & (-2)(-2) + (2)(0) + (0)(2) + (0)(0) \\ (-2)(-2) + (2)(0) + (0)(2) + (0)(0) & (-2)^2 + 0^2 + 2^2 + 0^2 \end{bmatrix} = \begin{bmatrix} 8 & 4 \\ 4 & 8 \end{bmatrix}$$

$$\Sigma = \frac{1}{4}\begin{bmatrix} 8 & 4 \\ 4 & 8 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$$

Step 3: Find eigenvalues.

$$\det(\Sigma - \lambda I) = (2 - \lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = (\lambda - 3)(\lambda - 1) = 0$$

$$\lambda_1 = 3, \quad \lambda_2 = 1$$

Step 4: Find eigenvectors.

For $\lambda_1 = 3$:

$$(2 - 3)v_1 + 1 \cdot v_2 = 0 \implies v_2 = v_1$$

$$u_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}$$

For $\lambda_2 = 1$:

$$(2 - 1)v_1 + 1 \cdot v_2 = 0 \implies v_2 = -v_1$$

$$u_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}$$

Step 5: Variance explained.

$$\text{PC1 explains:} \quad \frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{3}{4} = 75\%$$

$$\text{PC2 explains:} \quad \frac{\lambda_2}{\lambda_1 + \lambda_2} = \frac{1}{4} = 25\%$$

The first principal component, the direction $(1,1)/\sqrt{2}$, captures 75% of the total variance. This makes sense: the data spreads more along the diagonal $x_1 = x_2$ than perpendicular to it.

Variance captured by each principal component

graph LR
  A["PC1: 75% of variance"] --> C["Total: 100%"]
  B["PC2: 25% of variance"] --> C

Example 2: projecting onto the first principal component

Using the results from Example 1, project each centered point onto $u_1 = \frac{1}{\sqrt{2}}(1, 1)$:

$$z_i = u_1^T \tilde{x}_i$$

$$z_1 = \frac{1}{\sqrt{2}}(1 \cdot (-2) + 1 \cdot (-2)) = \frac{-4}{\sqrt{2}} = -2\sqrt{2} \approx -2.83$$

$$z_2 = \frac{1}{\sqrt{2}}(1 \cdot 2 + 1 \cdot 0) = \frac{2}{\sqrt{2}} = \sqrt{2} \approx 1.41$$

$$z_3 = \frac{1}{\sqrt{2}}(1 \cdot 0 + 1 \cdot 2) = \frac{2}{\sqrt{2}} = \sqrt{2} \approx 1.41$$

$$z_4 = \frac{1}{\sqrt{2}}(1 \cdot 0 + 1 \cdot 0) = 0$$

We reduced each 2D point to a single number. The 2D dataset is now a 1D dataset: $\{-2.83, 1.41, 1.41, 0\}$.

Reconstruction from 1D back to 2D

To reconstruct, multiply the 1D coordinate by $u_1$ and add back the mean:

$$\hat{x}_i = z_i \cdot u_1 + \bar{x}$$

| Point | Original | 1D projection $z_i$ | Reconstructed $\hat{x}_i$ | Error |
|---|---|---|---|---|
| $p_1$ | $(0, 0)$ | $-2\sqrt{2}$ | $(-2, -2) + (2, 2) = (0, 0)$ | $0$ |
| $p_2$ | $(4, 2)$ | $\sqrt{2}$ | $(1, 1) + (2, 2) = (3, 3)$ | $\sqrt{2} \approx 1.41$ |
| $p_3$ | $(2, 4)$ | $\sqrt{2}$ | $(1, 1) + (2, 2) = (3, 3)$ | $\sqrt{2} \approx 1.41$ |
| $p_4$ | $(2, 2)$ | $0$ | $(0, 0) + (2, 2) = (2, 2)$ | $0$ |

Points $p_1$ and $p_4$ are reconstructed exactly because they lie on the first principal component direction. Points $p_2$ and $p_3$ lose information in the perpendicular direction.

Total reconstruction error:

$$\sum_i \|\hat{x}_i - x_i\|^2 = 0 + 2 + 2 + 0 = 4$$

This equals $n \cdot \lambda_2 = 4 \times 1 = 4$. The reconstruction error from dropping component $k$ is always $n$ times the eigenvalue $\lambda_k$. That is exactly why we keep the components with the largest eigenvalues and drop the ones with the smallest.
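The identity is not specific to this example. A sketch that checks it on random data (the dataset and dimensions here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
# Random correlated data: 50 points in 5 dimensions
X = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)
n = len(X)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals = S**2 / n
k = 2

# Reconstruct from the top k components only
X_hat = (Xc @ Vt[:k].T) @ Vt[:k]
error = np.sum((Xc - X_hat)**2)

# Identity: total squared error equals n times the sum of dropped eigenvalues
assert np.isclose(error, n * eigvals[k:].sum())
```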

Connection to SVD

You do not actually need to compute the covariance matrix to do PCA. The singular value decomposition (SVD) of the centered data matrix gives you everything directly.

If $X = U S V^T$ is the SVD of the $n \times d$ centered data matrix, then:

$$X^T X = V S^T U^T U S V^T = V S^2 V^T$$

Since $\Sigma = \frac{1}{n} X^T X$, the columns of $V$ are the eigenvectors of $\Sigma$ and the eigenvalues are $\lambda_i = s_i^2 / n$, where $s_i$ are the singular values.

Why use SVD instead of eigendecomposition?

  • Numerical stability: Computing $X^T X$ squares the condition number, which amplifies floating-point errors. SVD avoids this.
  • Efficiency for wide data: When $d \gg n$, the covariance matrix $\Sigma$ is $d \times d$. SVD works directly with the $n \times d$ data matrix.
  • Already implemented: Every numerical library has a fast, robust SVD.

In practice, always use SVD for PCA. Never form the covariance matrix explicitly unless you are teaching someone how PCA works (like right now).
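A quick check of the relation $\lambda_i = s_i^2 / n$ on random data (a sketch; the two routes should agree to machine precision):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
Xc = X - X.mean(axis=0)
n = len(X)

# Route 1: eigendecomposition of the covariance matrix
Sigma = Xc.T @ Xc / n
eigvals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]  # sort largest first

# Route 2: singular values of the centered data matrix
S = np.linalg.svd(Xc, compute_uv=False)             # returned largest first

assert np.allclose(eigvals, S**2 / n)
```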

Choosing the number of components

The key question: how many principal components should you keep?

Variance explained threshold. Pick the smallest $k$ such that:

$$\frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^d \lambda_i} \geq \text{threshold}$$

A common threshold is 95%. In our example, keeping just PC1 gives 75% of the variance. You would need both components for 100%.

Scree plot. Plot eigenvalues in decreasing order and look for an “elbow” where they drop off. Components after the elbow add little.

Downstream task. Sometimes you choose $k$ based on what gives the best performance on your actual task (classification accuracy, regression error). Cross-validation helps here.
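The variance-threshold rule takes only a few lines. A sketch, assuming `eigvals` holds the eigenvalues sorted largest first:

```python
import numpy as np

def choose_k(eigvals, threshold=0.95):
    """Smallest k whose top-k eigenvalues reach the variance threshold."""
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    ratios[-1] = 1.0  # guard against floating-point undershoot at 100%
    return int(np.searchsorted(ratios, threshold) + 1)

# On the worked example (eigenvalues 3 and 1): PC1 alone gives 75%,
# so the default 95% threshold requires both components
print(choose_k(np.array([3.0, 1.0])))        # 2
print(choose_k(np.array([3.0, 1.0]), 0.75))  # 1
```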

Before vs after dimensionality reduction

graph LR
  subgraph Before["Before PCA"]
      B1["Feature 1"]
      B2["Feature 2"]
      B3["Feature 3"]
      B4["..."]
      B5["Feature 100"]
  end
  subgraph After["After PCA"]
      A1["PC 1"]
      A2["PC 2"]
      A3["PC 3"]
  end
  Before --> PCA["PCA transform"]
  PCA --> After

PCA in Python

import numpy as np

X = np.array([[0, 0], [4, 2], [2, 4], [2, 2]], dtype=float)

# Center
mean = X.mean(axis=0)
X_centered = X - mean

# SVD
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Eigenvalues (variance explained)
eigenvalues = S**2 / len(X)
print("Eigenvalues:", eigenvalues)      # [3. 1.]
print("PC1 direction:", Vt[0])          # [0.707 0.707], up to a sign flip

# Project onto first component
Z = X_centered @ Vt[0]
print("Projections:", Z)                # [-2.83  1.41  1.41  0.], signs follow PC1

# Reconstruct
X_reconstructed = np.outer(Z, Vt[0]) + mean
print("Reconstruction error:", np.sum((X - X_reconstructed)**2))  # 4.0

When PCA helps (and when it doesn’t)

PCA works well when:

  • Features are correlated. PCA exploits correlations to find a compact representation.
  • You want to visualize high-dimensional data in 2D or 3D.
  • You need to reduce computation for a downstream model.
  • You want a preprocessing step before regularized regression or classification.

PCA does not help when:

  • Features are already independent and equally important. PCA will find no useful reduction.
  • The structure in the data is nonlinear. PCA finds linear projections. For nonlinear dimensionality reduction, look at t-SNE or UMAP.
  • You need interpretable features. Principal components are linear combinations of all original features, which can be hard to interpret.

Standardization matters

If your features have different scales (height in cm vs. weight in kg), the feature with the largest variance will dominate the first principal component. This is usually not what you want.

Solution: standardize each feature to have mean 0 and standard deviation 1 before applying PCA. This is equivalent to doing PCA on the correlation matrix instead of the covariance matrix.
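A sketch of the effect, using synthetic heights and weights (the specific means and spreads are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# Heights in cm (std ~10) and correlated weights in kg (std ~5)
height_cm = rng.normal(170, 10, size=500)
weight_kg = 0.4 * (height_cm - 170) + 70 + rng.normal(0, 3, size=500)
X = np.column_stack([height_cm, weight_kg])

def first_pc(X):
    """Top right singular vector of the centered data = first PC direction."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

# Raw features: height, with the largest variance, dominates PC1
print(first_pc(X))      # roughly [0.92 0.40] (up to sign)

# Standardized: both features contribute equally to PC1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(first_pc(X_std))  # roughly [0.71 0.71] (up to sign)
```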

Summary

| Concept | Key idea |
|---|---|
| Goal | Find directions of maximum variance |
| Math | Eigenvectors of covariance matrix |
| Eigenvalue $\lambda_k$ | Variance captured by the $k$-th component |
| SVD connection | More stable, avoids forming covariance matrix |
| Reconstruction error | $n \times$ sum of dropped eigenvalues |
| Choosing $k$ | Variance threshold, scree plot, or cross-validation |

What comes next

PCA gives you a deterministic way to reduce dimensions, but it assumes the data lies near a single linear subspace. What if your data comes from a mixture of several groups? In the next post on Gaussian mixture models and the EM algorithm, you will learn how to model data as a soft combination of multiple distributions.
