Dimensionality Reduction: PCA
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
Principal Component Analysis (PCA) takes high-dimensional data and finds a smaller set of axes that capture most of the variation. If your data has 100 features but only 5 underlying patterns, PCA can find those 5 directions. You get simpler models, faster training, and sometimes better generalization.
Prerequisites: You should be comfortable with eigenvalues and eigenvectors and matrix decompositions (especially SVD).
Too many columns: the real-world problem
Imagine you collect data on houses. Each house has ten features.
| House | Sq ft | Beds | Baths | Lot size | Year built | Garage spots | Floors | Basement | Pool | Dist to school |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1200 | 2 | 1 | 5000 | 1985 | 1 | 1 | No | No | 0.5 mi |
| 2 | 2400 | 4 | 3 | 8000 | 2010 | 2 | 2 | Yes | Yes | 1.2 mi |
| 3 | 1800 | 3 | 2 | 6000 | 1995 | 2 | 1 | Yes | No | 0.8 mi |
| 4 | 3200 | 5 | 4 | 12000 | 2020 | 3 | 2 | Yes | Yes | 2.0 mi |
| 5 | 1000 | 1 | 1 | 3000 | 1970 | 0 | 1 | No | No | 0.3 mi |
Ten features is already hard to visualize. Real datasets can have hundreds or thousands. Many of these features overlap: bigger houses tend to have more bedrooms, more bathrooms, bigger lots, and more garage spots. The information is redundant.
PCA finds which directions in your data carry the most information and keeps only those. It combines correlated features into a smaller set of new features (called principal components) that capture most of the variation in the data.
PCA projects data from many dimensions down to a few
graph LR
  A["10 original features"] --> B["PCA finds key directions"]
  B --> C["Keep top 2-3 components"]
  C --> D["95% of information retained"]
  D --> E["Simpler, faster models"]
Now let’s see the math behind this idea, starting with what “maximum variance” really means.
The core idea: maximize variance
Given data points $x_1, \ldots, x_n \in \mathbb{R}^d$, we want to find a direction $w$ (with $\|w\| = 1$) such that the projections of the data onto $w$ have maximum variance.
The projection of $x_i$ onto $w$ is $w^\top x_i$. The variance of these projections (assuming centered data with mean zero) is:
$$\frac{1}{n} \sum_{i=1}^{n} (w^\top x_i)^2 = w^\top C w$$
where $C = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top$ is the covariance matrix of the centered data.
We want to maximize $w^\top C w$ subject to $w^\top w = 1$. Using a Lagrangian:
$$\mathcal{L}(w, \lambda) = w^\top C w - \lambda (w^\top w - 1)$$
Setting the gradient to zero:
$$\nabla_w \mathcal{L} = 2 C w - 2 \lambda w = 0 \quad\Longrightarrow\quad C w = \lambda w$$
This is an eigenvalue equation. The direction that maximizes variance is the eigenvector of with the largest eigenvalue. The second principal component is the eigenvector with the second largest eigenvalue, and so on.
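As a quick numerical sanity check of that claim, here is a sketch that compares the top eigenvector against many random unit directions; the $2 \times 2$ covariance matrix is the one from the worked example below, and the helper `projected_variance` is our own, not a library function:

```python
import numpy as np

# Covariance matrix from the worked example below: C = [[2, 1], [1, 2]]
C = np.array([[2.0, 1.0], [1.0, 2.0]])

def projected_variance(w):
    """Variance of the projections along direction w, i.e. w^T C w for unit w."""
    w = w / np.linalg.norm(w)
    return w @ C @ w

# Top eigenvector of C (np.linalg.eigh returns eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(C)
w_best = eigvecs[:, -1]  # eigenvector for the largest eigenvalue

# No random direction should beat it
rng = np.random.default_rng(0)
random_dirs = rng.normal(size=(1000, 2))
assert all(projected_variance(w) <= projected_variance(w_best) + 1e-9
           for w in random_dirs)
print(projected_variance(w_best))  # ≈ 3.0, the largest eigenvalue
```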
Correlated data with first principal component direction
The PCA recipe
- Center the data: subtract the mean $\bar{x}$ from each point
- Compute the covariance matrix $C = \frac{1}{n} X^\top X$ (where $X$ is the centered data matrix, each row is a data point)
- Find the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and eigenvectors $v_1, \ldots, v_d$ of $C$
- Choose $k$ components (more on this below)
- Project data onto the top $k$ eigenvectors: $Z = X W$, where $W = [v_1 \mid \cdots \mid v_k] \in \mathbb{R}^{d \times k}$
PCA step by step
graph TD
  A["Raw data matrix X"] --> B["Center: subtract mean from each feature"]
  B --> C["Compute covariance matrix"]
  C --> D["Find eigenvalues and eigenvectors"]
  D --> E["Sort by eigenvalue, largest first"]
  E --> F["Pick top k eigenvectors"]
  F --> G["Project data onto k directions"]
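The steps above can be sketched directly in NumPy via the covariance-matrix eigendecomposition (the full code example later in this post takes the SVD route instead). The function name `pca_eig` is ours, purely for illustration:

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance matrix, step by step."""
    mean = X.mean(axis=0)
    Xc = X - mean                          # step 1: center
    C = Xc.T @ Xc / len(X)                 # step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # step 3: eigenpairs (ascending)
    order = np.argsort(eigvals)[::-1]      # sort largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                     # step 4: keep top-k eigenvectors
    Z = Xc @ W                             # step 5: project
    return Z, W, eigvals

# The four-point dataset from the example below
X = np.array([[0, 0], [4, 2], [2, 4], [2, 2]], dtype=float)
Z, W, eigvals = pca_eig(X, k=1)
print(eigvals)  # [3. 1.]
```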
Example 1: PCA on a 4-point 2D dataset
Consider four points in 2D:
| Point | $(x, y)$ |
|---|---|
| 1 | $(0, 0)$ |
| 2 | $(4, 2)$ |
| 3 | $(2, 4)$ |
| 4 | $(2, 2)$ |
Step 1: Center the data. The mean is $\bar{x} = (2, 2)$.
Centered data:
| Point | Centered $(x, y)$ |
|---|---|
| 1 | $(-2, -2)$ |
| 2 | $(2, 0)$ |
| 3 | $(0, 2)$ |
| 4 | $(0, 0)$ |
Step 2: Compute the covariance matrix.
$$C = \frac{1}{4} \sum_{i=1}^{4} \tilde{x}_i \tilde{x}_i^\top = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$
Step 3: Find eigenvalues.
$$\det(C - \lambda I) = (2 - \lambda)^2 - 1 = 0 \quad\Longrightarrow\quad \lambda_1 = 3, \quad \lambda_2 = 1$$
Step 4: Find eigenvectors.
For $\lambda_1 = 3$:
$$(C - 3I)v = 0 \quad\Longrightarrow\quad v_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}$$
For $\lambda_2 = 1$:
$$(C - I)v = 0 \quad\Longrightarrow\quad v_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}$$
Step 5: Variance explained.
$$\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{3}{3 + 1} = 75\%$$
The first principal component, the direction $v_1 = \frac{1}{\sqrt{2}}(1, 1)$, captures 75% of the total variance. This makes sense: the data spreads more along the diagonal than perpendicular to it.
Variance explained by each principal component
graph LR
  A["PC1: 75% of variance"] --> C["Total: 100%"]
  B["PC2: 25% of variance"] --> C
Example 2: projecting onto the first principal component
Using the results from Example 1, project each centered point onto $v_1 = \frac{1}{\sqrt{2}}(1, 1)$:
$$z_i = v_1^\top \tilde{x}_i: \quad z_1 = \frac{-2 - 2}{\sqrt{2}} = -2\sqrt{2}, \quad z_2 = \frac{2 + 0}{\sqrt{2}} = \sqrt{2}, \quad z_3 = \frac{0 + 2}{\sqrt{2}} = \sqrt{2}, \quad z_4 = 0$$
We reduced each 2D point to a single number. The 2D dataset is now a 1D dataset: $\{-2\sqrt{2}, \sqrt{2}, \sqrt{2}, 0\} \approx \{-2.83, 1.41, 1.41, 0\}$.
Reconstruction from 1D back to 2D
To reconstruct, multiply the 1D coordinate by $v_1$ and add back the mean:
$$\hat{x}_i = z_i v_1 + \bar{x}$$
| Point | Original | 1D projection $z_i$ | Reconstructed | Squared error |
|---|---|---|---|---|
| 1 | $(0, 0)$ | $-2\sqrt{2}$ | $(0, 0)$ | $0$ |
| 2 | $(4, 2)$ | $\sqrt{2}$ | $(3, 3)$ | $2$ |
| 3 | $(2, 4)$ | $\sqrt{2}$ | $(3, 3)$ | $2$ |
| 4 | $(2, 2)$ | $0$ | $(2, 2)$ | $0$ |
Points 1 and 4 are reconstructed exactly because they lie on the first principal component direction. Points 2 and 3 lose information in the perpendicular direction.
Total reconstruction error:
$$\sum_{i=1}^{4} \|x_i - \hat{x}_i\|^2 = 0 + 2 + 2 + 0 = 4$$
This equals $n \lambda_2 = 4 \times 1 = 4$. The reconstruction error from dropping component $j$ is always $n$ times the eigenvalue $\lambda_j$. That is exactly why we keep the components with the largest eigenvalues and drop the ones with the smallest.
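That identity is easy to verify numerically on the same four-point dataset: the squared reconstruction error from keeping only PC1 matches $n$ times the dropped eigenvalue.

```python
import numpy as np

X = np.array([[0, 0], [4, 2], [2, 4], [2, 2]], dtype=float)
mean = X.mean(axis=0)
Xc = X - mean

# Covariance matrix and its eigenpairs (eigh returns ascending order: [1., 3.])
C = Xc.T @ Xc / len(X)
eigvals, eigvecs = np.linalg.eigh(C)
v1 = eigvecs[:, -1]  # top component

# Keep only PC1, reconstruct, and measure the total squared error
Z = Xc @ v1
X_hat = np.outer(Z, v1) + mean
error = np.sum((X - X_hat) ** 2)

print(error)                # 4.0
print(len(X) * eigvals[0])  # 4.0 = n * (dropped eigenvalue)
```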
Connection to SVD
You do not actually need to compute the covariance matrix to do PCA. The singular value decomposition (SVD) of the centered data matrix gives you everything directly.
If $X = U \Sigma V^\top$ is the SVD of the centered data matrix, then:
$$X^\top X = V \Sigma^2 V^\top$$
Since $C = \frac{1}{n} X^\top X$, the columns of $V$ are the eigenvectors of $C$ and the eigenvalues are $\lambda_i = \sigma_i^2 / n$, where $\sigma_i$ are the singular values.
Why use SVD instead of eigendecomposition?
- Numerical stability: Computing squares the condition number, which amplifies floating-point errors. SVD avoids this.
- Efficiency for wide data: When $d \gg n$, the covariance matrix is a large $d \times d$ object. SVD works directly with the $n \times d$ data matrix.
- Already implemented: Every numerical library has a fast, robust SVD.
In practice, always use SVD for PCA. Never form the covariance matrix explicitly unless you are teaching someone how PCA works (like right now).
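A quick check that the two routes agree, on some arbitrary random data (the dataset here is a made-up example, not from the article):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the explicitly formed covariance matrix
C = Xc.T @ Xc / len(X)
eig_vals = np.sort(np.linalg.eigvalsh(C))[::-1]

# Route 2: SVD of the centered data matrix, no covariance matrix formed
S = np.linalg.svd(Xc, compute_uv=False)
svd_vals = S**2 / len(X)

print(np.allclose(eig_vals, svd_vals))  # True
```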
Choosing the number of components
The key question: how many principal components should you keep?
Variance explained threshold. Pick the smallest $k$ such that:
$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \ge 0.95$$
A common threshold is 95%. In our example, keeping just PC1 gives 75% of the variance. You would need both components for 100%.
Scree plot. Plot eigenvalues in decreasing order and look for an “elbow” where they drop off. Components after the elbow add little.
Downstream task. Sometimes you choose based on what gives the best performance on your actual task (classification accuracy, regression error). Cross-validation helps here.
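The variance-threshold rule is a one-liner with a cumulative sum. A minimal sketch (the helper name `choose_k` and the eigenvalue spectrum are hypothetical, for illustration):

```python
import numpy as np

def choose_k(eigenvalues, threshold=0.95):
    """Smallest k whose top components explain at least `threshold` of total variance."""
    ratios = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    return int(np.searchsorted(ratios, threshold) + 1)

# Hypothetical spectrum: most variance sits in the first three components
eigenvalues = np.array([5.0, 3.0, 1.5, 0.3, 0.2])
print(choose_k(eigenvalues))  # 3 (cumulative ratios: 0.50, 0.80, 0.95, 0.98, 1.00)
```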
Before vs after dimensionality reduction
graph LR
subgraph Before["Before PCA"]
B1["Feature 1"]
B2["Feature 2"]
B3["Feature 3"]
B4["..."]
B5["Feature 100"]
end
subgraph After["After PCA"]
A1["PC 1"]
A2["PC 2"]
A3["PC 3"]
end
Before --> PCA["PCA transform"]
PCA --> After
PCA in Python
```python
import numpy as np

X = np.array([[0, 0], [4, 2], [2, 4], [2, 2]], dtype=float)

# Center
mean = X.mean(axis=0)
X_centered = X - mean

# SVD of the centered data (no covariance matrix needed)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Eigenvalues of the covariance matrix (variance explained)
eigenvalues = S**2 / len(X)
print("Eigenvalues:", eigenvalues)  # [3. 1.]
print("PC1 direction:", Vt[0])      # [0.707 0.707] (sign may flip; the span is the same)

# Project onto first component
Z = X_centered @ Vt[0]
print("Projections:", Z)            # [-2.83 1.41 1.41 0.], up to sign

# Reconstruct from the 1D projection
X_reconstructed = np.outer(Z, Vt[0]) + mean
print("Reconstruction error:", np.sum((X - X_reconstructed)**2))  # 4.0
```
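In a real project you would typically reach for scikit-learn rather than hand-rolling the SVD. A minimal sketch, assuming scikit-learn is installed, on the same four points:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[0, 0], [4, 2], [2, 4], [2, 2]], dtype=float)

pca = PCA(n_components=1)
Z = pca.fit_transform(X)

print(pca.explained_variance_ratio_)  # [0.75], matching the hand computation
print(Z.ravel())                      # projections onto PC1 (sign may flip)
```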
When PCA helps (and when it doesn’t)
PCA works well when:
- Features are correlated. PCA exploits correlations to find a compact representation.
- You want to visualize high-dimensional data in 2D or 3D.
- You need to reduce computation for a downstream model.
- You want a preprocessing step before regularized regression or classification.
PCA does not help when:
- Features are already independent and equally important. PCA will find no useful reduction.
- The structure in the data is nonlinear. PCA finds linear projections. For nonlinear dimensionality reduction, look at t-SNE or UMAP.
- You need interpretable features. Principal components are linear combinations of all original features, which can be hard to interpret.
Standardization matters
If your features have different scales (height in cm vs. weight in kg), the feature with the largest variance will dominate the first principal component. This is usually not what you want.
Solution: standardize each feature to have mean 0 and standard deviation 1 before applying PCA. This is equivalent to doing PCA on the correlation matrix instead of the covariance matrix.
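Here is a small demonstration of why this matters, on made-up height/weight data (the numbers are illustrative assumptions): measuring height in meters gives it a tiny variance next to weight in kilograms, so PC1 collapses onto the weight axis until you standardize.

```python
import numpy as np

# Made-up data: height in meters (variance ~0.01) vs. weight in kg (variance ~100)
rng = np.random.default_rng(7)
heights = rng.normal(1.7, 0.1, size=100)
weights = rng.normal(70.0, 10.0, size=100)
X = np.column_stack([heights, weights])

# Without standardization, weight's large variance dominates PC1
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.abs(Vt[0]))    # PC1 points almost entirely along the weight axis

# Standardize each feature to mean 0, std 1 (PCA on the correlation matrix)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, Vt_s = np.linalg.svd(Xs, full_matrices=False)
print(np.abs(Vt_s[0]))  # now both features contribute equally
```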
Summary
| Concept | Key idea |
|---|---|
| Goal | Find directions of maximum variance |
| Math | Eigenvectors of covariance matrix |
| Eigenvalue $\lambda_i$ | Variance captured by the $i$-th component |
| SVD connection | More stable, avoids forming covariance matrix |
| Reconstruction error | $n$ times the sum of dropped eigenvalues |
| Choosing $k$ | Variance threshold, scree plot, or cross-validation |
What comes next
PCA gives you a deterministic way to reduce dimensions, but it assumes the data lies near a single linear subspace. What if your data comes from a mixture of several groups? In the next post on Gaussian mixture models and the EM algorithm, you will learn how to model data as a soft combination of multiple distributions.