Unsupervised Learning - Finding Hidden Patterns
In this series (5 parts)
- Introduction to Machine Learning
- Supervised Learning - Learning from Labeled Data
- Regression - Predicting Continuous Values
- Classification - Predicting Categories
- Unsupervised Learning - Finding Hidden Patterns
So far in this series, we’ve worked with labeled data - every training example had a known answer. But what happens when you have data with no labels at all?
Welcome to unsupervised learning.
What is Unsupervised Learning?
Unsupervised learning is about finding hidden structure in data without being told what to look for. There are no labels, no “correct answers” - just raw data, and the model’s job is to discover patterns.
flowchart LR A["Unlabeled Data (no correct answers)"] --> B["Unsupervised Algorithm"] B --> C["Discovered Structure (clusters, patterns, reduced dimensions)"] style B fill:#e8f4fd,stroke:#1a5276,color:#1a5276 style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
Supervised vs Unsupervised
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Data | Labeled (input + output) | Unlabeled (input only) |
| Goal | Predict known outcomes | Discover hidden patterns |
| Feedback | Model knows if it’s right/wrong | No explicit feedback |
| Examples | Spam detection, price prediction | Customer segmentation, anomaly detection |
| Evaluation | Accuracy, F1, RMSE | Silhouette score, inertia, visual inspection |
Why Unsupervised Learning Matters
In the real world, most data is unlabeled. Labeling data is expensive, slow, and sometimes impossible:
- A hospital has millions of patient records but no one has tagged each record with disease labels
- An e-commerce site has billions of clickstreams but hasn’t categorized every user
- A security system has network logs but doesn’t know which are attacks
Unsupervised learning lets you extract value from raw, unlabeled data.
Types of Unsupervised Learning
flowchart TD UL["Unsupervised Learning"] --> CL["Clustering"] UL --> DR["Dimensionality Reduction"] UL --> AD["Anomaly Detection"] UL --> AR["Association Rules"] CL --> KM["K-Means"] CL --> HC["Hierarchical"] CL --> DB["DBSCAN"] DR --> PCA["PCA"] DR --> TSNE["t-SNE"] DR --> UMAP["UMAP"] AD --> IF["Isolation Forest"] AD --> AE["Autoencoders"] AR --> APR["Apriori"] style UL fill:#f3e8fd,stroke:#5b1a7a,color:#5b1a7a style CL fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38 style DR fill:#e8f4fd,stroke:#1a5276,color:#1a5276 style AD fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a style AR fill:#fde8e8,stroke:#7a1a1a,color:#7a1a1a
Real-World Applications
| Application | Technique | What It Discovers |
|---|---|---|
| Customer segmentation | Clustering | Groups of similar customers for targeted marketing |
| Fraud detection | Anomaly detection | Unusual transactions that deviate from normal patterns |
| Document organization | Topic modeling | Themes across thousands of documents |
| Gene expression analysis | Clustering | Groups of genes with similar behavior |
| Image compression | Dimensionality reduction | Compressed representation of images |
| Market basket analysis | Association rules | Products frequently bought together |
| Social network analysis | Clustering | Communities within a network |
| Data visualization | t-SNE / UMAP | 2D maps of high-dimensional data |
Clustering: Grouping Similar Data
Clustering is the most common unsupervised task. The goal: group data points so that points in the same group (cluster) are more similar to each other than to points in other groups.
K-Means Clustering
K-Means is the most popular clustering algorithm. Here's how it works:
flowchart TD
A["1. Choose K (number of clusters)"] --> B["2. Randomly place K cluster centers (centroids)"]
B --> C["3. Assign each point to nearest centroid"]
C --> D["4. Move each centroid to mean of its points"]
D --> E{"Centroids moved?"}
E -->|"Yes"| C
E -->|"No"| F["Done! Clusters found"]
style A fill:#e8f4fd,stroke:#1a5276,color:#1a5276
style C fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
style D fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
style F fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
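To make those four steps concrete, here's a minimal NumPy sketch of the assign/update loop. This is an illustration, not scikit-learn's implementation - the `kmeans` function name is ours, and it assumes no cluster ends up empty along the way.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch. X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Usage (hypothetical data matrix): labels, centroids = kmeans(data, k=3)
```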
Example: Customer Segmentation
An e-commerce company has customer data with no predefined groups:
| Customer | Annual Spend ($) | Visits/Month | Avg Order ($) | Items/Order |
|---|---|---|---|---|
| A | 12,000 | 20 | 600 | 8 |
| B | 800 | 2 | 400 | 3 |
| C | 15,000 | 25 | 600 | 10 |
| D | 500 | 1 | 500 | 2 |
| E | 3,000 | 8 | 375 | 5 |
| F | 11,000 | 18 | 611 | 7 |
| G | 2,500 | 6 | 417 | 4 |
| H | 600 | 2 | 300 | 2 |
After running K-Means with K=3:
| Cluster | Customers | Profile | Marketing Strategy |
|---|---|---|---|
| Premium | A, C, F | High spend, frequent visits | VIP rewards, early access |
| Regular | E, G | Medium spend, moderate visits | Loyalty programs, upsells |
| Occasional | B, D, H | Low spend, rare visits | Re-engagement campaigns |
No one told the algorithm these groups existed - it discovered them from the data.
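With scikit-learn, this whole exercise is a few lines. Here's a sketch using the customer table above - the scaling step and K=3 are our choices, and the cluster numbering (and possibly the exact grouping) you get may differ:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Customers A-H: [annual spend, visits/month, avg order, items/order]
X = np.array([
    [12000, 20, 600,  8],   # A
    [  800,  2, 400,  3],   # B
    [15000, 25, 600, 10],   # C
    [  500,  1, 500,  2],   # D
    [ 3000,  8, 375,  5],   # E
    [11000, 18, 611,  7],   # F
    [ 2500,  6, 417,  4],   # G
    [  600,  2, 300,  2],   # H
])

# Scale features so annual spend doesn't dominate the distance calculation
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)
print(labels)  # customers with the same label are in the same cluster
```

Customers that land in the same cluster share a label; you'd then profile each cluster (average spend, visit frequency, and so on) to name it Premium, Regular, or Occasional.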
Choosing K: The Elbow Method
How do you know the right number of clusters? Plot the inertia (the sum of squared distances from each point to its assigned centroid) for different values of K:
| K | Inertia | Change |
|---|---|---|
| 1 | 1000 | - |
| 2 | 500 | -500 |
| 3 | 250 | -250 |
| 4 | 200 | -50 |
| 5 | 180 | -20 |
| 6 | 170 | -10 |
The “elbow” is at K=3 - after that, adding more clusters gives diminishing returns.
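The inertia values above are illustrative, but computing them is just a short loop. Here's a sketch on synthetic data (the `make_blobs` dataset with 3 true centers is an assumption, chosen so the elbow is easy to see):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true clusters, just to make the elbow visible
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(f"K={k}: inertia={inertia:.0f}")
# Plot K against inertia (e.g. with matplotlib) and pick the K where the curve bends
```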
K-Means Limitations
| Limitation | Description |
|---|---|
| Must specify K | You need to choose the number of clusters in advance |
| Assumes spherical clusters | Struggles with elongated or irregular shapes |
| Sensitive to initialization | Different starting centroids → different results |
| Sensitive to outliers | Outliers can pull centroids away from true centers |
| Only finds convex clusters | Can’t discover non-convex shapes |
Hierarchical Clustering
Unlike K-Means, hierarchical clustering doesn’t require you to specify K upfront. It builds a tree (dendrogram) of clusters.
flowchart TD A["Agglomerative (Bottom-Up)"] --> A1["Start: each point is its own cluster"] A1 --> A2["Merge the two closest clusters"] A2 --> A3["Repeat until one cluster remains"] B["Divisive (Top-Down)"] --> B1["Start: all points in one cluster"] B1 --> B2["Split the least coherent cluster"] B2 --> B3["Repeat until each point is its own cluster"] style A fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38 style B fill:#e8f4fd,stroke:#1a5276,color:#1a5276
You choose the number of clusters by “cutting” the dendrogram at a desired height.
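Here's a minimal sketch with SciPy, again on synthetic data - the Ward linkage and the 3-cluster cut are illustrative choices, not the only options:

```python
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# Agglomerative (bottom-up): build the full merge tree with Ward linkage
Z = linkage(X, method="ward")

# "Cut" the dendrogram into 3 clusters instead of eyeballing a height
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# dendrogram(Z) draws the tree if matplotlib is available
```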
DBSCAN: Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters of arbitrary shape and automatically identifies outliers.
| Parameter | Description |
|---|---|
| ε (epsilon) | Maximum distance between two points to be neighbors |
| MinPts | Minimum points needed to form a dense region |
How DBSCAN Classifies Points
| Point Type | Definition |
|---|---|
| Core point | Has at least MinPts neighbors within ε |
| Border point | Within ε of a core point, but fewer than MinPts neighbors |
| Noise | Neither core nor border - an outlier |
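A small DBSCAN sketch with scikit-learn, using a tiny made-up 2D dataset so the noise label is easy to spot. The eps and min_samples values are chosen for this toy data; in practice you'd tune them:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one stray point that should be flagged as noise
X = np.array([
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2],
    [8.0, 8.0], [8.1, 8.2], [7.9, 8.1], [8.2, 7.9],
    [4.5, 0.0],   # far from both groups
])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # cluster indices per point; noise points get the label -1
```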
K-Means vs DBSCAN
| Aspect | K-Means | DBSCAN |
|---|---|---|
| Requires K? | Yes | No |
| Cluster shape | Spherical | Arbitrary |
| Handles noise? | No | Yes (labels as outliers) |
| Handles varying density? | No | Limited |
| Speed | Fast | Moderate |
Dimensionality Reduction
Real-world datasets often have dozens or hundreds of features. Dimensionality reduction compresses data to fewer dimensions while preserving important information.
Why Reduce Dimensions?
- Visualization: Plot high-dimensional data in 2D/3D
- Speed: Fewer features → faster training
- Noise reduction: Remove irrelevant features
- Avoid the curse of dimensionality: With too many features relative to the amount of data, models struggle to generalize
PCA (Principal Component Analysis)
PCA finds the directions of maximum variance in the data and projects the data onto them.
flowchart LR A["Original Data 100 features"] --> B["PCA"] B --> C["Reduced Data 2-3 features"] B --> D["Variance Explained: PC1: 45% PC2: 25% PC3: 15%"] style B fill:#e8f4fd,stroke:#1a5276,color:#1a5276 style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
PCA Example: Student Performance
Original features (6 dimensions):
| Student | Math | Physics | Chemistry | English | History | Art |
|---|---|---|---|---|---|---|
| Alice | 90 | 88 | 85 | 70 | 65 | 60 |
| Bob | 40 | 45 | 50 | 85 | 90 | 88 |
| Carol | 85 | 80 | 82 | 72 | 68 | 55 |
| Dave | 50 | 48 | 55 | 80 | 82 | 90 |
| Eve | 88 | 85 | 90 | 60 | 55 | 50 |
PCA might discover:
- PC1 = “STEM aptitude” (positive correlation with Math, Physics, Chemistry)
- PC2 = “Humanities aptitude” (positive correlation with English, History, Art)
6 dimensions compressed to 2 - and the essential structure is preserved.
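Here's what that looks like with scikit-learn's PCA on the table above. Note that the "STEM" and "humanities" labels are human interpretations - PCA only gives you numbers (the loadings in components_):

```python
import numpy as np
from sklearn.decomposition import PCA

# Students x subjects: Math, Physics, Chemistry, English, History, Art
scores = np.array([
    [90, 88, 85, 70, 65, 60],  # Alice
    [40, 45, 50, 85, 90, 88],  # Bob
    [85, 80, 82, 72, 68, 55],  # Carol
    [50, 48, 55, 80, 82, 90],  # Dave
    [88, 85, 90, 60, 55, 50],  # Eve
])

pca = PCA(n_components=2)
reduced = pca.fit_transform(scores)      # 5 students x 2 components
print(reduced.shape)                     # (5, 2)
print(pca.explained_variance_ratio_)     # share of variance captured by each PC
print(pca.components_)                   # how each subject loads on PC1 and PC2
```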
Anomaly Detection
Anomaly detection finds data points that differ significantly from the rest. It's critical for:
- Fraud detection: Unusual credit card transactions
- Network security: Abnormal traffic patterns
- Manufacturing: Defective products on assembly lines
- Health monitoring: Abnormal vital signs
How It Works
| Approach | Idea |
|---|---|
| Statistical | Points far from the mean (> 3σ) are anomalies |
| Clustering-based | Points that don’t belong to any cluster |
| Isolation Forest | Anomalies are easier to isolate (fewer splits needed) |
| Autoencoders | Train to reconstruct normal data; high reconstruction error → anomaly |
Example: Credit Card Fraud
| Transaction | Amount ($) | Time | Distance from Home (km) | Anomaly Score |
|---|---|---|---|---|
| Normal | 45 | 2:00 PM | 5 | 0.1 |
| Normal | 120 | 6:30 PM | 12 | 0.15 |
| Normal | 30 | 10:00 AM | 3 | 0.08 |
| Fraud | 2,500 | 3:15 AM | 800 | 0.95 |
| Normal | 80 | 1:00 PM | 8 | 0.12 |
The fraudulent transaction has an extremely high anomaly score - different from normal patterns in amount, time, and location.
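As a sketch, here's how you might score those transactions with an Isolation Forest in scikit-learn. The feature encoding (hour of day as a number) and the contamination value are assumptions, and its scores are on a different scale than the illustrative anomaly scores in the table:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Transactions: [amount in $, hour of day, distance from home in km]
X = np.array([
    [  45, 14.0,    5],
    [ 120, 18.5,   12],
    [  30, 10.0,    3],
    [2500,  3.25, 800],   # the suspicious one
    [  80, 13.0,    8],
])

# contamination is the expected fraction of anomalies (a guess here)
iso = IsolationForest(contamination=0.2, random_state=42).fit(X)
print(iso.predict(X))          # -1 = anomaly, 1 = normal
print(iso.score_samples(X))    # lower scores = more anomalous
```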
Evaluating Unsupervised Models
Without labels, evaluation is harder. Common approaches:
| Metric | What It Measures | Range |
|---|---|---|
| Silhouette Score | How similar points are to their own cluster vs other clusters | -1 to 1 (higher = better) |
| Inertia | Sum of squared distances within clusters | 0 to ∞ (lower = tighter) |
| Davies-Bouldin Index | Average similarity between clusters | 0 to ∞ (lower = better) |
| Visual inspection | Plot clusters and check if they make sense | Subjective |
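In scikit-learn, these metrics are single function calls. A small sketch on synthetic data (the well-separated 3-cluster blobs are an assumption, chosen so the scores come out clean):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(silhouette_score(X, km.labels_))       # closer to 1 = well-separated clusters
print(davies_bouldin_score(X, km.labels_))   # lower = better
print(km.inertia_)                           # within-cluster sum of squared distances
```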
Supervised vs Unsupervised: When to Use Which
| Scenario | Use | Why |
|---|---|---|
| Have labeled data, want predictions | Supervised | Known target to optimize for |
| No labels, want to find groups | Unsupervised | Let the data speak for itself |
| Too many features, need simplification | Unsupervised | Dimensionality reduction |
| Want to detect rare events | Unsupervised | Anomaly detection from normal patterns |
| Pre-processing for supervised model | Both | Cluster first, then classify |
Summary
| Concept | Key Takeaway |
|---|---|
| Unsupervised learning | Finds patterns in unlabeled data |
| Clustering | Groups similar data points together |
| K-Means | Fast, simple, but needs K specified upfront |
| DBSCAN | Finds clusters of any shape, handles outliers |
| PCA | Reduces dimensions while preserving variance |
| Anomaly detection | Identifies unusual data points |
| Evaluation | Silhouette score, inertia, visual inspection |
Series Complete!
flowchart LR A["✅ Intro to ML"] --> B["✅ Supervised Learning"] B --> C["✅ Regression"] C --> D["✅ Classification"] D --> E["✅ Unsupervised Learning"] style A fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38 style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38 style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38 style D fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38 style E fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
You now have a solid foundation in the core concepts of machine learning:
- What ML is and how it differs from traditional programming
- Supervised learning - learning from labeled data
- Regression - predicting continuous values with linear models
- Classification - predicting categories with logistic regression and beyond
- Unsupervised learning - discovering hidden patterns without labels
From here, you can dive deeper into specific algorithms, explore deep learning, or start building real models with Python and scikit-learn. The fundamentals you’ve built here will carry you through all of it.
Happy learning.