ML from Scratch · Part 5

Unsupervised Learning - Finding Hidden Patterns

In this series (5 parts)
  1. Introduction to Machine Learning
  2. Supervised Learning - Learning from Labeled Data
  3. Regression - Predicting Continuous Values
  4. Classification - Predicting Categories
  5. Unsupervised Learning - Finding Hidden Patterns

So far in this series, we’ve worked with labeled data - every training example had a known answer. But what happens when you have data with no labels at all?

Welcome to unsupervised learning.

What is Unsupervised Learning?

Unsupervised learning is about finding hidden structure in data without being told what to look for. There are no labels, no “correct answers” - just raw data, and the model’s job is to discover patterns.

flowchart LR
  A["Unlabeled Data
(no correct answers)"] --> B["Unsupervised
Algorithm"]
  B --> C["Discovered Structure
(clusters, patterns,
reduced dimensions)"]
  style B fill:#e8f4fd,stroke:#1a5276,color:#1a5276
  style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38

Supervised vs Unsupervised

| Aspect | Supervised | Unsupervised |
| --- | --- | --- |
| Data | Labeled (input + output) | Unlabeled (input only) |
| Goal | Predict known outcomes | Discover hidden patterns |
| Feedback | Model knows if it’s right/wrong | No explicit feedback |
| Examples | Spam detection, price prediction | Customer segmentation, anomaly detection |
| Evaluation | Accuracy, F1, RMSE | Silhouette score, inertia, visual inspection |

Why Unsupervised Learning Matters

In the real world, most data is unlabeled. Labeling data is expensive, slow, and sometimes impossible:

  • A hospital has millions of patient records but no one has tagged each record with disease labels
  • An e-commerce site has billions of clickstreams but hasn’t categorized every user
  • A security system has network logs but doesn’t know which are attacks

Unsupervised learning lets you extract value from raw, unlabeled data.

Types of Unsupervised Learning

flowchart TD
  UL["Unsupervised Learning"] --> CL["Clustering"]
  UL --> DR["Dimensionality Reduction"]
  UL --> AD["Anomaly Detection"]
  UL --> AR["Association Rules"]
  CL --> KM["K-Means"]
  CL --> HC["Hierarchical"]
  CL --> DB["DBSCAN"]
  DR --> PCA["PCA"]
  DR --> TSNE["t-SNE"]
  DR --> UMAP["UMAP"]
  AD --> IF["Isolation Forest"]
  AD --> AE["Autoencoders"]
  AR --> APR["Apriori"]
  style UL fill:#f3e8fd,stroke:#5b1a7a,color:#5b1a7a
  style CL fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style DR fill:#e8f4fd,stroke:#1a5276,color:#1a5276
  style AD fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
  style AR fill:#fde8e8,stroke:#7a1a1a,color:#7a1a1a

Real-World Applications

| Application | Technique | What It Discovers |
| --- | --- | --- |
| Customer segmentation | Clustering | Groups of similar customers for targeted marketing |
| Fraud detection | Anomaly detection | Unusual transactions that deviate from normal patterns |
| Document organization | Topic modeling | Themes across thousands of documents |
| Gene expression analysis | Clustering | Groups of genes with similar behavior |
| Image compression | Dimensionality reduction | Compressed representation of images |
| Market basket analysis | Association rules | Products frequently bought together |
| Social network analysis | Clustering | Communities within a network |
| Data visualization | t-SNE / UMAP | 2D maps of high-dimensional data |

Clustering: Grouping Similar Data

Clustering is the most common unsupervised task. The goal: group data points so that points in the same group (cluster) are more similar to each other than to points in other groups.

K-Means Clustering

The most popular clustering algorithm. Here’s how it works:

flowchart TD
  A["1. Choose K
(number of clusters)"] --> B["2. Randomly place K
cluster centers (centroids)"]
  B --> C["3. Assign each point
to nearest centroid"]
  C --> D["4. Move each centroid
to mean of its points"]
  D --> E{"Centroids
moved?"}
  E -->|"Yes"| C
  E -->|"No"| F["Done! Clusters found"]
  style A fill:#e8f4fd,stroke:#1a5276,color:#1a5276
  style C fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
  style D fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
  style F fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
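The loop above can be sketched in a few lines of NumPy. This is an illustrative toy, not a library-grade implementation (real K-Means implementations use smarter initialization such as k-means++, which scikit-learn applies by default):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch following the steps in the diagram above."""
    rng = np.random.default_rng(seed)
    # 2. Pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 3. Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4. Move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster went empty
                new_centroids[j] = members.mean(axis=0)
        # 5. Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated blobs this converges in a handful of iterations; on harder data, different seeds can land in different local optima, which is why libraries restart from several initializations.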

Example: Customer Segmentation

An e-commerce company has customer data with no predefined groups:

| Customer | Annual Spend ($) | Visits/Month | Avg Order ($) | Items/Order |
| --- | --- | --- | --- | --- |
| A | 12,000 | 20 | 600 | 8 |
| B | 800 | 2 | 400 | 3 |
| C | 15,000 | 25 | 600 | 10 |
| D | 500 | 1 | 500 | 2 |
| E | 3,000 | 8 | 375 | 5 |
| F | 11,000 | 18 | 611 | 7 |
| G | 2,500 | 6 | 417 | 4 |
| H | 600 | 2 | 300 | 2 |

After running K-Means with K=3:

| Cluster | Customers | Profile | Marketing Strategy |
| --- | --- | --- | --- |
| Premium | A, C, F | High spend, frequent visits | VIP rewards, early access |
| Regular | E, G | Medium spend, moderate visits | Loyalty programs, upsells |
| Occasional | B, D, H | Low spend, rare visits | Re-engagement campaigns |

No one told the algorithm these groups existed - it discovered them from the data.

Choosing K: The Elbow Method

How do you know the right number of clusters? Plot the inertia (the sum of squared distances from each point to its nearest centroid) for different values of K:

| K | Inertia | Change |
| --- | --- | --- |
| 1 | 1000 | - |
| 2 | 500 | -500 |
| 3 | 250 | -250 |
| 4 | 200 | -50 |
| 5 | 180 | -20 |
| 6 | 170 | -10 |

The “elbow” is at K=3 - after that, adding more clusters gives diminishing returns.
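The numbers in the table are illustrative. As a sketch, here is how you could compute an inertia curve yourself with scikit-learn, on synthetic data with three well-separated blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated blobs of 50 points each
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in [(0, 0), (5, 5), (10, 0)]])

# Inertia for K = 1..6; the curve drops steeply until the true K, then flattens
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)}
```

Plotting `inertias` shows the elbow at K=3: the drop from K=2 to K=3 is large, while the drop from K=3 to K=4 is small.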

K-Means Limitations

| Limitation | Description |
| --- | --- |
| Must specify K | You need to choose the number of clusters in advance |
| Assumes spherical clusters | Struggles with elongated or irregular shapes |
| Sensitive to initialization | Different starting centroids → different results |
| Sensitive to outliers | Outliers can pull centroids away from true centers |
| Only finds convex clusters | Can’t discover non-convex shapes |

Hierarchical Clustering

Unlike K-Means, hierarchical clustering doesn’t require you to specify K upfront. It builds a tree (dendrogram) of clusters.

flowchart TD
  A["Agglomerative
(Bottom-Up)"] --> A1["Start: each point is
its own cluster"]
  A1 --> A2["Merge the two closest
clusters"]
  A2 --> A3["Repeat until one
cluster remains"]
  B["Divisive
(Top-Down)"] --> B1["Start: all points in
one cluster"]
  B1 --> B2["Split the least
coherent cluster"]
  B2 --> B3["Repeat until each point
is its own cluster"]
  style A fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style B fill:#e8f4fd,stroke:#1a5276,color:#1a5276

You choose the number of clusters by “cutting” the dendrogram at a desired height.
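As a minimal sketch with SciPy (the data here is synthetic): `linkage` builds the full agglomerative dendrogram bottom-up, and `fcluster` performs the "cut":

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two tight groups of points
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(5, 0.3, (10, 2))])

# Agglomerative (bottom-up) clustering: Z records every pairwise merge
Z = linkage(X, method="ward")

# "Cut" the dendrogram so that exactly 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
```

With `criterion="maxclust"` you cut by cluster count; passing `criterion="distance"` instead cuts at a height on the dendrogram, which is closer to the visual intuition above.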

DBSCAN: Density-Based Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters of arbitrary shape and automatically identifies outliers.

| Parameter | Description |
| --- | --- |
| ε (epsilon) | Maximum distance between two points to be neighbors |
| MinPts | Minimum points needed to form a dense region |

How DBSCAN Classifies Points

| Point Type | Definition |
| --- | --- |
| Core point | Has at least MinPts neighbors within ε |
| Border point | Within ε of a core point, but fewer than MinPts neighbors |
| Noise | Neither core nor border - an outlier |
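A small sketch of DBSCAN in action with scikit-learn, on synthetic data with a planted outlier (ε and MinPts map to the `eps` and `min_samples` parameters):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus one point far from everything
X = np.vstack([
    rng.normal(0, 0.2, (30, 2)),
    rng.normal(4, 0.2, (30, 2)),
    [[20.0, 20.0]],              # an obvious outlier
])

# eps and min_samples correspond to ε and MinPts above
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
```

DBSCAN finds the two blobs without being told K, and labels the far-away point `-1` (noise).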

K-Means vs DBSCAN

| Aspect | K-Means | DBSCAN |
| --- | --- | --- |
| Requires K? | Yes | No |
| Cluster shape | Spherical | Arbitrary |
| Handles noise? | No | Yes (labels as outliers) |
| Handles varying density? | No | Limited |
| Speed | Fast | Moderate |

Dimensionality Reduction

Real-world datasets often have dozens or hundreds of features. Dimensionality reduction compresses data to fewer dimensions while preserving important information.

Why Reduce Dimensions?

  • Visualization: Plot high-dimensional data in 2D/3D
  • Speed: Fewer features → faster training
  • Noise reduction: Remove irrelevant features
  • Avoid curse of dimensionality: Models perform worse with too many features

PCA (Principal Component Analysis)

PCA finds the directions of maximum variance in the data and projects the data onto those directions.

flowchart LR
  A["Original Data
100 features"] --> B["PCA"]
  B --> C["Reduced Data
2-3 features"]
  B --> D["Variance Explained:
PC1: 45%
PC2: 25%
PC3: 15%"]
  style B fill:#e8f4fd,stroke:#1a5276,color:#1a5276
  style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38

PCA Example: Student Performance

Original features (6 dimensions):

| Student | Math | Physics | Chemistry | English | History | Art |
| --- | --- | --- | --- | --- | --- | --- |
| Alice | 90 | 88 | 85 | 70 | 65 | 60 |
| Bob | 40 | 45 | 50 | 85 | 90 | 88 |
| Carol | 85 | 80 | 82 | 72 | 68 | 55 |
| Dave | 50 | 48 | 55 | 80 | 82 | 90 |
| Eve | 88 | 85 | 90 | 60 | 55 | 50 |

PCA might discover:

  • PC1 = “STEM aptitude” (positive correlation with Math, Physics, Chemistry)
  • PC2 = “Humanities aptitude” (positive correlation with English, History, Art)

6 dimensions compressed to 2 - and the essential structure is preserved.
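A quick sketch with scikit-learn’s PCA, using the scores from the table. In a dataset this small, STEM and humanities scores are strongly anti-correlated, so a single component captures most of that contrast (component signs are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

# Scores from the table: Math, Physics, Chemistry, English, History, Art
X = np.array([
    [90, 88, 85, 70, 65, 60],  # Alice
    [40, 45, 50, 85, 90, 88],  # Bob
    [85, 80, 82, 72, 68, 55],  # Carol
    [50, 48, 55, 80, 82, 90],  # Dave
    [88, 85, 90, 60, 55, 50],  # Eve
], dtype=float)

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)               # 6 dimensions -> 2
ratio = pca.explained_variance_ratio_   # fraction of variance per component
```

On the first component, the STEM-leaning students (Alice, Carol, Eve) land on one side and the humanities-leaning students (Bob, Dave) on the other.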

Anomaly Detection

Anomaly detection finds data points that are significantly different from the rest. It’s critical for:

  • Fraud detection: Unusual credit card transactions
  • Network security: Abnormal traffic patterns
  • Manufacturing: Defective products on assembly lines
  • Health monitoring: Abnormal vital signs

How It Works

| Approach | Idea |
| --- | --- |
| Statistical | Points far from the mean (> 3σ) are anomalies |
| Clustering-based | Points that don’t belong to any cluster |
| Isolation Forest | Anomalies are easier to isolate (fewer splits needed) |
| Autoencoders | Train to reconstruct normal data; high reconstruction error → anomaly |
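As a sketch of the Isolation Forest approach, on simulated transactions with one planted fraud (the feature distributions here are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Simulated normal transactions: amount ($), hour of day, distance from home (km)
normal = np.column_stack([
    rng.normal(60, 20, 200),   # typical amounts
    rng.normal(14, 3, 200),    # daytime hours
    rng.normal(8, 4, 200),     # close to home
])
fraud = np.array([[2500.0, 3.0, 800.0]])  # large amount, 3 AM, far from home
X = np.vstack([normal, fraud])

iso = IsolationForest(random_state=0).fit(X)
scores = iso.score_samples(X)  # lower score = easier to isolate = more anomalous
```

The planted fraud gets the lowest score of all 201 transactions: random splits isolate it almost immediately, exactly the intuition in the table above.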

Example: Credit Card Fraud

| Transaction | Amount ($) | Time | Distance from Home (km) | Anomaly Score |
| --- | --- | --- | --- | --- |
| Normal | 45 | 2:00 PM | 5 | 0.1 |
| Normal | 120 | 6:30 PM | 12 | 0.15 |
| Normal | 30 | 10:00 AM | 3 | 0.08 |
| Fraud | 2,500 | 3:15 AM | 800 | 0.95 |
| Normal | 80 | 1:00 PM | 8 | 0.12 |

The fraudulent transaction has an extremely high anomaly score - different from normal patterns in amount, time, and location.

Evaluating Unsupervised Models

Without labels, evaluation is harder. Common approaches:

| Metric | What It Measures | Range |
| --- | --- | --- |
| Silhouette Score | How similar points are to their own cluster vs other clusters | -1 to 1 (higher = better) |
| Inertia | Sum of squared distances within clusters | 0 to ∞ (lower = tighter) |
| Davies-Bouldin Index | Average similarity between clusters | 0 to ∞ (lower = better) |
| Visual inspection | Plot clusters and check if they make sense | Subjective |
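A short sketch of using the silhouette score to compare candidate values of K, on synthetic data where the true number of clusters is 3:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three separated blobs; pretend we don't know the right K
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in [(0, 0), (5, 0), (0, 5)]])

# Score each candidate K; the true structure should score highest
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
```

K=3 wins: at K=2 two blobs are forced into one cluster, and at K=4 one blob is split in half, both of which drag the average silhouette down.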

Supervised vs Unsupervised: When to Use Which

| Scenario | Use | Why |
| --- | --- | --- |
| Have labeled data, want predictions | Supervised | Known target to optimize for |
| No labels, want to find groups | Unsupervised | Let the data speak for itself |
| Too many features, need simplification | Unsupervised | Dimensionality reduction |
| Want to detect rare events | Unsupervised | Anomaly detection from normal patterns |
| Pre-processing for supervised model | Both | Cluster first, then classify |

Summary

| Concept | Key Takeaway |
| --- | --- |
| Unsupervised learning | Finds patterns in unlabeled data |
| Clustering | Groups similar data points together |
| K-Means | Fast, simple, but needs K specified upfront |
| DBSCAN | Finds clusters of any shape, handles outliers |
| PCA | Reduces dimensions while preserving variance |
| Anomaly detection | Identifies unusual data points |
| Evaluation | Silhouette score, inertia, visual inspection |

Series Complete!

flowchart LR
  A["✅ Intro to ML"] --> B["✅ Supervised Learning"]
  B --> C["✅ Regression"]
  C --> D["✅ Classification"]
  D --> E["✅ Unsupervised Learning"]
  style A fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style D fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
  style E fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38

You now have a solid foundation in the core concepts of machine learning:

  1. What ML is and how it differs from traditional programming
  2. Supervised learning - learning from labeled data
  3. Regression - predicting continuous values with linear models
  4. Classification - predicting categories with logistic regression and beyond
  5. Unsupervised learning - discovering hidden patterns without labels

From here, you can dive deeper into specific algorithms, explore deep learning, or start building real models with Python and scikit-learn. The fundamentals you’ve built here will carry you through all of it.

Happy learning.
