Unsupervised Learning - Finding Hidden Patterns
In this series (5 parts)
- Introduction to Machine Learning
- Supervised Learning - Learning from Labeled Data
- Regression - Predicting Continuous Values
- Classification - Predicting Categories
- Unsupervised Learning - Finding Hidden Patterns
So far in this series, we’ve worked with labeled data - every training example had a known answer. But what happens when you have data with no labels at all?
Welcome to unsupervised learning.
What is Unsupervised Learning?
Unsupervised learning is about finding hidden structure in data without being told what to look for. There are no labels, no “correct answers” - just raw data, and the model’s job is to discover patterns.
flowchart LR A["Unlabeled Data (no correct answers)"] --> B["Unsupervised Algorithm"] B --> C["Discovered Structure (clusters, patterns, reduced dimensions)"] style B fill:#e8f4fd,stroke:#1a5276,color:#1a5276 style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
Supervised vs Unsupervised
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Data | Labeled (input + output) | Unlabeled (input only) |
| Goal | Predict known outcomes | Discover hidden patterns |
| Feedback | Model knows if it’s right/wrong | No explicit feedback |
| Examples | Spam detection, price prediction | Customer segmentation, anomaly detection |
| Evaluation | Accuracy, F1, RMSE | Silhouette score, inertia, visual inspection |
Why Unsupervised Learning Matters
In the real world, most data is unlabeled. Labeling data is expensive, slow, and sometimes impossible:
- A hospital has millions of patient records but no one has tagged each record with disease labels
- An e-commerce site has billions of clickstreams but hasn’t categorized every user
- A security system has network logs but doesn’t know which are attacks
Unsupervised learning lets you extract value from raw, unlabeled data.
Types of Unsupervised Learning
flowchart TD UL["Unsupervised Learning"] --> CL["Clustering"] UL --> DR["Dimensionality Reduction"] UL --> AD["Anomaly Detection"] UL --> AR["Association Rules"] CL --> KM["K-Means"] CL --> HC["Hierarchical"] CL --> DB["DBSCAN"] DR --> PCA["PCA"] DR --> TSNE["t-SNE"] DR --> UMAP["UMAP"] AD --> IF["Isolation Forest"] AD --> AE["Autoencoders"] AR --> APR["Apriori"] style UL fill:#f3e8fd,stroke:#5b1a7a,color:#5b1a7a style CL fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38 style DR fill:#e8f4fd,stroke:#1a5276,color:#1a5276 style AD fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a style AR fill:#fde8e8,stroke:#7a1a1a,color:#7a1a1a
Real-World Applications
| Application | Technique | What It Discovers |
|---|---|---|
| Customer segmentation | Clustering | Groups of similar customers for targeted marketing |
| Fraud detection | Anomaly detection | Unusual transactions that deviate from normal patterns |
| Document organization | Topic modeling | Themes across thousands of documents |
| Gene expression analysis | Clustering | Groups of genes with similar behavior |
| Image compression | Dimensionality reduction | Compressed representation of images |
| Market basket analysis | Association rules | Products frequently bought together |
| Social network analysis | Clustering | Communities within a network |
| Data visualization | t-SNE / UMAP | 2D maps of high-dimensional data |
Clustering: Grouping Similar Data
Clustering is the most common unsupervised task. The goal: group data points so that points in the same group (cluster) are more similar to each other than to points in other groups.
K-Means Clustering
K-Means is the most popular clustering algorithm. Here's how it works:
flowchart TD
A["1. Choose K (number of clusters)"] --> B["2. Randomly place K cluster centers (centroids)"]
B --> C["3. Assign each point to nearest centroid"]
C --> D["4. Move each centroid to mean of its points"]
D --> E{"Centroids moved?"}
E -->|"Yes"| C
E -->|"No"| F["Done! Clusters found"]
style A fill:#e8f4fd,stroke:#1a5276,color:#1a5276
style C fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
style D fill:#fdf3e8,stroke:#7a4a1a,color:#7a4a1a
style F fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
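To make those four steps concrete, here's a minimal NumPy sketch of the assign/update loop. This is an illustration, not scikit-learn's implementation - the `kmeans` function name is ours, and it assumes no cluster ends up empty along the way.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch. X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Usage (hypothetical data matrix): labels, centroids = kmeans(data, k=3)
```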
Example: Customer Segmentation
An e-commerce company has customer data with no predefined groups:
| Customer | Annual Spend ($) | Visits/Month | Avg Order ($) | Items/Order |
|---|---|---|---|---|
| A | 12,000 | 20 | 600 | 8 |
| B | 800 | 2 | 400 | 3 |
| C | 15,000 | 25 | 600 | 10 |
| D | 500 | 1 | 500 | 2 |
| E | 3,000 | 8 | 375 | 5 |
| F | 11,000 | 18 | 611 | 7 |
| G | 2,500 | 6 | 417 | 4 |
| H | 600 | 2 | 300 | 2 |
After running K-Means with K=3:
| Cluster | Customers | Profile | Marketing Strategy |
|---|---|---|---|
| Premium | A, C, F | High spend, frequent visits | VIP rewards, early access |
| Regular | E, G | Medium spend, moderate visits | Loyalty programs, upsells |
| Occasional | B, D, H | Low spend, rare visits | Re-engagement campaigns |
No one told the algorithm these groups existed - it discovered them from the data.
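With scikit-learn, this whole exercise is a few lines. Here's a sketch using the customer table above - the scaling step and K=3 are our choices, and the cluster numbering (and possibly the exact grouping) you get may differ:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Customers A-H: [annual spend, visits/month, avg order, items/order]
X = np.array([
    [12000, 20, 600,  8],   # A
    [  800,  2, 400,  3],   # B
    [15000, 25, 600, 10],   # C
    [  500,  1, 500,  2],   # D
    [ 3000,  8, 375,  5],   # E
    [11000, 18, 611,  7],   # F
    [ 2500,  6, 417,  4],   # G
    [  600,  2, 300,  2],   # H
])

# Scale features so annual spend doesn't dominate the distance calculation
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)
print(labels)  # customers with the same label are in the same cluster
```

Customers that land in the same cluster share a label; you'd then profile each cluster (average spend, visit frequency, and so on) to name it Premium, Regular, or Occasional.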
Choosing K: The Elbow Method
How do you know the right number of clusters? Plot the inertia (the sum of squared distances from each point to its assigned centroid) for different values of K:
| K | Inertia | Change |
|---|---|---|
| 1 | 1000 | - |
| 2 | 500 | -500 |
| 3 | 250 | -250 |
| 4 | 200 | -50 |
| 5 | 180 | -20 |
| 6 | 170 | -10 |
The “elbow” is at K=3 - after that, adding more clusters gives diminishing returns.
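The inertia values above are illustrative, but computing them is just a short loop. Here's a sketch on synthetic data (the `make_blobs` dataset with 3 true centers is an assumption, chosen so the elbow is easy to see):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true clusters, just to make the elbow visible
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(f"K={k}: inertia={inertia:.0f}")
# Plot K against inertia (e.g. with matplotlib) and pick the K where the curve bends
```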
K-Means Limitations
| Limitation | Description |
|---|---|
| Must specify K | You need to choose the number of clusters in advance |
| Assumes spherical clusters | Struggles with elongated or irregular shapes |
| Sensitive to initialization | Different starting centroids → different results |
| Sensitive to outliers | Outliers can pull centroids away from true centers |
| Only finds convex clusters | Can’t discover non-convex shapes |
Hierarchical Clustering
Unlike K-Means, hierarchical clustering doesn’t require you to specify K upfront. It builds a tree (dendrogram) of clusters.
flowchart TD A["Agglomerative (Bottom-Up)"] --> A1["Start: each point is its own cluster"] A1 --> A2["Merge the two closest clusters"] A2 --> A3["Repeat until one cluster remains"] B["Divisive (Top-Down)"] --> B1["Start: all points in one cluster"] B1 --> B2["Split the least coherent cluster"] B2 --> B3["Repeat until each point is its own cluster"] style A fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38 style B fill:#e8f4fd,stroke:#1a5276,color:#1a5276
You choose the number of clusters by “cutting” the dendrogram at a desired height.
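Here's a minimal sketch with SciPy, again on synthetic data - the Ward linkage and the 3-cluster cut are illustrative choices, not the only options:

```python
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# Agglomerative (bottom-up): build the full merge tree with Ward linkage
Z = linkage(X, method="ward")

# "Cut" the dendrogram into 3 clusters instead of eyeballing a height
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# dendrogram(Z) draws the tree if matplotlib is available
```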
DBSCAN: Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters of arbitrary shape and automatically identifies outliers.
| Parameter | Description |
|---|---|
| ε (epsilon) | Maximum distance between two points to be neighbors |
| MinPts | Minimum points needed to form a dense region |
How DBSCAN Classifies Points
| Point Type | Definition |
|---|---|
| Core point | Has at least MinPts neighbors within ε |
| Border point | Within ε of a core point, but fewer than MinPts neighbors |
| Noise | Neither core nor border - an outlier |
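A small DBSCAN sketch with scikit-learn, using a tiny made-up 2D dataset so the noise label is easy to spot. The eps and min_samples values are chosen for this toy data; in practice you'd tune them:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one stray point that should be flagged as noise
X = np.array([
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2],
    [8.0, 8.0], [8.1, 8.2], [7.9, 8.1], [8.2, 7.9],
    [4.5, 0.0],   # far from both groups
])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # cluster indices per point; noise points get the label -1
```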
K-Means vs DBSCAN
| Aspect | K-Means | DBSCAN |
|---|---|---|
| Requires K? | Yes | No |
| Cluster shape | Spherical | Arbitrary |
| Handles noise? | No | Yes (labels as outliers) |
| Handles varying density? | No | Limited |
| Speed | Fast | Moderate |
Dimensionality Reduction
Real-world datasets often have dozens or hundreds of features. Dimensionality reduction compresses data to fewer dimensions while preserving important information.
Why Reduce Dimensions?
- Visualization: Plot high-dimensional data in 2D/3D
- Speed: Fewer features → faster training
- Noise reduction: Remove irrelevant features
- Avoid the curse of dimensionality: With too many features relative to the amount of data, models struggle to generalize
PCA (Principal Component Analysis)
PCA finds the directions of maximum variance in the data and projects the data onto them.
flowchart LR A["Original Data 100 features"] --> B["PCA"] B --> C["Reduced Data 2-3 features"] B --> D["Variance Explained: PC1: 45% PC2: 25% PC3: 15%"] style B fill:#e8f4fd,stroke:#1a5276,color:#1a5276 style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
PCA Example: Student Performance
Original features (6 dimensions):
| Student | Math | Physics | Chemistry | English | History | Art |
|---|---|---|---|---|---|---|
| Alice | 90 | 88 | 85 | 70 | 65 | 60 |
| Bob | 40 | 45 | 50 | 85 | 90 | 88 |
| Carol | 85 | 80 | 82 | 72 | 68 | 55 |
| Dave | 50 | 48 | 55 | 80 | 82 | 90 |
| Eve | 88 | 85 | 90 | 60 | 55 | 50 |
PCA might discover:
- PC1 = “STEM aptitude” (positive correlation with Math, Physics, Chemistry)
- PC2 = “Humanities aptitude” (positive correlation with English, History, Art)
6 dimensions compressed to 2 - and the essential structure is preserved.
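Here's what that looks like with scikit-learn's PCA on the table above. Note that the "STEM" and "humanities" labels are human interpretations - PCA only gives you numbers (the loadings in components_):

```python
import numpy as np
from sklearn.decomposition import PCA

# Students x subjects: Math, Physics, Chemistry, English, History, Art
scores = np.array([
    [90, 88, 85, 70, 65, 60],  # Alice
    [40, 45, 50, 85, 90, 88],  # Bob
    [85, 80, 82, 72, 68, 55],  # Carol
    [50, 48, 55, 80, 82, 90],  # Dave
    [88, 85, 90, 60, 55, 50],  # Eve
])

pca = PCA(n_components=2)
reduced = pca.fit_transform(scores)      # 5 students x 2 components
print(reduced.shape)                     # (5, 2)
print(pca.explained_variance_ratio_)     # share of variance captured by each PC
print(pca.components_)                   # how each subject loads on PC1 and PC2
```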
Anomaly Detection
Anomaly detection finds data points that differ significantly from the rest. It's critical for:
- Fraud detection: Unusual credit card transactions
- Network security: Abnormal traffic patterns
- Manufacturing: Defective products on assembly lines
- Health monitoring: Abnormal vital signs
How It Works
| Approach | Idea |
|---|---|
| Statistical | Points far from the mean (> 3σ) are anomalies |
| Clustering-based | Points that don’t belong to any cluster |
| Isolation Forest | Anomalies are easier to isolate (fewer splits needed) |
| Autoencoders | Train to reconstruct normal data; high reconstruction error → anomaly |
Example: Credit Card Fraud
| Transaction | Amount ($) | Time | Distance from Home (km) | Anomaly Score |
|---|---|---|---|---|
| Normal | 45 | 2:00 PM | 5 | 0.1 |
| Normal | 120 | 6:30 PM | 12 | 0.15 |
| Normal | 30 | 10:00 AM | 3 | 0.08 |
| Fraud | 2,500 | 3:15 AM | 800 | 0.95 |
| Normal | 80 | 1:00 PM | 8 | 0.12 |
The fraudulent transaction has an extremely high anomaly score - different from normal patterns in amount, time, and location.
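As a sketch, here's how you might score those transactions with an Isolation Forest in scikit-learn. The feature encoding (hour of day as a number) and the contamination value are assumptions, and its scores are on a different scale than the illustrative anomaly scores in the table:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Transactions: [amount in $, hour of day, distance from home in km]
X = np.array([
    [  45, 14.0,    5],
    [ 120, 18.5,   12],
    [  30, 10.0,    3],
    [2500,  3.25, 800],   # the suspicious one
    [  80, 13.0,    8],
])

# contamination is the expected fraction of anomalies (a guess here)
iso = IsolationForest(contamination=0.2, random_state=42).fit(X)
print(iso.predict(X))          # -1 = anomaly, 1 = normal
print(iso.score_samples(X))    # lower scores = more anomalous
```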
Evaluating Unsupervised Models
Without labels, evaluation is harder. Common approaches:
| Metric | What It Measures | Range |
|---|---|---|
| Silhouette Score | How similar points are to their own cluster vs other clusters | -1 to 1 (higher = better) |
| Inertia | Sum of squared distances within clusters | 0 to ∞ (lower = tighter) |
| Davies-Bouldin Index | Average similarity between clusters | 0 to ∞ (lower = better) |
| Visual inspection | Plot clusters and check if they make sense | Subjective |
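In scikit-learn, these metrics are single function calls. A small sketch on synthetic data (the well-separated 3-cluster blobs are an assumption, chosen so the scores come out clean):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(silhouette_score(X, km.labels_))       # closer to 1 = well-separated clusters
print(davies_bouldin_score(X, km.labels_))   # lower = better
print(km.inertia_)                           # within-cluster sum of squared distances
```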
Supervised vs Unsupervised: When to Use Which
| Scenario | Use | Why |
|---|---|---|
| Have labeled data, want predictions | Supervised | Known target to optimize for |
| No labels, want to find groups | Unsupervised | Let the data speak for itself |
| Too many features, need simplification | Unsupervised | Dimensionality reduction |
| Want to detect rare events | Unsupervised | Anomaly detection from normal patterns |
| Pre-processing for supervised model | Both | Cluster first, then classify |
Summary
| Concept | Key Takeaway |
|---|---|
| Unsupervised learning | Finds patterns in unlabeled data |
| Clustering | Groups similar data points together |
| K-Means | Fast, simple, but needs K specified upfront |
| DBSCAN | Finds clusters of any shape, handles outliers |
| PCA | Reduces dimensions while preserving variance |
| Anomaly detection | Identifies unusual data points |
| Evaluation | Silhouette score, inertia, visual inspection |
Series Complete!
flowchart LR A["✅ Intro to ML"] --> B["✅ Supervised Learning"] B --> C["✅ Regression"] C --> D["✅ Classification"] D --> E["✅ Unsupervised Learning"] style A fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38 style B fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38 style C fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38 style D fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38 style E fill:#e8f8f0,stroke:#1a5c38,color:#1a5c38
You now have a solid foundation in the core concepts of machine learning:
- What ML is and how it differs from traditional programming
- Supervised learning - learning from labeled data
- Regression - predicting continuous values with linear models
- Classification - predicting categories with logistic regression and beyond
- Unsupervised learning - discovering hidden patterns without labels
From here, you can dive deeper into specific algorithms, explore deep learning, or start building real models with Python and scikit-learn. The fundamentals you’ve built here will carry you through all of it.
Happy learning.