K-Means clustering
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
K-Means is the simplest clustering algorithm you will actually use in practice. Given $n$ data points and a number $K$, it partitions the data into $K$ groups so that each point belongs to the cluster with the nearest center. No labels required.
Prerequisites: You should understand what machine learning is and be comfortable with distance metrics (norms).
The algorithm
K-Means alternates between two steps:
- Assign each point to the nearest centroid
- Update each centroid to the mean of its assigned points
Repeat until the assignments stop changing.
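The two-step loop can be sketched in a few lines of NumPy. The points and starting centroids below are made up for illustration:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])  # toy data
centroids = np.array([[1.0, 1.0], [9.0, 8.5]])                   # initial guesses

for _ in range(10):
    # Assignment step: index of the nearest centroid for each point
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    if np.allclose(new_centroids, centroids):  # nothing moved -> converged
        break
    centroids = new_centroids

print(labels)  # -> [0 0 1 1]
```

On this toy data the loop stabilizes after a single update, with the two left points in one cluster and the two right points in the other.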
More formally, given $n$ data points $x_1, \dots, x_n$ and $K$ clusters:

Initialize centroids $\mu_1, \dots, \mu_K$ (randomly or with K-Means++).

Repeat:

- Assignment step: For each point $x_i$, assign it to the nearest centroid:

$$c_i = \arg\min_{k \in \{1, \dots, K\}} \lVert x_i - \mu_k \rVert^2$$

- Update step: Recompute each centroid $\mu_k$ as the mean of its cluster:

$$\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i$$

where $C_k$ is the set of points assigned to cluster $k$.

Until assignments do not change.
The objective K-Means minimizes is the within-cluster sum of squares (WCSS):

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$
Each iteration is guaranteed to decrease (or maintain) $J$, and since there are finitely many possible assignments, the algorithm always converges. However, it converges to a local minimum, not necessarily the global one. The result depends on the initial centroids.
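The objective is easy to compute directly. A quick sketch, with made-up points and an arbitrary assignment:

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares for a given assignment."""
    return sum(((X[labels == k] - c) ** 2).sum()
               for k, c in enumerate(centroids))

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])  # toy data
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 0.0], [11.0, 0.0]])  # the two cluster means

print(wcss(X, labels, centroids))  # 1 + 1 + 1 + 1 = 4.0
```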
K-Means clustering: before and after (K=3)
Example 1: K-Means on 6 points, step by step
Consider 6 points $x_1, \dots, x_6$ in 2D with $K = 2$:

| Point | Coordinates |
|---|---|
Initialize: Pick two of the six points as the initial centroids $\mu_1$ and $\mu_2$.
Iteration 1: Assignment
Compute the squared Euclidean distance from each point to each centroid:
| Point | Cluster |
|---|---|
| $x_1$ | C1 |
| $x_2$ | C1 |
| $x_3$ | C1 |
| $x_4$ | C2 |
| $x_5$ | C2 |
| $x_6$ | C2 |
Cluster 1: $\{x_1, x_2, x_3\}$; Cluster 2: $\{x_4, x_5, x_6\}$.
Iteration 1: Update

Each centroid moves to the mean of its assigned points: $\mu_1$ becomes the mean of $x_1, x_2, x_3$, and $\mu_2$ the mean of $x_4, x_5, x_6$.
Iteration 2: Assignment
| Point | Cluster |
|---|---|
| $x_1$ | C1 |
| $x_2$ | C1 |
| $x_3$ | C1 |
| $x_4$ | C2 |
| $x_5$ | C2 |
| $x_6$ | C2 |
Assignments unchanged. Converged.
Final clusters:
- Cluster 1: $\{x_1, x_2, x_3\}$ with centroid $\mu_1$
- Cluster 2: $\{x_4, x_5, x_6\}$ with centroid $\mu_2$
Final WCSS: $J = 9.34$.
Initialization matters: K-Means++
Standard K-Means picks initial centroids at random, which can lead to poor results if two centroids land in the same dense region.
K-Means++ picks centroids that are spread out:
- Choose the first centroid uniformly at random from the data
- For each remaining centroid $\mu_2, \dots, \mu_K$:
  - Compute $D(x)$, the distance from each point $x$ to the nearest centroid chosen so far
  - Choose the next centroid to be point $x$ with probability proportional to $D(x)^2$
- Proceed with standard K-Means
Points that are far from existing centroids are more likely to be selected. This simple change dramatically reduces the chance of bad initializations.
K-Means++ gives an expected WCSS that is at most $O(\log K)$ times the optimal WCSS. In practice, it almost always leads to a better final clustering than random initialization.
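The $D(x)^2$ sampling step can be sketched as follows. `kmeanspp_init` is a hypothetical helper name, and the data and seed are arbitrary:

```python
import numpy as np

def kmeanspp_init(X, K, rng):
    """Pick K starting centroids spread out via D(x)^2 sampling."""
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    for _ in range(1, K):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()  # far-away points are more likely to be picked
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.5, 10.0]])
print(kmeanspp_init(X, 2, rng))
```

Because an already-chosen point has $D(x) = 0$, it can never be picked twice, and with this data the second centroid almost certainly lands in the opposite corner.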
Example 2: computing WCSS for different K (elbow method)
Using our 6 points, let’s compute WCSS for $K = 1, 2, 3$:
K = 1: All points form one cluster around the grand mean. Summing each point’s squared distance to that mean gives $J = 70.18$.
K = 2: From Example 1 we found $J = 9.34$.
K = 3: Suppose K-Means splits one of the two natural groups, producing three clusters with $J = 8.50$:
| $K$ | WCSS ($J$) | Drop from previous |
|---|---|---|
| 1 | 70.18 | — |
| 2 | 9.34 | 60.84 |
| 3 | 8.50 | 0.84 |
The big drop happens from $K = 1$ to $K = 2$. After that, the improvement is tiny. The “elbow” is at $K = 2$, which matches the natural grouping in the data.
Choosing K: the elbow method
Plot WCSS against $K$. Look for the point where adding more clusters gives diminishing returns. That “elbow” in the curve suggests the right number of clusters.
```mermaid
graph LR
    A[Plot WCSS vs K] --> B[Find the elbow]
    B --> C[Pick K at the bend]
```
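The elbow computation can be sketched with a minimal ad-hoc Lloyd’s loop (`lloyd` here is for demonstration only; the six toy points are made up and are not the ones from the worked example):

```python
import numpy as np

def lloyd(X, K, iters=50, seed=0):
    """Minimal K-Means with random init, for demonstration only."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(iters):
        d = ((X[:, None] - centroids[None]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                              else centroids[k] for k in range(K)])
    return labels, centroids

# Two tight groups of three points each
X = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]], dtype=float)
for K in (1, 2, 3):
    labels, centroids = lloyd(X, K)
    J = sum(((X[labels == k] - c) ** 2).sum() for k, c in enumerate(centroids))
    print(K, round(J, 2))  # WCSS falls sharply at K=2, barely after
```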
The elbow method is a heuristic, not a formal test. Sometimes the elbow is not obvious. Other approaches include:
- Silhouette score: Measures how similar each point is to its own cluster versus other clusters. Ranges from $-1$ to $1$; higher is better.
- Gap statistic: Compares the WCSS to what you would expect from uniformly distributed data. The optimal $K$ has the largest gap.
- Domain knowledge: Often the most reliable guide. If you are segmenting customers, business context might tell you 3-5 segments make sense.
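The silhouette score can be computed by hand in a few lines (scikit-learn’s `silhouette_score` does this for you); the four points and the two labelings below are made up:

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) per point."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(axis=2))  # pairwise distances
    scores = []
    for i in range(n):
        own = labels == labels[i]
        a = D[i, own & (np.arange(n) != i)].mean()  # cohesion: own cluster
        b = min(D[i, labels == k].mean()            # separation: nearest other
                for k in set(labels.tolist()) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
print(silhouette(X, np.array([0, 0, 1, 1])))  # near 1: tight, well separated
print(silhouette(X, np.array([0, 1, 0, 1])))  # a bad labeling scores negative
```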
Convergence properties
K-Means is guaranteed to converge because:
- The assignment step minimizes $J$ with respect to assignments (given fixed centroids)
- The update step minimizes $J$ with respect to centroids (given fixed assignments)
- $J$ is bounded below by 0
- There are finitely many possible assignment configurations
However, convergence can be slow. The worst case is exponential in the number of points, though this is rare in practice. Typically K-Means converges in a few dozen iterations.
The algorithm finds a local minimum. Running K-Means multiple times with different initializations and picking the result with the lowest WCSS is standard practice.
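The multiple-restarts practice looks like this in a sketch (`one_run` is a hypothetical minimal Lloyd’s loop, and the three-blob data is synthetic):

```python
import numpy as np

def one_run(X, K, rng, iters=50):
    """One Lloyd run from a random init; returns (WCSS, labels, centroids)."""
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(iters):
        d = ((X[:, None] - centroids[None]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                              else centroids[k] for k in range(K)])
    J = ((X - centroids[labels]) ** 2).sum()
    return J, labels, centroids

rng = np.random.default_rng(0)
# Three synthetic blobs of 20 points each
X = np.vstack([rng.normal(m, 0.5, size=(20, 2)) for m in ((0, 0), (5, 0), (0, 5))])
# Restart 20 times and keep the run with the lowest WCSS
best_J, best_labels, best_centroids = min(
    (one_run(X, 3, rng) for _ in range(20)), key=lambda run: run[0])
print("best WCSS:", round(best_J, 2))
```

A bad run that merges two blobs into one cluster has a far higher WCSS, so keeping the minimum over restarts reliably recovers the three-blob structure.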
Limitations
K-Means makes strong assumptions:
- Spherical clusters: It uses Euclidean distance, so it works best when clusters are roughly round and equally sized. Elongated or oddly shaped clusters will confuse it.
- Fixed K: You must specify $K$ in advance. For data where the number of clusters is unknown, this is a real limitation.
- Sensitivity to outliers: A single outlier can pull a centroid far from where it should be, since centroids are means.
- Euclidean distance only: Features on different scales will distort the clustering. Always standardize your features first.
For non-spherical clusters, consider Gaussian mixture models (the next topic in this series). For clusters with varying density, look at DBSCAN.
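Standardizing features before clustering, as the scaling point above advises, is a one-liner (toy numbers, with one feature in the thousands):

```python
import numpy as np

X = np.array([[1.0, 20000.0], [2.0, 21000.0], [9.0, 20500.0]])  # 2nd column dominates
Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score each feature column
print(Xs.std(axis=0))  # each column now has unit variance -> [1. 1.]
```

Without this step, Euclidean distances are dominated by whichever feature has the largest numeric range.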
Implementation
Here is K-Means in about 20 lines of Python:

```python
import numpy as np

def kmeans(X, K, max_iters=100):
    n = X.shape[0]
    # K-Means++ initialization: first centroid uniform, the rest D(x)^2-weighted
    centroids = [X[np.random.randint(n)]]
    for _ in range(1, K):
        dists = np.min([np.sum((X - c)**2, axis=1) for c in centroids], axis=0)
        probs = dists / dists.sum()
        centroids.append(X[np.random.choice(n, p=probs)])
    centroids = np.array(centroids)
    for _ in range(max_iters):
        # Assignment step: label each point with its nearest centroid
        dists = np.array([np.sum((X - c)**2, axis=1) for c in centroids])
        labels = np.argmin(dists, axis=0)
        # Update step: move each centroid to the mean of its cluster,
        # keeping the old centroid if a cluster ends up empty
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Call it as `labels, centroids = kmeans(X, 3)` on an array `X` of shape `(n_points, n_features)`; standardize `X` first if the features are on different scales.
Summary
| Concept | Key idea |
|---|---|
| Algorithm | Alternate between assigning points and updating centroids |
| K-Means++ | Smart initialization; pick centroids that are spread out |
| WCSS | Objective function: sum of squared distances to cluster centers |
| Elbow method | Plot WCSS vs K; pick the bend |
| Convergence | Always converges to a local minimum |
| Limitations | Assumes spherical clusters, needs K specified, sensitive to outliers |
What comes next
K-Means gives you hard cluster assignments: each point belongs to exactly one cluster. But what if a point is 70% cluster A and 30% cluster B? Before we get to soft clustering with Gaussian mixture models, we will first look at PCA for dimensionality reduction, which helps you visualize and compress high-dimensional data.