
K-Means clustering

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

K-Means is the simplest clustering algorithm you will actually use in practice. Given $n$ data points and a number $K$, it partitions the data into $K$ groups so that each point belongs to the cluster with the nearest center. No labels required.

Prerequisites: You should understand what machine learning is and be comfortable with distance metrics (norms).

The algorithm

K-Means alternates between two steps:

  1. Assign each point to the nearest centroid
  2. Update each centroid to the mean of its assigned points

Repeat until the assignments stop changing.

More formally, given data points $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ and $K$ clusters:

Initialize centroids $c_1, c_2, \ldots, c_K$ (randomly or with K-Means++).

Repeat:

Assignment step: For each point $x_i$, assign it to the nearest centroid:

$$z_i = \arg\min_{k} \|x_i - c_k\|^2$$

Update step: Recompute each centroid as the mean of its cluster:

$$c_k = \frac{1}{|S_k|} \sum_{x_i \in S_k} x_i$$

where $S_k = \{x_i : z_i = k\}$ is the set of points assigned to cluster $k$.

Until assignments $z_i$ stop changing.

The objective K-Means minimizes is the within-cluster sum of squares (WCSS):

$$J = \sum_{k=1}^{K} \sum_{x_i \in S_k} \|x_i - c_k\|^2$$

Each iteration is guaranteed to decrease (or maintain) $J$, and since there are finitely many possible assignments, the algorithm always converges. However, it converges to a local minimum, not necessarily the global one. The result depends on the initial centroids.
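That monotone decrease of $J$ is easy to check empirically. Here is a minimal sketch in plain NumPy (the random data and the fixed iteration count are illustrative assumptions, not part of the algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                        # toy data: 200 points in 2D
K = 3
centroids = X[rng.choice(len(X), K, replace=False)]  # random initial centroids

history = []  # value of J after each assignment step
for _ in range(50):
    # assignment step: squared distance of every point to every centroid
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    history.append(d[np.arange(len(X)), labels].sum())
    # update step: move each centroid to the mean of its points
    # (keep the old centroid if a cluster happens to be empty)
    centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(K)
    ])
```

After running, `history` is a non-increasing sequence: each assignment step and each update step can only lower the objective.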

*Figure: K-Means clustering, before and after (K=3)*

Example 1: K-Means on 6 points, step by step

Consider 6 points in 2D with $K = 2$:

| Point | $x$ |
| --- | --- |
| $A$ | $(1, 1)$ |
| $B$ | $(2, 2)$ |
| $C$ | $(4, 3)$ |
| $D$ | $(6, 6)$ |
| $E$ | $(7, 7)$ |
| $F$ | $(8, 6)$ |

Initialize: Pick $c_1 = (1, 1)$ (point $A$) and $c_2 = (7, 7)$ (point $E$) as initial centroids.

Iteration 1: Assignment

Compute the squared Euclidean distance from each point to each centroid:

| Point | $\Vert x - c_1 \Vert^2$ | $\Vert x - c_2 \Vert^2$ | Cluster |
| --- | --- | --- | --- |
| $A = (1,1)$ | $(0)^2 + (0)^2 = 0$ | $(6)^2 + (6)^2 = 72$ | C1 |
| $B = (2,2)$ | $(1)^2 + (1)^2 = 2$ | $(5)^2 + (5)^2 = 50$ | C1 |
| $C = (4,3)$ | $(3)^2 + (2)^2 = 13$ | $(3)^2 + (4)^2 = 25$ | C1 |
| $D = (6,6)$ | $(5)^2 + (5)^2 = 50$ | $(1)^2 + (1)^2 = 2$ | C2 |
| $E = (7,7)$ | $(6)^2 + (6)^2 = 72$ | $(0)^2 + (0)^2 = 0$ | C2 |
| $F = (8,6)$ | $(7)^2 + (5)^2 = 74$ | $(1)^2 + (1)^2 = 2$ | C2 |

Cluster 1: $\{A, B, C\}$, Cluster 2: $\{D, E, F\}$

Iteration 1: Update

$$c_1 = \frac{(1,1) + (2,2) + (4,3)}{3} = \frac{(7, 6)}{3} \approx (2.33, 2.00)$$

$$c_2 = \frac{(6,6) + (7,7) + (8,6)}{3} = \frac{(21, 19)}{3} \approx (7.00, 6.33)$$

Iteration 2: Assignment

| Point | $\Vert x - c_1 \Vert^2$ | $\Vert x - c_2 \Vert^2$ | Cluster |
| --- | --- | --- | --- |
| $A = (1,1)$ | $(1.33)^2 + (1)^2 = 2.77$ | $(6)^2 + (5.33)^2 = 64.41$ | C1 |
| $B = (2,2)$ | $(0.33)^2 + (0)^2 = 0.11$ | $(5)^2 + (4.33)^2 = 43.75$ | C1 |
| $C = (4,3)$ | $(1.67)^2 + (1)^2 = 3.79$ | $(3)^2 + (3.33)^2 = 20.09$ | C1 |
| $D = (6,6)$ | $(3.67)^2 + (4)^2 = 29.47$ | $(1)^2 + (0.33)^2 = 1.11$ | C2 |
| $E = (7,7)$ | $(4.67)^2 + (5)^2 = 46.81$ | $(0)^2 + (0.67)^2 = 0.45$ | C2 |
| $F = (8,6)$ | $(5.67)^2 + (4)^2 = 48.15$ | $(1)^2 + (0.33)^2 = 1.11$ | C2 |

Assignments unchanged. Converged.

Final clusters:

  • Cluster 1: $\{A, B, C\}$ with centroid $(2.33, 2.00)$
  • Cluster 2: $\{D, E, F\}$ with centroid $(7.00, 6.33)$

Final WCSS:

$$J = (2.77 + 0.11 + 3.79) + (1.11 + 0.45 + 1.11) = 6.67 + 2.67 = 9.34$$
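The walkthrough above can be verified in a few lines of NumPy (a checking sketch, not part of the derivation):

```python
import numpy as np

X = np.array([[1, 1], [2, 2], [4, 3], [6, 6], [7, 7], [8, 6]], dtype=float)
centroids = X[[0, 4]]  # initial centroids: points A and E

for _ in range(2):  # the example converges after two iterations
    # assignment step: distance of every point to both centroids
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    # update step: each centroid becomes its cluster mean
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

wcss = ((X - centroids[labels]) ** 2).sum()
print(labels)              # [0 0 0 1 1 1]
print(centroids.round(2))  # rows ≈ (2.33, 2.00) and (7.00, 6.33)
print(wcss)                # ≈ 9.33
```

The exact objective is $28/3 \approx 9.33$; the 9.34 above comes from rounding each term to two decimals before summing.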

Initialization matters: K-Means++

Standard K-Means picks initial centroids at random, which can lead to poor results if two centroids land in the same dense region.

K-Means++ picks centroids that are spread out:

  1. Choose the first centroid $c_1$ uniformly at random from the data
  2. For each remaining centroid $c_k$:
    • Compute $D(x_i) = \min_{j < k} \|x_i - c_j\|^2$ for each point
    • Choose $c_k$ with probability proportional to $D(x_i)$
  3. Proceed with standard K-Means
  3. Proceed with standard K-Means

Points that are far from existing centroids are more likely to be selected. This simple change dramatically reduces the chance of bad initializations.

K-Means++ gives an expected WCSS that is at most $O(\log K)$ times the optimal WCSS. In practice, it almost always leads to a better final clustering than random initialization.
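To make the selection rule concrete: with the six example points from above, and assuming the first centroid landed on point $A$, the probabilities for the second centroid work out as follows (a sketch of step 2 only):

```python
import numpy as np

X = np.array([[1, 1], [2, 2], [4, 3], [6, 6], [7, 7], [8, 6]], dtype=float)
c1 = X[0]  # suppose the first centroid landed on point A

# D(x_i): squared distance to the nearest centroid chosen so far (only c1 here)
D = ((X - c1) ** 2).sum(axis=1)
probs = D / D.sum()  # selection probabilities for the second centroid
print(D)             # [ 0.  2. 13. 50. 72. 74.]
print(probs.round(3))
```

Point $A$ itself has probability 0, and the far points $D$, $E$, $F$ carry about 93% of the probability mass, so the second centroid almost certainly lands in the other natural group.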

Example 2: computing WCSS for different K (elbow method)

Using our 6 points, let’s compute WCSS for $K = 1, 2, 3$:

K = 1: All points in one cluster.

$$c = \frac{(1,1) + (2,2) + (4,3) + (6,6) + (7,7) + (8,6)}{6} = \frac{(28, 25)}{6} \approx (4.67, 4.17)$$

$$J = \|A - c\|^2 + \|B - c\|^2 + \|C - c\|^2 + \|D - c\|^2 + \|E - c\|^2 + \|F - c\|^2$$

Computing each term:

$$\|A - c\|^2 = (3.67)^2 + (3.17)^2 = 13.47 + 10.05 = 23.52$$
$$\|B - c\|^2 = (2.67)^2 + (2.17)^2 = 7.13 + 4.71 = 11.84$$
$$\|C - c\|^2 = (0.67)^2 + (1.17)^2 = 0.45 + 1.37 = 1.82$$
$$\|D - c\|^2 = (1.33)^2 + (1.83)^2 = 1.77 + 3.35 = 5.12$$
$$\|E - c\|^2 = (2.33)^2 + (2.83)^2 = 5.43 + 8.01 = 13.44$$
$$\|F - c\|^2 = (3.33)^2 + (1.83)^2 = 11.09 + 3.35 = 14.44$$

$$J_1 = 23.52 + 11.84 + 1.82 + 5.12 + 13.44 + 14.44 = 70.18$$

K = 2: From Example 1 we found $J_2 = 9.34$.

K = 3: Suppose K-Means finds clusters $\{A, B\}$, $\{C, D\}$, $\{E, F\}$:

$$c_1 = (1.5, 1.5), \quad c_2 = (5, 4.5), \quad c_3 = (7.5, 6.5)$$

$$J_3 = [\|A-c_1\|^2 + \|B-c_1\|^2] + [\|C-c_2\|^2 + \|D-c_2\|^2] + [\|E-c_3\|^2 + \|F-c_3\|^2]$$
$$= [0.5 + 0.5] + [3.25 + 3.25] + [0.5 + 0.5] = 1.0 + 6.5 + 1.0 = 8.5$$

| $K$ | WCSS ($J$) | Drop from previous |
| --- | --- | --- |
| 1 | 70.18 | |
| 2 | 9.34 | 60.84 |
| 3 | 8.50 | 0.84 |

The big drop happens from $K=1$ to $K=2$. After that, the improvement is tiny. The “elbow” is at $K=2$, which matches the natural grouping in the data.
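The table can be reproduced from the cluster memberships alone, since each optimal centroid is just its cluster mean (a verification sketch; the tiny differences from the table come from rounding intermediate terms to two decimals):

```python
import numpy as np

pts = {'A': (1, 1), 'B': (2, 2), 'C': (4, 3), 'D': (6, 6), 'E': (7, 7), 'F': (8, 6)}

def wcss(clusters):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for names in clusters:
        P = np.array([pts[n] for n in names], dtype=float)
        total += ((P - P.mean(axis=0)) ** 2).sum()
    return total

J1 = wcss([['A', 'B', 'C', 'D', 'E', 'F']])
J2 = wcss([['A', 'B', 'C'], ['D', 'E', 'F']])
J3 = wcss([['A', 'B'], ['C', 'D'], ['E', 'F']])
print(round(J1, 2), round(J2, 2), round(J3, 2))  # 70.17 9.33 8.5
```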

Choosing K: the elbow method

Plot WCSS against $K$. Look for the point where adding more clusters gives diminishing returns. That “elbow” in the curve suggests the right number of clusters.

```mermaid
graph LR
  A[Plot WCSS vs K] --> B[Find the elbow]
  B --> C[Pick K at the bend]
```

The elbow method is a heuristic, not a formal test. Sometimes the elbow is not obvious. Other approaches include:

  • Silhouette score: Measures how similar each point is to its own cluster versus other clusters. Ranges from $-1$ to $1$; higher is better.
  • Gap statistic: Compares the WCSS to what you would expect from uniformly distributed data. The optimal $K$ has the largest gap.
  • Domain knowledge: Often the most reliable guide. If you are segmenting customers, business context might tell you 3-5 segments make sense.
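As an illustration, here is the mean silhouette score for the $K=2$ clustering of the six example points, computed from the standard definition (a from-scratch sketch; a library routine such as scikit-learn’s `silhouette_score` computes the same quantity):

```python
import numpy as np

X = np.array([[1, 1], [2, 2], [4, 3], [6, 6], [7, 7], [8, 6]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

# pairwise Euclidean distances between all points
dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

scores = []
for i in range(len(X)):
    own = labels == labels[i]
    a = dist[i, own & (np.arange(len(X)) != i)].mean()  # cohesion: mean distance within own cluster
    b = dist[i, ~own].mean()                            # separation: mean distance to the other cluster
    scores.append((b - a) / max(a, b))
print(round(float(np.mean(scores)), 2))  # ≈ 0.67: the two clusters are well separated
```

With more than two clusters, $b$ would be the minimum of the mean distances to each *other* cluster rather than a single mean.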

Convergence properties

K-Means is guaranteed to converge because:

  1. The assignment step minimizes $J$ with respect to assignments (given fixed centroids)
  2. The update step minimizes $J$ with respect to centroids (given fixed assignments)
  3. $J$ is bounded below by 0
  4. There are finitely many possible assignment configurations

However, convergence can be slow. The worst case is exponential in the number of points, though this is rare in practice. Typically K-Means converges in a few dozen iterations.

The algorithm finds a local minimum. Running K-Means multiple times with different initializations and picking the result with the lowest WCSS is standard practice.

Limitations

K-Means makes strong assumptions:

  • Spherical clusters: It uses Euclidean distance, so it works best when clusters are roughly round and equally sized. Elongated or oddly shaped clusters will confuse it.
  • Fixed K: You must specify $K$ in advance. For data where the number of clusters is unknown, this is a real limitation.
  • Sensitivity to outliers: A single outlier can pull a centroid far from where it should be, since centroids are means.
  • Euclidean distance only: The centroid update is the correct minimizer only for squared Euclidean distance, and features on different scales will distort that distance. Always standardize your features first.
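The scaling issue in the last point is cheap to fix: z-score each feature so that all dimensions contribute on equal terms. A minimal sketch (the feature names are illustrative):

```python
import numpy as np

# two features on wildly different scales, e.g. age (years) and income (dollars)
X = np.array([[25, 40_000], [30, 42_000], [60, 41_000], [62, 90_000]], dtype=float)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score each column
print(X_std.std(axis=0))  # [1. 1.] — every feature now has unit variance
```

Without this step, the income column (spread of tens of thousands) would dominate every Euclidean distance and the age column would be effectively ignored.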

For non-spherical clusters, consider Gaussian mixture models (the next topic in this series). For clusters with varying density, look at DBSCAN.

Implementation

Here is K-Means, with K-Means++ initialization, in about 25 lines of Python:

```python
import numpy as np

def kmeans(X, K, max_iters=100):
    n = X.shape[0]
    # K-Means++ initialization
    centroids = [X[np.random.randint(n)]]
    for _ in range(1, K):
        dists = np.min([np.sum((X - c)**2, axis=1) for c in centroids], axis=0)
        probs = dists / dists.sum()
        centroids.append(X[np.random.choice(n, p=probs)])
    centroids = np.array(centroids)

    for _ in range(max_iters):
        # Assignment step
        dists = np.array([np.sum((X - c)**2, axis=1) for c in centroids])
        labels = np.argmin(dists, axis=0)
        # Update step (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Summary

| Concept | Key idea |
| --- | --- |
| Algorithm | Alternate between assigning points and updating centroids |
| K-Means++ | Smart initialization; pick centroids that are spread out |
| WCSS | Objective function: sum of squared distances to cluster centers |
| Elbow method | Plot WCSS vs. $K$; pick the bend |
| Convergence | Always converges to a local minimum |
| Limitations | Assumes spherical clusters, needs $K$ specified, sensitive to outliers |

What comes next

K-Means gives you hard cluster assignments: each point belongs to exactly one cluster. But what if a point is 70% cluster A and 30% cluster B? Before we get to soft clustering with Gaussian mixture models, we will first look at PCA for dimensionality reduction, which helps you visualize and compress high-dimensional data.
