
K-Means clustering

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

K-Means is the simplest clustering algorithm you will actually use in practice. Given $n$ data points and a number $K$, it partitions the data into $K$ groups so that each point belongs to the cluster with the nearest center. No labels required.

Prerequisites: You should understand what machine learning is and be comfortable with distance metrics (norms).

The algorithm

K-Means alternates between two steps:

  1. Assign each point to the nearest centroid
  2. Update each centroid to the mean of its assigned points

Repeat until the assignments stop changing.

More formally, given data points $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ and $K$ clusters:

Initialize centroids $c_1, c_2, \ldots, c_K$ (randomly or with K-Means++).

Repeat:

Assignment step: For each point $x_i$, assign it to the nearest centroid:

$$z_i = \arg\min_{k} \|x_i - c_k\|^2$$

Update step: Recompute each centroid as the mean of its cluster:

$$c_k = \frac{1}{|S_k|} \sum_{x_i \in S_k} x_i$$

where $S_k = \{x_i : z_i = k\}$ is the set of points assigned to cluster $k$.

Until assignments $z_i$ stop changing.

The objective K-Means minimizes is the within-cluster sum of squares (WCSS):

$$J = \sum_{k=1}^{K} \sum_{x_i \in S_k} \|x_i - c_k\|^2$$

Each iteration is guaranteed to decrease (or maintain) $J$, and since there are finitely many possible assignments, the algorithm always converges. However, it converges to a local minimum, not necessarily the global one. The result depends on the initial centroids.
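That monotone decrease of $J$ is easy to check empirically. Here is a minimal sketch in plain NumPy (the random data and the fixed iteration count are illustrative assumptions, not part of the algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                        # toy data: 200 points in 2D
K = 3
centroids = X[rng.choice(len(X), K, replace=False)]  # random initial centroids

history = []  # value of J after each assignment step
for _ in range(50):
    # assignment step: squared distance of every point to every centroid
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    history.append(d[np.arange(len(X)), labels].sum())
    # update step: move each centroid to the mean of its points
    # (keep the old centroid if a cluster happens to be empty)
    centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(K)
    ])
```

After running, `history` is a non-increasing sequence: each assignment step and each update step can only lower the objective.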

*Figure: K-Means clustering, before and after (K=3)*

Example 1: K-Means on 6 points, step by step

Consider 6 points in 2D with $K = 2$:

| Point | $x$ |
| --- | --- |
| $A$ | $(1, 1)$ |
| $B$ | $(2, 2)$ |
| $C$ | $(4, 3)$ |
| $D$ | $(6, 6)$ |
| $E$ | $(7, 7)$ |
| $F$ | $(8, 6)$ |

Initialize: Pick $c_1 = (1, 1)$ (point $A$) and $c_2 = (7, 7)$ (point $E$) as initial centroids.

Iteration 1: Assignment

Compute the squared Euclidean distance from each point to each centroid:

| Point | $\Vert x - c_1 \Vert^2$ | $\Vert x - c_2 \Vert^2$ | Cluster |
| --- | --- | --- | --- |
| $A = (1,1)$ | $(0)^2 + (0)^2 = 0$ | $(6)^2 + (6)^2 = 72$ | C1 |
| $B = (2,2)$ | $(1)^2 + (1)^2 = 2$ | $(5)^2 + (5)^2 = 50$ | C1 |
| $C = (4,3)$ | $(3)^2 + (2)^2 = 13$ | $(3)^2 + (4)^2 = 25$ | C1 |
| $D = (6,6)$ | $(5)^2 + (5)^2 = 50$ | $(1)^2 + (1)^2 = 2$ | C2 |
| $E = (7,7)$ | $(6)^2 + (6)^2 = 72$ | $(0)^2 + (0)^2 = 0$ | C2 |
| $F = (8,6)$ | $(7)^2 + (5)^2 = 74$ | $(1)^2 + (1)^2 = 2$ | C2 |

Cluster 1: $\{A, B, C\}$, Cluster 2: $\{D, E, F\}$

Iteration 1: Update

$$c_1 = \frac{(1,1) + (2,2) + (4,3)}{3} = \frac{(7, 6)}{3} \approx (2.33, 2.00)$$

$$c_2 = \frac{(6,6) + (7,7) + (8,6)}{3} = \frac{(21, 19)}{3} \approx (7.00, 6.33)$$

Iteration 2: Assignment

| Point | $\Vert x - c_1 \Vert^2$ | $\Vert x - c_2 \Vert^2$ | Cluster |
| --- | --- | --- | --- |
| $A = (1,1)$ | $(1.33)^2 + (1)^2 = 2.77$ | $(6)^2 + (5.33)^2 = 64.41$ | C1 |
| $B = (2,2)$ | $(0.33)^2 + (0)^2 = 0.11$ | $(5)^2 + (4.33)^2 = 43.75$ | C1 |
| $C = (4,3)$ | $(1.67)^2 + (1)^2 = 3.79$ | $(3)^2 + (3.33)^2 = 20.09$ | C1 |
| $D = (6,6)$ | $(3.67)^2 + (4)^2 = 29.47$ | $(1)^2 + (0.33)^2 = 1.11$ | C2 |
| $E = (7,7)$ | $(4.67)^2 + (5)^2 = 46.81$ | $(0)^2 + (0.67)^2 = 0.45$ | C2 |
| $F = (8,6)$ | $(5.67)^2 + (4)^2 = 48.15$ | $(1)^2 + (0.33)^2 = 1.11$ | C2 |

Assignments unchanged. Converged.

Final clusters:

  • Cluster 1: $\{A, B, C\}$ with centroid $(2.33, 2.00)$
  • Cluster 2: $\{D, E, F\}$ with centroid $(7.00, 6.33)$

Final WCSS:

$$J = (2.77 + 0.11 + 3.79) + (1.11 + 0.45 + 1.11) = 6.67 + 2.67 = 9.34$$
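The walkthrough above can be verified in a few lines of NumPy (a checking sketch, not part of the derivation):

```python
import numpy as np

X = np.array([[1, 1], [2, 2], [4, 3], [6, 6], [7, 7], [8, 6]], dtype=float)
centroids = X[[0, 4]]  # initial centroids: points A and E

for _ in range(2):  # the example converges after two iterations
    # assignment step: distance of every point to both centroids
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    # update step: each centroid becomes its cluster mean
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

wcss = ((X - centroids[labels]) ** 2).sum()
print(labels)              # [0 0 0 1 1 1]
print(centroids.round(2))  # rows ≈ (2.33, 2.00) and (7.00, 6.33)
print(wcss)                # ≈ 9.33
```

The exact objective is $28/3 \approx 9.33$; the 9.34 above comes from rounding each term to two decimals before summing.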

Initialization matters: K-Means++

Standard K-Means picks initial centroids at random, which can lead to poor results if two centroids land in the same dense region.

K-Means++ picks centroids that are spread out:

  1. Choose the first centroid $c_1$ uniformly at random from the data
  2. For each remaining centroid $c_k$:
    • Compute $D(x_i) = \min_{j < k} \|x_i - c_j\|^2$ for each point
    • Choose $c_k$ with probability proportional to $D(x_i)$
  3. Proceed with standard K-Means
  3. Proceed with standard K-Means

Points that are far from existing centroids are more likely to be selected. This simple change dramatically reduces the chance of bad initializations.

K-Means++ gives an expected WCSS that is at most $O(\log K)$ times the optimal WCSS. In practice, it almost always leads to a better final clustering than random initialization.
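To make the selection rule concrete: with the six example points from above, and assuming the first centroid landed on point $A$, the probabilities for the second centroid work out as follows (a sketch of step 2 only):

```python
import numpy as np

X = np.array([[1, 1], [2, 2], [4, 3], [6, 6], [7, 7], [8, 6]], dtype=float)
c1 = X[0]  # suppose the first centroid landed on point A

# D(x_i): squared distance to the nearest centroid chosen so far (only c1 here)
D = ((X - c1) ** 2).sum(axis=1)
probs = D / D.sum()  # selection probabilities for the second centroid
print(D)             # [ 0.  2. 13. 50. 72. 74.]
print(probs.round(3))
```

Point $A$ itself has probability 0, and the far points $D$, $E$, $F$ carry about 93% of the probability mass, so the second centroid almost certainly lands in the other natural group.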

Example 2: computing WCSS for different K (elbow method)

Using our 6 points, let’s compute WCSS for $K = 1, 2, 3$:

K = 1: All points in one cluster.

$$c = \frac{(1,1) + (2,2) + (4,3) + (6,6) + (7,7) + (8,6)}{6} = \frac{(28, 25)}{6} \approx (4.67, 4.17)$$

$$J = \|A - c\|^2 + \|B - c\|^2 + \|C - c\|^2 + \|D - c\|^2 + \|E - c\|^2 + \|F - c\|^2$$

Computing each term:

$$\|A - c\|^2 = (3.67)^2 + (3.17)^2 = 13.47 + 10.05 = 23.52$$
$$\|B - c\|^2 = (2.67)^2 + (2.17)^2 = 7.13 + 4.71 = 11.84$$
$$\|C - c\|^2 = (0.67)^2 + (1.17)^2 = 0.45 + 1.37 = 1.82$$
$$\|D - c\|^2 = (1.33)^2 + (1.83)^2 = 1.77 + 3.35 = 5.12$$
$$\|E - c\|^2 = (2.33)^2 + (2.83)^2 = 5.43 + 8.01 = 13.44$$
$$\|F - c\|^2 = (3.33)^2 + (1.83)^2 = 11.09 + 3.35 = 14.44$$

$$J_1 = 23.52 + 11.84 + 1.82 + 5.12 + 13.44 + 14.44 = 70.18$$

K = 2: From Example 1 we found $J_2 = 9.34$.

K = 3: Suppose K-Means finds clusters $\{A, B\}$, $\{C, D\}$, $\{E, F\}$:

$$c_1 = (1.5, 1.5), \quad c_2 = (5, 4.5), \quad c_3 = (7.5, 6.5)$$

$$J_3 = [\|A-c_1\|^2 + \|B-c_1\|^2] + [\|C-c_2\|^2 + \|D-c_2\|^2] + [\|E-c_3\|^2 + \|F-c_3\|^2]$$
$$= [0.5 + 0.5] + [3.25 + 3.25] + [0.5 + 0.5] = 1.0 + 6.5 + 1.0 = 8.5$$

| $K$ | WCSS ($J$) | Drop from previous |
| --- | --- | --- |
| 1 | 70.18 | |
| 2 | 9.34 | 60.84 |
| 3 | 8.50 | 0.84 |

The big drop happens from $K=1$ to $K=2$. After that, the improvement is tiny. The “elbow” is at $K=2$, which matches the natural grouping in the data.
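The table can be reproduced from the cluster memberships alone, since each optimal centroid is just its cluster mean (a verification sketch; the tiny differences from the table come from rounding intermediate terms to two decimals):

```python
import numpy as np

pts = {'A': (1, 1), 'B': (2, 2), 'C': (4, 3), 'D': (6, 6), 'E': (7, 7), 'F': (8, 6)}

def wcss(clusters):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for names in clusters:
        P = np.array([pts[n] for n in names], dtype=float)
        total += ((P - P.mean(axis=0)) ** 2).sum()
    return total

J1 = wcss([['A', 'B', 'C', 'D', 'E', 'F']])
J2 = wcss([['A', 'B', 'C'], ['D', 'E', 'F']])
J3 = wcss([['A', 'B'], ['C', 'D'], ['E', 'F']])
print(round(J1, 2), round(J2, 2), round(J3, 2))  # 70.17 9.33 8.5
```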

Choosing K: the elbow method

Plot WCSS against $K$. Look for the point where adding more clusters gives diminishing returns. That “elbow” in the curve suggests the right number of clusters.

```mermaid
graph LR
  A[Plot WCSS vs K] --> B[Find the elbow]
  B --> C[Pick K at the bend]
```

The elbow method is a heuristic, not a formal test. Sometimes the elbow is not obvious. Other approaches include:

  • Silhouette score: Measures how similar each point is to its own cluster versus other clusters. Ranges from $-1$ to $1$; higher is better.
  • Gap statistic: Compares the WCSS to what you would expect from uniformly distributed data. The optimal $K$ has the largest gap.
  • Domain knowledge: Often the most reliable guide. If you are segmenting customers, business context might tell you 3-5 segments make sense.
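As an illustration, here is the mean silhouette score for the $K=2$ clustering of the six example points, computed from the standard definition (a from-scratch sketch; a library routine such as scikit-learn’s `silhouette_score` computes the same quantity):

```python
import numpy as np

X = np.array([[1, 1], [2, 2], [4, 3], [6, 6], [7, 7], [8, 6]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

# pairwise Euclidean distances between all points
dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

scores = []
for i in range(len(X)):
    own = labels == labels[i]
    a = dist[i, own & (np.arange(len(X)) != i)].mean()  # cohesion: mean distance within own cluster
    b = dist[i, ~own].mean()                            # separation: mean distance to the other cluster
    scores.append((b - a) / max(a, b))
print(round(float(np.mean(scores)), 2))  # ≈ 0.67: the two clusters are well separated
```

With more than two clusters, $b$ would be the minimum of the mean distances to each *other* cluster rather than a single mean.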

Convergence properties

K-Means is guaranteed to converge because:

  1. The assignment step minimizes $J$ with respect to assignments (given fixed centroids)
  2. The update step minimizes $J$ with respect to centroids (given fixed assignments)
  3. $J$ is bounded below by 0
  4. There are finitely many possible assignment configurations

However, convergence can be slow. The worst case is exponential in the number of points, though this is rare in practice. Typically K-Means converges in a few dozen iterations.

The algorithm finds a local minimum. Running K-Means multiple times with different initializations and picking the result with the lowest WCSS is standard practice.

Limitations

K-Means makes strong assumptions:

  • Spherical clusters: It uses Euclidean distance, so it works best when clusters are roughly round and equally sized. Elongated or oddly shaped clusters will confuse it.
  • Fixed K: You must specify $K$ in advance. For data where the number of clusters is unknown, this is a real limitation.
  • Sensitivity to outliers: A single outlier can pull a centroid far from where it should be, since centroids are means.
  • Euclidean distance only: The centroid update is the correct minimizer only for squared Euclidean distance, and features on different scales will distort that distance. Always standardize your features first.
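The scaling issue in the last point is cheap to fix: z-score each feature so that all dimensions contribute on equal terms. A minimal sketch (the feature names are illustrative):

```python
import numpy as np

# two features on wildly different scales, e.g. age (years) and income (dollars)
X = np.array([[25, 40_000], [30, 42_000], [60, 41_000], [62, 90_000]], dtype=float)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score each column
print(X_std.std(axis=0))  # [1. 1.] — every feature now has unit variance
```

Without this step, the income column (spread of tens of thousands) would dominate every Euclidean distance and the age column would be effectively ignored.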

For non-spherical clusters, consider Gaussian mixture models (the next topic in this series). For clusters with varying density, look at DBSCAN.

Implementation

Here is K-Means, with K-Means++ initialization, in about 25 lines of Python:

```python
import numpy as np

def kmeans(X, K, max_iters=100):
    n = X.shape[0]
    # K-Means++ initialization
    centroids = [X[np.random.randint(n)]]
    for _ in range(1, K):
        dists = np.min([np.sum((X - c)**2, axis=1) for c in centroids], axis=0)
        probs = dists / dists.sum()
        centroids.append(X[np.random.choice(n, p=probs)])
    centroids = np.array(centroids)

    for _ in range(max_iters):
        # Assignment step
        dists = np.array([np.sum((X - c)**2, axis=1) for c in centroids])
        labels = np.argmin(dists, axis=0)
        # Update step (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Summary

| Concept | Key idea |
| --- | --- |
| Algorithm | Alternate between assigning points and updating centroids |
| K-Means++ | Smart initialization; pick centroids that are spread out |
| WCSS | Objective function: sum of squared distances to cluster centers |
| Elbow method | Plot WCSS vs. $K$; pick the bend |
| Convergence | Always converges to a local minimum |
| Limitations | Assumes spherical clusters, needs $K$ specified, sensitive to outliers |

What comes next

K-Means gives you hard cluster assignments: each point belongs to exactly one cluster. But what if a point is 70% cluster A and 30% cluster B? Before we get to soft clustering with Gaussian mixture models, we will first look at PCA for dimensionality reduction, which helps you visualize and compress high-dimensional data.
