
Gaussian mixture models and EM algorithm

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

K-Means assigns each point to exactly one cluster. That is a hard decision. But real data often sits between clusters, and pretending otherwise throws away useful information. Gaussian mixture models (GMMs) fix this by giving each point a probability of belonging to each cluster. The EM algorithm is how we fit these models.

Prerequisites: You should understand probability distributions (especially the Gaussian) and K-Means clustering.

Why GMMs and EM?

You measure the heights of 1000 people. The histogram shows two bumps, not one. Why? Because your sample contains two overlapping groups, and each group has its own typical height and spread.

Here is a small sample:

| Person | Height (cm) | Group |
|---|---|---|
| 1 | 162 | ? |
| 2 | 175 | ? |
| 3 | 158 | ? |
| 4 | 181 | ? |
| 5 | 168 | ? |
| 6 | 155 | ? |
| 7 | 178 | ? |
| 8 | 171 | ? |

The group labels are unknown. Some heights clearly belong to the shorter group (155, 158). Others clearly belong to the taller group (178, 181). But what about 168 or 171? They sit right in the overlap zone. Assigning them to one group with 100% confidence would be wrong. They probably belong partially to both.

A Gaussian mixture model says your data came from K bell curves mixed together. Each bell curve has its own center (mean), width (variance), and relative size (mixing weight). The EM algorithm figures out which curve most likely generated each point.

The EM algorithm's iterative loop:

graph LR
  A["Guess initial<br/>parameters"] --> B["Assign: compute<br/>soft memberships"]
  B --> C["Update: recalculate<br/>means and variances"]
  C --> D{"Converged?"}
  D -->|No| B
  D -->|Yes| E["Final model"]

Mixture concept: two overlapping bell curves:

graph TD
  subgraph mixture["Observed Data"]
      G1["Bell curve 1:<br/>shorter group<br/>mean=160, sd=5"]
      G2["Bell curve 2:<br/>taller group<br/>mean=176, sd=5"]
  end
  G1 --> overlap["Overlap zone:<br/>heights 165 to 172<br/>could belong to either group"]
  G2 --> overlap
  overlap --> observed["Combined histogram<br/>shows two bumps"]

Two overlapping Gaussian distributions forming a mixture

EM starts with a rough guess for each bell curve’s parameters. Then it alternates: first compute how strongly each point belongs to each curve (soft assignment), then update each curve’s parameters using those soft assignments as weights. Each cycle improves the fit. After enough cycles, the parameters stabilize and you have your model.

The key difference from K-Means: EM gives probabilities, not hard labels. A point can be 70% group 1 and 30% group 2. This captures the uncertainty in the overlap region.

Now let’s write this down formally.

What is a Gaussian mixture model?

A GMM says: your data was generated by $K$ Gaussian distributions, each with its own mean $\mu_k$, variance $\sigma_k^2$ (or covariance $\Sigma_k$ in higher dimensions), and a mixing weight $\pi_k$. The mixing weights satisfy $\sum_{k=1}^K \pi_k = 1$.

The probability density of a data point $x$ is:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2)$$

where $\mathcal{N}(x \mid \mu, \sigma^2)$ is the Gaussian density:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Each point comes from one of the $K$ components, but we do not know which one. The component labels are latent variables: hidden quantities that we need to infer.

Soft assignments vs hard assignments

In K-Means, point $x_i$ belongs to cluster $k$ or it does not. In a GMM, we compute the responsibility of each component for each point:

$$r_{ik} = P(\text{component } k \mid x_i) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}$$

This is just Bayes' theorem. The responsibility $r_{ik}$ is a number between 0 and 1, and $\sum_k r_{ik} = 1$ for each point. A point can be 80% component 1 and 20% component 2.
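Because the responsibility is just Bayes' theorem, it takes only a few lines of NumPy to compute. This is a minimal sketch with illustrative parameters (means 2 and 5, unit variance, equal weights), not values tied to any particular dataset:

```python
import numpy as np
from scipy.stats import norm

# Illustrative parameters: two components, equal mixing weights
pi = np.array([0.5, 0.5])
mu = np.array([2.0, 5.0])
sigma = np.array([1.0, 1.0])

x = 3.0  # a point between the two means
# Numerator of Bayes' theorem: prior (mixing weight) times likelihood
numer = pi * norm.pdf(x, mu, sigma)
resp = numer / numer.sum()  # normalize so responsibilities sum to 1

print(resp)        # the point is closer to mu=2, so resp[0] > resp[1]
print(resp.sum())  # responsibilities always sum to 1
```

The point at 3.0 gets a soft split rather than a hard label: most of the responsibility goes to the nearer component, but not all of it.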

Soft cluster assignments: opacity reflects membership probability

The EM algorithm

We cannot directly maximize the log-likelihood of a mixture model because the latent variables couple the parameters together. The EM algorithm sidesteps this by iterating between two steps:

E-step (Expectation): Given current parameters $(\pi_k, \mu_k, \sigma_k^2)$, compute responsibilities:

$$r_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}$$

M-step (Maximization): Given responsibilities, update parameters:

$$N_k = \sum_{i=1}^n r_{ik}$$

$$\mu_k = \frac{1}{N_k}\sum_{i=1}^n r_{ik} \, x_i$$

$$\sigma_k^2 = \frac{1}{N_k}\sum_{i=1}^n r_{ik}(x_i - \mu_k)^2$$

$$\pi_k = \frac{N_k}{n}$$

$N_k$ is the "effective number of points" assigned to component $k$. The M-step formulas are weighted versions of the sample mean and variance, with the responsibilities as weights.

graph LR
  A[Initialize parameters] --> B[E-step: compute responsibilities]
  B --> C[M-step: update parameters]
  C --> D{Converged?}
  D -->|No| B
  D -->|Yes| E[Done]

Log-likelihood

The log-likelihood for a GMM with $n$ data points is:

$$\ell(\theta) = \sum_{i=1}^n \log \left(\sum_{k=1}^K \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)\right)$$

EM is guaranteed to increase $\ell$ at every iteration (or leave it unchanged at convergence). It will not decrease. However, like K-Means, EM only finds a local maximum, so initialization matters.

A common initialization strategy is to run K-Means first, use the resulting cluster assignments to set initial $\mu_k$ and $\sigma_k^2$, and set $\pi_k = N_k / n$.
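That initialization strategy can be sketched in a few lines. Here `labels` stands in for the hard assignments produced by any K-Means routine, and the variance floor of `1e-6` is an arbitrary safeguard against a cluster with (near-)zero spread:

```python
import numpy as np

def init_from_kmeans(X, labels, K):
    """Turn hard K-Means labels into initial GMM parameters (1D sketch)."""
    mu = np.zeros(K)
    sigma2 = np.zeros(K)
    pi = np.zeros(K)
    for k in range(K):
        members = X[labels == k]
        mu[k] = members.mean()
        # Floor the variance so a tiny cluster cannot start degenerate
        sigma2[k] = max(members.var(), 1e-6)
        pi[k] = len(members) / len(X)
    return mu, sigma2, pi
```

For example, `init_from_kmeans(np.array([1, 2, 3.5, 5, 6.]), np.array([0, 0, 0, 1, 1]), 2)` seeds component 1 with the mean and variance of the three lower points and sets the mixing weights to 0.6 and 0.4.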

Detailed E-step and M-step breakdown:

graph TD
  subgraph estep["E-step: Assign Responsibilities"]
      E1["For each point and<br/>each component"] --> E2["Compute how likely this<br/>component generated this point"]
      E2 --> E3["Normalize to get<br/>responsibility r_ik"]
  end
  subgraph mstep["M-step: Update Parameters"]
      M1["New mean: weighted<br/>average of all points"] --> M2["New variance: weighted<br/>spread around new mean"]
      M2 --> M3["New mixing weight:<br/>fraction of total responsibility"]
  end
  estep --> mstep
  mstep -->|"Repeat until<br/>parameters stabilize"| estep

Example 1: one full EM iteration on 1D data

Consider five data points and a mixture of two Gaussians ($K = 2$):

$$x = \{1, 2, 3.5, 5, 6\}$$

Initial parameters:

| Parameter | Component 1 | Component 2 |
|---|---|---|
| $\mu_k$ | 2 | 5 |
| $\sigma_k$ | 1 | 1 |
| $\pi_k$ | 0.5 | 0.5 |

E-step: compute responsibilities

For each point, we need $\mathcal{N}(x_i \mid \mu_k, \sigma_k^2)$. With $\sigma = 1$:

$$\mathcal{N}(x \mid \mu, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2}\right)$$

For $x = 1$:

$$\mathcal{N}(1 \mid 2, 1) = \frac{1}{\sqrt{2\pi}} e^{-0.5} \approx 0.2420$$

$$\mathcal{N}(1 \mid 5, 1) = \frac{1}{\sqrt{2\pi}} e^{-8} \approx 0.000134$$

$$r_{1,1} = \frac{0.5 \times 0.2420}{0.5 \times 0.2420 + 0.5 \times 0.000134} = \frac{0.1210}{0.1211} \approx 0.999$$

For $x = 2$:

$$\mathcal{N}(2 \mid 2, 1) = \frac{1}{\sqrt{2\pi}} e^{0} \approx 0.3989$$

$$\mathcal{N}(2 \mid 5, 1) = \frac{1}{\sqrt{2\pi}} e^{-4.5} \approx 0.0044$$

$$r_{2,1} = \frac{0.5 \times 0.3989}{0.5 \times 0.3989 + 0.5 \times 0.0044} = \frac{0.1995}{0.2017} \approx 0.989$$

For $x = 3.5$:

$$\mathcal{N}(3.5 \mid 2, 1) = \frac{1}{\sqrt{2\pi}} e^{-1.125} \approx 0.1295$$

$$\mathcal{N}(3.5 \mid 5, 1) = \frac{1}{\sqrt{2\pi}} e^{-1.125} \approx 0.1295$$

Both densities are equal (the point is equidistant from both means), so:

$$r_{3,1} = 0.500, \quad r_{3,2} = 0.500$$

For $x = 5$, by symmetry with $x = 2$: $r_{4,1} \approx 0.011$, $r_{4,2} \approx 0.989$.

For $x = 6$, by symmetry with $x = 1$: $r_{5,1} \approx 0.001$, $r_{5,2} \approx 0.999$.

Summary of responsibilities:

| $x_i$ | $r_{i,1}$ | $r_{i,2}$ |
|---|---|---|
| 1 | 0.999 | 0.001 |
| 2 | 0.989 | 0.011 |
| 3.5 | 0.500 | 0.500 |
| 5 | 0.011 | 0.989 |
| 6 | 0.001 | 0.999 |

Notice: points near a component's mean have high responsibility for that component. The middle point at $x = 3.5$ is split equally.
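The hand-computed table can be reproduced numerically as a sanity check (a verification, not part of the derivation):

```python
import numpy as np
from scipy.stats import norm

X = np.array([1.0, 2.0, 3.5, 5.0, 6.0])
pi = np.array([0.5, 0.5])
mu = np.array([2.0, 5.0])
sigma = np.array([1.0, 1.0])

# E-step for all points at once: one row per point, one column per component
numer = pi * norm.pdf(X[:, None], mu, sigma)
resp = numer / numer.sum(axis=1, keepdims=True)

print(np.round(resp, 3))
# First column matches the table: about 0.999, 0.989, 0.500, 0.011, 0.001
```

The broadcasting trick `X[:, None]` evaluates every point against every component in one call, which is exactly what the vectorized E-step in the implementation section does.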

M-step: update parameters

Effective counts:

$$N_1 = 0.999 + 0.989 + 0.500 + 0.011 + 0.001 = 2.500$$

$$N_2 = 0.001 + 0.011 + 0.500 + 0.989 + 0.999 = 2.500$$

Updated means:

$$\mu_1 = \frac{0.999(1) + 0.989(2) + 0.500(3.5) + 0.011(5) + 0.001(6)}{2.500} = \frac{4.788}{2.500} = 1.915$$

$$\mu_2 = \frac{0.001(1) + 0.011(2) + 0.500(3.5) + 0.989(5) + 0.999(6)}{2.500} = \frac{12.712}{2.500} = 5.085$$

Updated variances:

$$\sigma_1^2 = \frac{1}{N_1}\sum_i r_{i,1}(x_i - \mu_1)^2$$

| $x_i$ | $(x_i - 1.915)^2$ | $r_{i,1} \times (x_i - 1.915)^2$ |
|---|---|---|
| 1 | 0.838 | 0.837 |
| 2 | 0.007 | 0.007 |
| 3.5 | 2.512 | 1.256 |
| 5 | 9.512 | 0.105 |
| 6 | 16.688 | 0.017 |

$$\sigma_1^2 = \frac{0.837 + 0.007 + 1.256 + 0.105 + 0.017}{2.500} = \frac{2.222}{2.500} = 0.889$$

By symmetry of the data around 3.5, $\sigma_2^2 \approx 0.889$ as well.

Updated mixing weights:

$$\pi_1 = \frac{2.500}{5} = 0.500, \quad \pi_2 = 0.500$$

After one EM iteration:

| Parameter | Component 1 | Component 2 |
|---|---|---|
| $\mu_k$ | 1.915 | 5.085 |
| $\sigma_k^2$ | 0.889 | 0.889 |
| $\pi_k$ | 0.500 | 0.500 |

The means moved slightly: $\mu_1$ shifted from 2.0 to 1.915 (pulled toward the data at $x = 1$), and $\mu_2$ shifted from 5.0 to 5.085 (pulled toward $x = 6$). The variances decreased from 1.0 to 0.889 because the model is becoming more confident about which points belong where.
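The M-step can also be checked numerically, continuing from the same E-step. The results differ from the hand values in the third decimal place only because the table above rounded the responsibilities before summing:

```python
import numpy as np
from scipy.stats import norm

X = np.array([1.0, 2.0, 3.5, 5.0, 6.0])
pi = np.array([0.5, 0.5])
mu = np.array([2.0, 5.0])
sigma = np.array([1.0, 1.0])

# E-step: responsibilities for every point and component
numer = pi * norm.pdf(X[:, None], mu, sigma)
resp = numer / numer.sum(axis=1, keepdims=True)

# M-step: effective counts, weighted means, weighted variances, weights
Nk = resp.sum(axis=0)
mu_new = (resp * X[:, None]).sum(axis=0) / Nk
sigma2_new = (resp * (X[:, None] - mu_new) ** 2).sum(axis=0) / Nk
pi_new = Nk / len(X)

print(np.round(mu_new, 3))      # close to the hand values 1.915 and 5.085
print(np.round(sigma2_new, 3))  # close to 0.889, up to rounding
```

Everything here is just weighted averaging: the responsibilities act as fractional counts, so each formula is the ordinary sample statistic with those fractions as weights.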

Example 2: second EM iteration

Using the updated parameters, let’s run one more iteration.

E-step with new parameters

Now $\mu_1 = 1.915$, $\mu_2 = 5.085$, and $\sigma_1 = \sigma_2 = \sqrt{0.889} \approx 0.943$.

For $x = 1$:

$$\frac{(1 - 1.915)^2}{2 \times 0.889} = \frac{0.838}{1.778} = 0.471$$

$$\mathcal{N}(1 \mid 1.915, 0.889) \propto e^{-0.471} \approx 0.624$$

$$\frac{(1 - 5.085)^2}{2 \times 0.889} = \frac{16.688}{1.778} = 9.385$$

$$\mathcal{N}(1 \mid 5.085, 0.889) \propto e^{-9.385} \approx 0.0001$$

$$r_{1,1} \approx \frac{0.624}{0.624 + 0.0001} \approx 0.9998$$

For $x = 2$:

$$\frac{(2 - 1.915)^2}{2 \times 0.889} = \frac{0.007}{1.778} = 0.004$$

$$\mathcal{N} \propto e^{-0.004} \approx 0.996$$

$$\frac{(2 - 5.085)^2}{2 \times 0.889} = \frac{9.512}{1.778} = 5.350$$

$$\mathcal{N} \propto e^{-5.350} \approx 0.0047$$

$$r_{2,1} \approx \frac{0.996}{0.996 + 0.0047} \approx 0.995$$

For $x = 3.5$:

$$\frac{(3.5 - 1.915)^2}{2 \times 0.889} = \frac{2.512}{1.778} = 1.413$$

$$\mathcal{N} \propto e^{-1.413} \approx 0.244$$

$$\frac{(3.5 - 5.085)^2}{2 \times 0.889} = \frac{2.512}{1.778} = 1.413$$

$$\mathcal{N} \propto e^{-1.413} \approx 0.244$$

$$r_{3,1} = 0.500 \quad \text{(still exactly equal)}$$

Updated responsibilities after iteration 2:

| $x_i$ | $r_{i,1}$ | $r_{i,2}$ |
|---|---|---|
| 1 | 0.9998 | 0.0002 |
| 2 | 0.995 | 0.005 |
| 3.5 | 0.500 | 0.500 |
| 5 | 0.005 | 0.995 |
| 6 | 0.0002 | 0.9998 |

The responsibilities sharpened. Points 1 and 6 are now assigned almost entirely to their respective components. The model is becoming more confident with each iteration.

M-step update

$$N_1 = 0.9998 + 0.995 + 0.500 + 0.005 + 0.0002 = 2.500$$

$$\mu_1 = \frac{0.9998(1) + 0.995(2) + 0.500(3.5) + 0.005(5) + 0.0002(6)}{2.500} = \frac{4.765}{2.500} = 1.906$$

The mean barely moved ($1.915 \to 1.906$). Convergence is near.

Convergence

EM converges when the log-likelihood stops increasing (or the parameter changes fall below a threshold). A few key properties:

  • EM never decreases the log-likelihood. Each iteration either improves it or leaves it the same.
  • EM converges to a local maximum, not necessarily the global one. Multiple restarts with different initializations help.
  • Convergence can be slow near the maximum. The rate is linear, not quadratic like Newton’s method.
  • EM can get stuck if a component “collapses” onto a single data point, driving $\sigma_k \to 0$ and the likelihood to infinity. A common fix is to set a floor on the variance or add a small regularization term.
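The variance-floor fix from the last bullet is a one-line guard in the M-step. The floor value of `1e-6` here is an arbitrary choice that should be scaled to the data:

```python
import numpy as np

VAR_FLOOR = 1e-6  # arbitrary small floor; tune to the data's scale

def update_variance(X, resp_k, mu_k, Nk):
    """M-step variance update for one component, with a collapse guard."""
    sigma2 = np.sum(resp_k * (X - mu_k) ** 2) / Nk
    # If the weighted spread degenerates toward zero, clamp it
    return max(sigma2, VAR_FLOOR)
```

Without the guard, a component that claims a single point exactly would drive its variance to zero and blow up the likelihood; with it, EM simply carries a small but valid variance forward.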

GMM vs K-Means

| | K-Means | GMM |
|---|---|---|
| Assignments | Hard (0 or 1) | Soft (probabilities) |
| Cluster shape | Spherical | Elliptical (with full covariance) |
| Objective | WCSS | Log-likelihood |
| Algorithm | Lloyd’s algorithm | EM |
| Output | Cluster labels | Probabilities + density model |

GMMs are more flexible but have more parameters. K-Means can be seen as a limiting special case of EM for GMMs in which all covariances are $\sigma^2 I$ and responsibilities are forced to 0 or 1.

K-Means hard assignment vs GMM soft assignment:

graph TD
  P["Point sits between<br/>two cluster centers"] --> KM["K-Means:<br/>100% to nearest cluster"]
  P --> GMM["GMM:<br/>70% cluster A,<br/>30% cluster B"]
  KM --> KR["Hard boundary:<br/>point belongs to exactly<br/>one cluster"]
  GMM --> GR["Soft boundary:<br/>point has partial membership<br/>in multiple clusters"]

Higher dimensions

In $d$ dimensions, each component has a $d$-dimensional mean $\mu_k$ and a $d \times d$ covariance matrix $\Sigma_k$. The density becomes:

$$\mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k)\right)$$

The M-step for the covariance is:

$$\Sigma_k = \frac{1}{N_k}\sum_{i=1}^n r_{ik}(x_i - \mu_k)(x_i - \mu_k)^T$$

With a full covariance matrix, each component can model an elliptical cluster oriented in any direction. You can also restrict $\Sigma_k$ to be diagonal (axis-aligned ellipses) or spherical ($\sigma_k^2 I$) to reduce the number of parameters.
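A minimal sketch of that covariance update, assuming `X` is an (n, d) array and `resp_k` holds one component's responsibilities:

```python
import numpy as np

def mstep_covariance(X, resp_k, mu_k):
    """Weighted covariance update for one GMM component.

    X: (n, d) data, resp_k: (n,) responsibilities, mu_k: (d,) updated mean.
    """
    Nk = resp_k.sum()
    diff = X - mu_k  # (n, d) deviations from the component mean
    # Weighted sum of outer products diff_i diff_i^T, over the effective count
    return (resp_k[:, None] * diff).T @ diff / Nk
```

The matrix product `(resp_k[:, None] * diff).T @ diff` computes $\sum_i r_{ik}(x_i - \mu_k)(x_i - \mu_k)^T$ without an explicit loop over points.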

Implementation

```python
import numpy as np
from scipy.stats import norm

def gmm_em(X, K, max_iters=50, tol=1e-6):
    """Fit a 1D Gaussian mixture with EM.

    Returns means, variances, mixing weights, and final responsibilities.
    """
    n = len(X)
    # Initialize means by picking K distinct data points at random
    mu = np.random.choice(X, K, replace=False).astype(float)
    sigma2 = np.ones(K)
    pi = np.ones(K) / K

    for iteration in range(max_iters):
        # E-step: responsibilities via Bayes' theorem
        resp = np.zeros((n, K))
        for k in range(K):
            resp[:, k] = pi[k] * norm.pdf(X, mu[k], np.sqrt(sigma2[k]))
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: weighted counts, means, variances, mixing weights
        Nk = resp.sum(axis=0)
        mu_new = (resp * X[:, None]).sum(axis=0) / Nk
        sigma2_new = np.array([
            np.sum(resp[:, k] * (X - mu_new[k])**2) / Nk[k]
            for k in range(K)
        ])
        pi_new = Nk / n

        # Stop when the means stabilize
        if np.allclose(mu, mu_new, atol=tol):
            break
        mu, sigma2, pi = mu_new, sigma2_new, pi_new

    return mu, sigma2, pi, resp
```
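In practice you would usually reach for a library implementation rather than a teaching version like the one above. Assuming scikit-learn is installed, its `GaussianMixture` class fits the same five-point example; `random_state` pins the initialization:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([1.0, 2.0, 3.5, 5.0, 6.0]).reshape(-1, 1)  # sklearn expects 2D
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(np.sort(gm.means_.ravel()))  # one mean on each side of the gap
print(gm.predict_proba(X))         # soft assignments; each row sums to 1
```

`predict_proba` returns exactly the responsibilities $r_{ik}$, and scikit-learn's `reg_covar` parameter plays the role of the variance floor discussed in the convergence section.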

Summary

| Concept | Key idea |
|---|---|
| GMM | Model data as a mixture of $K$ Gaussians |
| Responsibility | Soft assignment: probability that point $i$ came from component $k$ |
| E-step | Compute responsibilities using current parameters |
| M-step | Update parameters using weighted statistics |
| Log-likelihood | EM increases it every iteration; converges to a local maximum |
| vs K-Means | GMM gives soft assignments and models elliptical clusters |

What comes next

Both K-Means and GMMs require choosing the number of clusters. How do you pick the right model complexity, not just for clustering but for any machine learning model? In the next post on model selection and cross-validation, you will learn systematic ways to compare models and tune hyperparameters.
