
Gaussian mixture models and EM algorithm

In this series (18 parts)
  1. What is machine learning: a map of the field
  2. Data, features, and the ML pipeline
  3. Linear regression
  4. Bias, variance, and the tradeoff
  5. Regularization: Ridge, Lasso, and ElasticNet
  6. Logistic regression and classification
  7. Evaluation metrics for classification
  8. Naive Bayes classifier
  9. K-Nearest Neighbors
  10. Decision trees
  11. Ensemble methods: Bagging and Random Forests
  12. Boosting: AdaBoost and Gradient Boosting
  13. Support Vector Machines
  14. K-Means clustering
  15. Dimensionality Reduction: PCA
  16. Gaussian mixture models and EM algorithm
  17. Model selection and cross-validation
  18. Feature engineering and selection

K-Means assigns each point to exactly one cluster. That is a hard decision. But real data often sits between clusters, and pretending otherwise throws away useful information. Gaussian mixture models (GMMs) fix this by giving each point a probability of belonging to each cluster. The EM algorithm is how we fit these models.

Prerequisites: You should understand probability distributions (especially the Gaussian) and K-Means clustering.

Why GMMs and EM?

You measure the heights of 1000 people. The histogram shows two bumps, not one. Why? Because your sample contains two overlapping groups, and each group has its own typical height and spread.

Here is a small sample:

| Person | Height (cm) | Group |
|---|---|---|
| 1 | 162 | ? |
| 2 | 175 | ? |
| 3 | 158 | ? |
| 4 | 181 | ? |
| 5 | 168 | ? |
| 6 | 155 | ? |
| 7 | 178 | ? |
| 8 | 171 | ? |

The group labels are unknown. Some heights clearly belong to the shorter group (155, 158). Others clearly belong to the taller group (178, 181). But what about 168 or 171? They sit right in the overlap zone. Assigning them to one group with 100% confidence would be wrong. They probably belong partially to both.

A Gaussian mixture model says your data came from K bell curves mixed together. Each bell curve has its own center (mean), width (variance), and relative size (mixing weight). The EM algorithm figures out which curve most likely generated each point.

The EM algorithm's iterative loop:

graph LR
  A["Guess initial<br/>parameters"] --> B["Assign: compute<br/>soft memberships"]
  B --> C["Update: recalculate<br/>means and variances"]
  C --> D{"Converged?"}
  D -->|No| B
  D -->|Yes| E["Final model"]

Mixture concept: two overlapping bell curves:

graph TD
  subgraph mixture["Observed Data"]
      G1["Bell curve 1:<br/>shorter group<br/>mean=160, sd=5"]
      G2["Bell curve 2:<br/>taller group<br/>mean=176, sd=5"]
  end
  G1 --> overlap["Overlap zone:<br/>heights 165 to 172<br/>could belong to either group"]
  G2 --> overlap
  overlap --> observed["Combined histogram<br/>shows two bumps"]

Two overlapping Gaussian distributions forming a mixture

EM starts with a rough guess for each bell curve’s parameters. Then it alternates: first compute how strongly each point belongs to each curve (soft assignment), then update each curve’s parameters using those soft assignments as weights. Each cycle improves the fit. After enough cycles, the parameters stabilize and you have your model.

The key difference from K-Means: EM gives probabilities, not hard labels. A point can be 70% group 1 and 30% group 2. This captures the uncertainty in the overlap region.

Now let’s write this down formally.

What is a Gaussian mixture model?

A GMM says: your data was generated by $K$ Gaussian distributions, each with its own mean $\mu_k$, variance $\sigma_k^2$ (or covariance $\Sigma_k$ in higher dimensions), and a mixing weight $\pi_k$. The mixing weights satisfy $\sum_{k=1}^K \pi_k = 1$.

The probability density of a data point $x$ is:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2)$$

where $\mathcal{N}(x \mid \mu, \sigma^2)$ is the Gaussian density:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Each point comes from one of the $K$ components, but we do not know which one. The component labels are latent variables: hidden quantities that we need to infer.

Soft assignments vs hard assignments

In K-Means, point $x_i$ belongs to cluster $k$ or it does not. In a GMM, we compute the responsibility of each component for each point:

$$r_{ik} = P(\text{component } k \mid x_i) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}$$

This is just Bayes' theorem. The responsibility $r_{ik}$ is a number between 0 and 1, and $\sum_k r_{ik} = 1$ for each point. A point can be 80% component 1 and 20% component 2.
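Because the responsibility is just Bayes' theorem, it takes only a few lines of NumPy to compute. This is a minimal sketch with illustrative parameters (means 2 and 5, unit variance, equal weights), not values tied to any particular dataset:

```python
import numpy as np
from scipy.stats import norm

# Illustrative parameters: two components, equal mixing weights
pi = np.array([0.5, 0.5])
mu = np.array([2.0, 5.0])
sigma = np.array([1.0, 1.0])

x = 3.0  # a point between the two means
# Numerator of Bayes' theorem: prior (mixing weight) times likelihood
numer = pi * norm.pdf(x, mu, sigma)
resp = numer / numer.sum()  # normalize so responsibilities sum to 1

print(resp)        # the point is closer to mu=2, so resp[0] > resp[1]
print(resp.sum())  # responsibilities always sum to 1
```

The point at 3.0 gets a soft split rather than a hard label: most of the responsibility goes to the nearer component, but not all of it.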

Soft cluster assignments: opacity reflects membership probability

The EM algorithm

We cannot directly maximize the log-likelihood of a mixture model because the latent variables couple the parameters together. The EM algorithm sidesteps this by iterating between two steps:

E-step (Expectation): Given current parameters $(\pi_k, \mu_k, \sigma_k^2)$, compute responsibilities:

$$r_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}$$

M-step (Maximization): Given responsibilities, update parameters:

$$N_k = \sum_{i=1}^n r_{ik}$$

$$\mu_k = \frac{1}{N_k}\sum_{i=1}^n r_{ik} \, x_i$$

$$\sigma_k^2 = \frac{1}{N_k}\sum_{i=1}^n r_{ik}(x_i - \mu_k)^2$$

$$\pi_k = \frac{N_k}{n}$$

$N_k$ is the "effective number of points" assigned to component $k$. The M-step formulas are weighted versions of the sample mean and variance, with the responsibilities as weights.

graph LR
  A[Initialize parameters] --> B[E-step: compute responsibilities]
  B --> C[M-step: update parameters]
  C --> D{Converged?}
  D -->|No| B
  D -->|Yes| E[Done]

Log-likelihood

The log-likelihood for a GMM with $n$ data points is:

$$\ell(\theta) = \sum_{i=1}^n \log \left(\sum_{k=1}^K \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)\right)$$

EM is guaranteed to increase $\ell$ at every iteration (or leave it unchanged at convergence). It will not decrease. However, like K-Means, EM only finds a local maximum, so initialization matters.

A common initialization strategy is to run K-Means first, use the resulting cluster assignments to set initial $\mu_k$ and $\sigma_k^2$, and set $\pi_k = N_k / n$.
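That initialization strategy can be sketched in a few lines. Here `labels` stands in for the hard assignments produced by any K-Means routine, and the variance floor of `1e-6` is an arbitrary safeguard against a cluster with (near-)zero spread:

```python
import numpy as np

def init_from_kmeans(X, labels, K):
    """Turn hard K-Means labels into initial GMM parameters (1D sketch)."""
    mu = np.zeros(K)
    sigma2 = np.zeros(K)
    pi = np.zeros(K)
    for k in range(K):
        members = X[labels == k]
        mu[k] = members.mean()
        # Floor the variance so a tiny cluster cannot start degenerate
        sigma2[k] = max(members.var(), 1e-6)
        pi[k] = len(members) / len(X)
    return mu, sigma2, pi
```

For example, `init_from_kmeans(np.array([1, 2, 3.5, 5, 6.]), np.array([0, 0, 0, 1, 1]), 2)` seeds component 1 with the mean and variance of the three lower points and sets the mixing weights to 0.6 and 0.4.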

Detailed E-step and M-step breakdown:

graph TD
  subgraph estep["E-step: Assign Responsibilities"]
      E1["For each point and<br/>each component"] --> E2["Compute how likely this<br/>component generated this point"]
      E2 --> E3["Normalize to get<br/>responsibility r_ik"]
  end
  subgraph mstep["M-step: Update Parameters"]
      M1["New mean: weighted<br/>average of all points"] --> M2["New variance: weighted<br/>spread around new mean"]
      M2 --> M3["New mixing weight:<br/>fraction of total responsibility"]
  end
  estep --> mstep
  mstep -->|"Repeat until<br/>parameters stabilize"| estep

Example 1: one full EM iteration on 1D data

Consider five data points and a mixture of two Gaussians ($K = 2$):

$$x = \{1, 2, 3.5, 5, 6\}$$

Initial parameters:

| Parameter | Component 1 | Component 2 |
|---|---|---|
| $\mu_k$ | 2 | 5 |
| $\sigma_k$ | 1 | 1 |
| $\pi_k$ | 0.5 | 0.5 |

E-step: compute responsibilities

For each point, we need $\mathcal{N}(x_i \mid \mu_k, \sigma_k^2)$. With $\sigma = 1$:

$$\mathcal{N}(x \mid \mu, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2}\right)$$

For $x = 1$:

$$\mathcal{N}(1 \mid 2, 1) = \frac{1}{\sqrt{2\pi}} e^{-0.5} \approx 0.2420$$

$$\mathcal{N}(1 \mid 5, 1) = \frac{1}{\sqrt{2\pi}} e^{-8} \approx 0.000134$$

$$r_{1,1} = \frac{0.5 \times 0.2420}{0.5 \times 0.2420 + 0.5 \times 0.000134} = \frac{0.1210}{0.1211} \approx 0.999$$

For $x = 2$:

$$\mathcal{N}(2 \mid 2, 1) = \frac{1}{\sqrt{2\pi}} e^{0} \approx 0.3989$$

$$\mathcal{N}(2 \mid 5, 1) = \frac{1}{\sqrt{2\pi}} e^{-4.5} \approx 0.0044$$

$$r_{2,1} = \frac{0.5 \times 0.3989}{0.5 \times 0.3989 + 0.5 \times 0.0044} = \frac{0.1995}{0.2017} \approx 0.989$$

For $x = 3.5$:

$$\mathcal{N}(3.5 \mid 2, 1) = \frac{1}{\sqrt{2\pi}} e^{-1.125} \approx 0.1295$$

$$\mathcal{N}(3.5 \mid 5, 1) = \frac{1}{\sqrt{2\pi}} e^{-1.125} \approx 0.1295$$

Both densities are equal (the point is equidistant from both means), so:

$$r_{3,1} = 0.500, \quad r_{3,2} = 0.500$$

For $x = 5$, by symmetry with $x = 2$: $r_{4,1} \approx 0.011$, $r_{4,2} \approx 0.989$.

For $x = 6$, by symmetry with $x = 1$: $r_{5,1} \approx 0.001$, $r_{5,2} \approx 0.999$.

Summary of responsibilities:

| $x_i$ | $r_{i,1}$ | $r_{i,2}$ |
|---|---|---|
| 1 | 0.999 | 0.001 |
| 2 | 0.989 | 0.011 |
| 3.5 | 0.500 | 0.500 |
| 5 | 0.011 | 0.989 |
| 6 | 0.001 | 0.999 |

Notice: points near a component's mean have high responsibility for that component. The middle point at $x = 3.5$ is split equally.
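The hand-computed table can be reproduced numerically as a sanity check (a verification, not part of the derivation):

```python
import numpy as np
from scipy.stats import norm

X = np.array([1.0, 2.0, 3.5, 5.0, 6.0])
pi = np.array([0.5, 0.5])
mu = np.array([2.0, 5.0])
sigma = np.array([1.0, 1.0])

# E-step for all points at once: one row per point, one column per component
numer = pi * norm.pdf(X[:, None], mu, sigma)
resp = numer / numer.sum(axis=1, keepdims=True)

print(np.round(resp, 3))
# First column matches the table: about 0.999, 0.989, 0.500, 0.011, 0.001
```

The broadcasting trick `X[:, None]` evaluates every point against every component in one call, which is exactly what the vectorized E-step in the implementation section does.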

M-step: update parameters

Effective counts:

$$N_1 = 0.999 + 0.989 + 0.500 + 0.011 + 0.001 = 2.500$$

$$N_2 = 0.001 + 0.011 + 0.500 + 0.989 + 0.999 = 2.500$$

Updated means:

$$\mu_1 = \frac{0.999(1) + 0.989(2) + 0.500(3.5) + 0.011(5) + 0.001(6)}{2.500} = \frac{4.788}{2.500} = 1.915$$

$$\mu_2 = \frac{0.001(1) + 0.011(2) + 0.500(3.5) + 0.989(5) + 0.999(6)}{2.500} = \frac{12.712}{2.500} = 5.085$$

Updated variances:

$$\sigma_1^2 = \frac{1}{N_1}\sum_i r_{i,1}(x_i - \mu_1)^2$$

| $x_i$ | $(x_i - 1.915)^2$ | $r_{i,1} \times (x_i - 1.915)^2$ |
|---|---|---|
| 1 | 0.838 | 0.837 |
| 2 | 0.007 | 0.007 |
| 3.5 | 2.512 | 1.256 |
| 5 | 9.512 | 0.105 |
| 6 | 16.688 | 0.017 |

$$\sigma_1^2 = \frac{0.837 + 0.007 + 1.256 + 0.105 + 0.017}{2.500} = \frac{2.222}{2.500} = 0.889$$

By symmetry of the data around 3.5, $\sigma_2^2 \approx 0.889$ as well.

Updated mixing weights:

$$\pi_1 = \frac{2.500}{5} = 0.500, \quad \pi_2 = 0.500$$

After one EM iteration:

| Parameter | Component 1 | Component 2 |
|---|---|---|
| $\mu_k$ | 1.915 | 5.085 |
| $\sigma_k^2$ | 0.889 | 0.889 |
| $\pi_k$ | 0.500 | 0.500 |

The means moved slightly: $\mu_1$ shifted from 2.0 to 1.915 (pulled toward the data at $x = 1$), and $\mu_2$ shifted from 5.0 to 5.085 (pulled toward $x = 6$). The variances decreased from 1.0 to 0.889 because the model is becoming more confident about which points belong where.
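The M-step can also be checked numerically, continuing from the same E-step. The results differ from the hand values in the third decimal place only because the table above rounded the responsibilities before summing:

```python
import numpy as np
from scipy.stats import norm

X = np.array([1.0, 2.0, 3.5, 5.0, 6.0])
pi = np.array([0.5, 0.5])
mu = np.array([2.0, 5.0])
sigma = np.array([1.0, 1.0])

# E-step: responsibilities for every point and component
numer = pi * norm.pdf(X[:, None], mu, sigma)
resp = numer / numer.sum(axis=1, keepdims=True)

# M-step: effective counts, weighted means, weighted variances, weights
Nk = resp.sum(axis=0)
mu_new = (resp * X[:, None]).sum(axis=0) / Nk
sigma2_new = (resp * (X[:, None] - mu_new) ** 2).sum(axis=0) / Nk
pi_new = Nk / len(X)

print(np.round(mu_new, 3))      # close to the hand values 1.915 and 5.085
print(np.round(sigma2_new, 3))  # close to 0.889, up to rounding
```

Everything here is just weighted averaging: the responsibilities act as fractional counts, so each formula is the ordinary sample statistic with those fractions as weights.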

Example 2: second EM iteration

Using the updated parameters, let’s run one more iteration.

E-step with new parameters

Now $\mu_1 = 1.915$, $\mu_2 = 5.085$, and $\sigma_1 = \sigma_2 = \sqrt{0.889} \approx 0.943$.

For $x = 1$:

$$\frac{(1 - 1.915)^2}{2 \times 0.889} = \frac{0.838}{1.778} = 0.471$$

$$\mathcal{N}(1 \mid 1.915, 0.889) \propto e^{-0.471} \approx 0.624$$

$$\frac{(1 - 5.085)^2}{2 \times 0.889} = \frac{16.688}{1.778} = 9.385$$

$$\mathcal{N}(1 \mid 5.085, 0.889) \propto e^{-9.385} \approx 0.0001$$

$$r_{1,1} \approx \frac{0.624}{0.624 + 0.0001} \approx 0.9998$$

For $x = 2$:

$$\frac{(2 - 1.915)^2}{2 \times 0.889} = \frac{0.007}{1.778} = 0.004$$

$$\mathcal{N} \propto e^{-0.004} \approx 0.996$$

$$\frac{(2 - 5.085)^2}{2 \times 0.889} = \frac{9.512}{1.778} = 5.350$$

$$\mathcal{N} \propto e^{-5.350} \approx 0.0047$$

$$r_{2,1} \approx \frac{0.996}{0.996 + 0.0047} \approx 0.995$$

For $x = 3.5$:

$$\frac{(3.5 - 1.915)^2}{2 \times 0.889} = \frac{2.512}{1.778} = 1.413$$

$$\mathcal{N} \propto e^{-1.413} \approx 0.244$$

$$\frac{(3.5 - 5.085)^2}{2 \times 0.889} = \frac{2.512}{1.778} = 1.413$$

$$\mathcal{N} \propto e^{-1.413} \approx 0.244$$

$$r_{3,1} = 0.500 \quad \text{(still exactly equal)}$$

Updated responsibilities after iteration 2:

| $x_i$ | $r_{i,1}$ | $r_{i,2}$ |
|---|---|---|
| 1 | 0.9998 | 0.0002 |
| 2 | 0.995 | 0.005 |
| 3.5 | 0.500 | 0.500 |
| 5 | 0.005 | 0.995 |
| 6 | 0.0002 | 0.9998 |

The responsibilities sharpened. Points 1 and 6 are now assigned almost entirely to their respective components. The model is becoming more confident with each iteration.

M-step update

$$N_1 = 0.9998 + 0.995 + 0.500 + 0.005 + 0.0002 = 2.500$$

$$\mu_1 = \frac{0.9998(1) + 0.995(2) + 0.500(3.5) + 0.005(5) + 0.0002(6)}{2.500} = \frac{4.765}{2.500} = 1.906$$

The mean barely moved ($1.915 \to 1.906$). Convergence is near.

Convergence

EM converges when the log-likelihood stops increasing (or the parameter changes fall below a threshold). A few key properties:

  • EM never decreases the log-likelihood. Each iteration either improves it or leaves it the same.
  • EM converges to a local maximum, not necessarily the global one. Multiple restarts with different initializations help.
  • Convergence can be slow near the maximum. The rate is linear, not quadratic like Newton’s method.
  • EM can get stuck if a component “collapses” onto a single data point, driving $\sigma_k \to 0$ and the likelihood to infinity. A common fix is to set a floor on the variance or add a small regularization term.
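The variance-floor fix from the last bullet is a one-line guard in the M-step. The floor value of `1e-6` here is an arbitrary choice that should be scaled to the data:

```python
import numpy as np

VAR_FLOOR = 1e-6  # arbitrary small floor; tune to the data's scale

def update_variance(X, resp_k, mu_k, Nk):
    """M-step variance update for one component, with a collapse guard."""
    sigma2 = np.sum(resp_k * (X - mu_k) ** 2) / Nk
    # If the weighted spread degenerates toward zero, clamp it
    return max(sigma2, VAR_FLOOR)
```

Without the guard, a component that claims a single point exactly would drive its variance to zero and blow up the likelihood; with it, EM simply carries a small but valid variance forward.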

GMM vs K-Means

| | K-Means | GMM |
|---|---|---|
| Assignments | Hard (0 or 1) | Soft (probabilities) |
| Cluster shape | Spherical | Elliptical (with full covariance) |
| Objective | WCSS | Log-likelihood |
| Algorithm | Lloyd’s algorithm | EM |
| Output | Cluster labels | Probabilities + density model |

GMMs are more flexible but have more parameters. K-Means can be seen as a limiting special case of EM for GMMs in which all covariances are $\sigma^2 I$ and responsibilities are forced to 0 or 1.

K-Means hard assignment vs GMM soft assignment:

graph TD
  P["Point sits between<br/>two cluster centers"] --> KM["K-Means:<br/>100% to nearest cluster"]
  P --> GMM["GMM:<br/>70% cluster A,<br/>30% cluster B"]
  KM --> KR["Hard boundary:<br/>point belongs to exactly<br/>one cluster"]
  GMM --> GR["Soft boundary:<br/>point has partial membership<br/>in multiple clusters"]

Higher dimensions

In $d$ dimensions, each component has a $d$-dimensional mean $\mu_k$ and a $d \times d$ covariance matrix $\Sigma_k$. The density becomes:

$$\mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k)\right)$$

The M-step for the covariance is:

$$\Sigma_k = \frac{1}{N_k}\sum_{i=1}^n r_{ik}(x_i - \mu_k)(x_i - \mu_k)^T$$

With a full covariance matrix, each component can model an elliptical cluster oriented in any direction. You can also restrict $\Sigma_k$ to be diagonal (axis-aligned ellipses) or spherical ($\sigma_k^2 I$) to reduce the number of parameters.
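A minimal sketch of that covariance update, assuming `X` is an (n, d) array and `resp_k` holds one component's responsibilities:

```python
import numpy as np

def mstep_covariance(X, resp_k, mu_k):
    """Weighted covariance update for one GMM component.

    X: (n, d) data, resp_k: (n,) responsibilities, mu_k: (d,) updated mean.
    """
    Nk = resp_k.sum()
    diff = X - mu_k  # (n, d) deviations from the component mean
    # Weighted sum of outer products diff_i diff_i^T, over the effective count
    return (resp_k[:, None] * diff).T @ diff / Nk
```

The matrix product `(resp_k[:, None] * diff).T @ diff` computes $\sum_i r_{ik}(x_i - \mu_k)(x_i - \mu_k)^T$ without an explicit loop over points.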

Implementation

```python
import numpy as np
from scipy.stats import norm

def gmm_em(X, K, max_iters=50, tol=1e-6):
    """Fit a 1D Gaussian mixture with EM.

    Returns means, variances, mixing weights, and final responsibilities.
    """
    n = len(X)
    # Initialize means by picking K distinct data points at random
    mu = np.random.choice(X, K, replace=False).astype(float)
    sigma2 = np.ones(K)
    pi = np.ones(K) / K

    for iteration in range(max_iters):
        # E-step: responsibilities via Bayes' theorem
        resp = np.zeros((n, K))
        for k in range(K):
            resp[:, k] = pi[k] * norm.pdf(X, mu[k], np.sqrt(sigma2[k]))
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: weighted counts, means, variances, mixing weights
        Nk = resp.sum(axis=0)
        mu_new = (resp * X[:, None]).sum(axis=0) / Nk
        sigma2_new = np.array([
            np.sum(resp[:, k] * (X - mu_new[k])**2) / Nk[k]
            for k in range(K)
        ])
        pi_new = Nk / n

        # Stop when the means stabilize
        if np.allclose(mu, mu_new, atol=tol):
            break
        mu, sigma2, pi = mu_new, sigma2_new, pi_new

    return mu, sigma2, pi, resp
```
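In practice you would usually reach for a library implementation rather than a teaching version like the one above. Assuming scikit-learn is installed, its `GaussianMixture` class fits the same five-point example; `random_state` pins the initialization:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([1.0, 2.0, 3.5, 5.0, 6.0]).reshape(-1, 1)  # sklearn expects 2D
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(np.sort(gm.means_.ravel()))  # one mean on each side of the gap
print(gm.predict_proba(X))         # soft assignments; each row sums to 1
```

`predict_proba` returns exactly the responsibilities $r_{ik}$, and scikit-learn's `reg_covar` parameter plays the role of the variance floor discussed in the convergence section.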

Summary

| Concept | Key idea |
|---|---|
| GMM | Model data as a mixture of $K$ Gaussians |
| Responsibility | Soft assignment: probability that point $i$ came from component $k$ |
| E-step | Compute responsibilities using current parameters |
| M-step | Update parameters using weighted statistics |
| Log-likelihood | EM increases it every iteration; converges to a local maximum |
| vs K-Means | GMM gives soft assignments and models elliptical clusters |

What comes next

Both K-Means and GMMs require choosing the number of clusters. How do you pick the right model complexity, not just for clustering but for any machine learning model? In the next post on model selection and cross-validation, you will learn systematic ways to compare models and tune hyperparameters.
