Gaussian mixture models and EM algorithm
In this series (18 parts)
- What is machine learning: a map of the field
- Data, features, and the ML pipeline
- Linear regression
- Bias, variance, and the tradeoff
- Regularization: Ridge, Lasso, and ElasticNet
- Logistic regression and classification
- Evaluation metrics for classification
- Naive Bayes classifier
- K-Nearest Neighbors
- Decision trees
- Ensemble methods: Bagging and Random Forests
- Boosting: AdaBoost and Gradient Boosting
- Support Vector Machines
- K-Means clustering
- Dimensionality Reduction: PCA
- Gaussian mixture models and EM algorithm
- Model selection and cross-validation
- Feature engineering and selection
K-Means assigns each point to exactly one cluster. That is a hard decision. But real data often sits between clusters, and pretending otherwise throws away useful information. Gaussian mixture models (GMMs) fix this by giving each point a probability of belonging to each cluster. The EM algorithm is how we fit these models.
Prerequisites: You should understand probability distributions (especially the Gaussian) and K-Means clustering.
Why GMMs and EM?
You measure the heights of 1000 people. The histogram shows two bumps, not one. Why? Because your sample contains two overlapping groups, and each group has its own typical height and spread.
Here is a small sample:
| Person | Height (cm) | Group |
|---|---|---|
| 1 | 162 | ? |
| 2 | 175 | ? |
| 3 | 158 | ? |
| 4 | 181 | ? |
| 5 | 168 | ? |
| 6 | 155 | ? |
| 7 | 178 | ? |
| 8 | 171 | ? |
The group labels are unknown. Some heights clearly belong to the shorter group (155, 158). Others clearly belong to the taller group (178, 181). But what about 168 or 171? They sit right in the overlap zone. Assigning them to one group with 100% confidence would be wrong. They probably belong partially to both.
A Gaussian mixture model says your data came from K bell curves mixed together. Each bell curve has its own center (mean), width (variance), and relative size (mixing weight). The EM algorithm figures out which curve most likely generated each point.
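To see where the two bumps come from, here is a quick simulation sketch. The specific numbers (group means of 160 cm and 176 cm, standard deviation 5, an even 50/50 mix, and the random seed) are illustrative assumptions, not real measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture: shorter group N(160, 5^2), taller group N(176, 5^2),
# mixed 50/50 -- the same setup as the height example above.
n = 1000
from_short = rng.random(n) < 0.5
heights = np.where(from_short,
                   rng.normal(160, 5, n),   # shorter group
                   rng.normal(176, 5, n))   # taller group

# The combined histogram is bimodal: counts pile up near both means
# with a dip in the overlap zone around 166-170 cm.
counts, edges = np.histogram(heights, bins=np.arange(140, 196, 2))
left = edges[:-1]
peak_short = counts[(left >= 155) & (left < 165)].max()
peak_tall = counts[(left >= 171) & (left < 181)].max()
valley = counts[(left >= 166) & (left < 170)].min()
print(peak_short > valley, peak_tall > valley)  # two bumps, dip between
```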
EM algorithm: the iterative loop:
graph LR
A["Guess initial<br/>parameters"] --> B["Assign: compute<br/>soft memberships"]
B --> C["Update: recalculate<br/>means and variances"]
C --> D{"Converged?"}
D -->|No| B
D -->|Yes| E["Final model"]
Mixture concept: two overlapping bell curves:
graph TD
subgraph mixture["Observed Data"]
G1["Bell curve 1:<br/>shorter group<br/>mean=160, sd=5"]
G2["Bell curve 2:<br/>taller group<br/>mean=176, sd=5"]
end
G1 --> overlap["Overlap zone:<br/>heights 165 to 172<br/>could belong to either group"]
G2 --> overlap
overlap --> observed["Combined histogram<br/>shows two bumps"]
Two overlapping Gaussian distributions forming a mixture
EM starts with a rough guess for each bell curve’s parameters. Then it alternates: first compute how strongly each point belongs to each curve (soft assignment), then update each curve’s parameters using those soft assignments as weights. Each cycle improves the fit. After enough cycles, the parameters stabilize and you have your model.
The key difference from K-Means: EM gives probabilities, not hard labels. A point can be 70% group 1 and 30% group 2. This captures the uncertainty in the overlap region.
Now let’s write this down formally.
What is a Gaussian mixture model?
A GMM says: your data was generated by $K$ Gaussian distributions, each with its own mean $\mu_k$, variance $\sigma_k^2$ (or covariance matrix $\Sigma_k$ in higher dimensions), and a mixing weight $\pi_k$. The mixing weights satisfy $\pi_k \ge 0$ and $\sum_{k=1}^{K} \pi_k = 1$.
The probability density of a data point $x$ is:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2)$$

where $\mathcal{N}(x \mid \mu, \sigma^2)$ is the Gaussian density:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Each point comes from one of the components, but we do not know which one. The component labels are latent variables, hidden quantities that we need to infer.
Soft assignments vs hard assignments
In K-Means, point $i$ belongs to cluster $k$ or it does not. In a GMM, we compute the responsibility of each component for each point:

$$r_{ik} = p(z_i = k \mid x_i) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}$$

This is just Bayes’ theorem. The responsibility $r_{ik}$ is a number between 0 and 1, and $\sum_{k=1}^{K} r_{ik} = 1$ for each point. A point can be 80% component 1 and 20% component 2.
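As a quick numeric sketch of the responsibility computation, take a height of 166 cm with the illustrative parameters from the opening example (means 160 and 176, standard deviation 5, equal weights); all numbers here are assumptions for illustration:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative parameters: equal weights, N(160, 5^2) and N(176, 5^2)
pi = np.array([0.5, 0.5])
mu = np.array([160.0, 176.0])
sigma = np.array([5.0, 5.0])

x = 166.0                              # a point in the overlap zone
weighted = pi * gaussian_pdf(x, mu, sigma)
resp = weighted / weighted.sum()       # Bayes' theorem: normalize over components
print(resp)                            # roughly [0.78, 0.22]; sums to 1
```

The point is closer to the shorter group's mean, so it gets most, but not all, of its membership there.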
Soft cluster assignments: opacity reflects membership probability
The EM algorithm
We cannot directly maximize the log-likelihood of a mixture model because the latent variables couple the parameters together. The EM algorithm sidesteps this by iterating between two steps:
E-step (Expectation): Given current parameters $\{\pi_k, \mu_k, \sigma_k^2\}_{k=1}^{K}$, compute responsibilities:

$$r_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}$$
M-step (Maximization): Given responsibilities, update parameters:

$$N_k = \sum_{i=1}^{n} r_{ik}, \qquad \mu_k = \frac{1}{N_k} \sum_{i=1}^{n} r_{ik} \, x_i, \qquad \sigma_k^2 = \frac{1}{N_k} \sum_{i=1}^{n} r_{ik} \, (x_i - \mu_k)^2, \qquad \pi_k = \frac{N_k}{n}$$

$N_k$ is the “effective number of points” assigned to component $k$. The M-step formulas are weighted versions of the sample mean and variance, with the responsibilities as weights.
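A minimal NumPy sketch of these update formulas for a single component $k$; the data points and responsibilities below are made up for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.5, 5.0, 6.0])       # toy data
r_k = np.array([0.9, 0.8, 0.5, 0.1, 0.05])    # hypothetical soft assignments

N_k = r_k.sum()                                  # effective number of points
mu_k = (r_k * x).sum() / N_k                     # weighted mean
sigma2_k = (r_k * (x - mu_k) ** 2).sum() / N_k   # weighted variance
pi_k = N_k / len(x)                              # mixing weight

print(N_k, mu_k, sigma2_k, pi_k)
```

With hard 0/1 responsibilities these formulas reduce to the ordinary per-cluster mean and variance, which is exactly what K-Means computes.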
graph LR
A[Initialize parameters] --> B[E-step: compute responsibilities]
B --> C[M-step: update parameters]
C --> D{Converged?}
D -->|No| B
D -->|Yes| E[Done]
Log-likelihood
The log-likelihood for a GMM with $n$ data points is:

$$\log L = \sum_{i=1}^{n} \log\!\left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \right)$$
EM is guaranteed to increase $\log L$ at every iteration (or leave it unchanged at convergence). It will not decrease. However, like K-Means, EM only finds a local maximum, so initialization matters.
A common initialization strategy is to run K-Means first, use the resulting cluster assignments to set the initial $\mu_k$ and $\sigma_k^2$, and set each $\pi_k$ to the fraction of points assigned to cluster $k$.
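A sketch of that initialization, assuming you already have hard cluster labels (the labels below are hypothetical stand-ins for a K-Means result):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.5, 5.0, 6.0])
labels = np.array([0, 0, 0, 1, 1])   # hypothetical K-Means assignments
K = 2

# Per-cluster mean and variance; cluster size fraction as the mixing weight
mu0 = np.array([x[labels == k].mean() for k in range(K)])
sigma2_0 = np.array([x[labels == k].var() for k in range(K)])
pi0 = np.array([(labels == k).mean() for k in range(K)])
print(mu0, sigma2_0, pi0)
```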
Detailed E-step and M-step breakdown:
graph TD
subgraph estep["E-step: Assign Responsibilities"]
E1["For each point and<br/>each component"] --> E2["Compute how likely this<br/>component generated this point"]
E2 --> E3["Normalize to get<br/>responsibility r_ik"]
end
subgraph mstep["M-step: Update Parameters"]
M1["New mean: weighted<br/>average of all points"] --> M2["New variance: weighted<br/>spread around new mean"]
M2 --> M3["New mixing weight:<br/>fraction of total responsibility"]
end
estep --> mstep
mstep -->|"Repeat until<br/>parameters stabilize"| estep
Example 1: one full EM iteration on 1D data
Consider five data points, $x = \{1, 2, 3.5, 5, 6\}$, and a mixture of two Gaussians ($K = 2$).
Initial parameters:
| Parameter | Component 1 | Component 2 |
|---|---|---|
| $\mu_k$ | 2.0 | 5.0 |
| $\sigma_k^2$ | 1.0 | 1.0 |
| $\pi_k$ | 0.5 | 0.5 |
E-step: compute responsibilities
For each point, we need $r_{i1}$ and $r_{i2}$. With $\pi_1 = \pi_2 = 0.5$, the mixing weights cancel and

$$r_{i1} = \frac{\mathcal{N}(x_i \mid 2, 1)}{\mathcal{N}(x_i \mid 2, 1) + \mathcal{N}(x_i \mid 5, 1)}$$

For $x = 1$: the densities are proportional to $e^{-(1-2)^2/2} = e^{-0.5} \approx 0.607$ and $e^{-(1-5)^2/2} = e^{-8} \approx 0.0003$, so $r_{11} \approx 0.999$, $r_{12} \approx 0.001$.
For $x = 2$: $e^{0} = 1$ versus $e^{-4.5} \approx 0.011$, so $r_{21} \approx 0.989$, $r_{22} \approx 0.011$.
For $x = 3.5$:
Both densities are equal (the point is equidistant from both means), so $r_{31} = r_{32} = 0.5$.
For $x = 5$: By symmetry with $x = 2$: $r_{41} \approx 0.011$, $r_{42} \approx 0.989$.
For $x = 6$: By symmetry with $x = 1$: $r_{51} \approx 0.001$, $r_{52} \approx 0.999$.
Summary of responsibilities:
| $x_i$ | $r_{i1}$ | $r_{i2}$ |
|---|---|---|
| 1 | 0.999 | 0.001 |
| 2 | 0.989 | 0.011 |
| 3.5 | 0.500 | 0.500 |
| 5 | 0.011 | 0.989 |
| 6 | 0.001 | 0.999 |
Notice: points near a component’s mean have high responsibility for that component. The middle point at $x = 3.5$ is split equally.
M-step: update parameters
Effective counts: $N_1 = 0.999 + 0.989 + 0.5 + 0.011 + 0.001 = 2.5$, and by the symmetry of the data, $N_2 = 2.5$.
Updated means: $\mu_1^{\text{new}} = \frac{1}{2.5}(0.999 \cdot 1 + 0.989 \cdot 2 + 0.5 \cdot 3.5 + 0.011 \cdot 5 + 0.001 \cdot 6) = \frac{4.788}{2.5} \approx 1.915$, and by symmetry $\mu_2^{\text{new}} \approx 5.085$.
Updated variances: $\sigma_1^{2,\text{new}} = \frac{1}{N_1} \sum_i r_{i1} (x_i - \mu_1^{\text{new}})^2$. The per-point contributions for component 1:
| $x_i$ | $(x_i - 1.915)^2$ | $r_{i1} (x_i - 1.915)^2$ |
|---|---|---|
| 1 | 0.838 | 0.837 |
| 2 | 0.007 | 0.007 |
| 3.5 | 2.512 | 1.256 |
| 5 | 9.512 | 0.105 |
| 6 | 16.688 | 0.017 |
Summing the last column: $0.837 + 0.007 + 1.256 + 0.105 + 0.017 = 2.222$, so $\sigma_1^{2,\text{new}} = 2.222 / 2.5 \approx 0.889$. By symmetry of the data around 3.5, $\sigma_2^{2,\text{new}} \approx 0.889$ as well.
Updated mixing weights: $\pi_1^{\text{new}} = N_1 / n = 2.5 / 5 = 0.5$, and likewise $\pi_2^{\text{new}} = 0.5$.
After one EM iteration:
| Parameter | Component 1 | Component 2 |
|---|---|---|
| $\mu_k$ | 1.915 | 5.085 |
| $\sigma_k^2$ | 0.889 | 0.889 |
| $\pi_k$ | 0.5 | 0.5 |
The means moved slightly: $\mu_1$ shifted from 2.0 to 1.915 (pulled toward the data at $x = 1$), and $\mu_2$ shifted from 5.0 to 5.085 (pulled toward $x = 6$). The variances decreased from 1.0 to 0.889 because the model is becoming more confident about which points belong where.
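The iteration above can be verified in a few lines of NumPy (expect small differences from the hand numbers, since those were rounded to three decimals):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.5, 5.0, 6.0])
mu = np.array([2.0, 5.0])
sigma2 = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

# E-step: responsibilities via Bayes' theorem
dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
resp = dens / dens.sum(axis=1, keepdims=True)

# M-step: weighted updates
Nk = resp.sum(axis=0)
mu_new = (resp * x[:, None]).sum(axis=0) / Nk
sigma2_new = (resp * (x[:, None] - mu_new) ** 2).sum(axis=0) / Nk
pi_new = Nk / len(x)

print(mu_new)      # close to [1.915, 5.085]
print(sigma2_new)  # close to [0.886, 0.886]
print(pi_new)      # [0.5, 0.5]
```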
Example 2: second EM iteration
Using the updated parameters, let’s run one more iteration.
E-step with new parameters
Now $\mu_1 = 1.915$, $\mu_2 = 5.085$, $\sigma_1^2 = \sigma_2^2 = 0.889$, and $\pi_1 = \pi_2 = 0.5$.
For $x = 1$: the exponents are $-(1-1.915)^2 / (2 \cdot 0.889) \approx -0.471$ and $-(1-5.085)^2 / (2 \cdot 0.889) \approx -9.39$, so $r_{11} \approx 0.9998$, $r_{12} \approx 0.0002$.
For $x = 2$: $e^{-0.004}$ versus $e^{-5.35}$ gives $r_{21} \approx 0.995$, $r_{22} \approx 0.005$.
For $x = 3.5$: still equidistant from both means ($3.5 - 1.915 = 5.085 - 3.5 = 1.585$), so $r_{31} = r_{32} = 0.5$.
Updated responsibilities after iteration 2:
| $x_i$ | $r_{i1}$ | $r_{i2}$ |
|---|---|---|
| 1 | 0.9998 | 0.0002 |
| 2 | 0.995 | 0.005 |
| 3.5 | 0.500 | 0.500 |
| 5 | 0.005 | 0.995 |
| 6 | 0.0002 | 0.9998 |
The responsibilities sharpened. Points 1 and 6 are now assigned almost entirely to their respective components. The model is becoming more confident with each iteration.
M-step update
Running the M-step with these sharpened responsibilities gives $\mu_1 \approx 1.906$. The mean barely moved (a shift of about 0.009, compared with 0.085 in the first iteration). Convergence is near.
Convergence
EM converges when the log-likelihood stops increasing (or the parameter changes fall below a threshold). A few key properties:
- EM never decreases the log-likelihood. Each iteration either improves it or leaves it the same.
- EM converges to a local maximum, not necessarily the global one. Multiple restarts with different initializations help.
- Convergence can be slow near the maximum. The rate is linear, not quadratic like Newton’s method.
- EM can get stuck if a component “collapses” onto a single data point, driving $\sigma_k^2 \to 0$ and the likelihood to infinity. A common fix is to set a floor on the variance or add a small regularization term.
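A minimal sketch of the variance-floor fix; the floor value is an illustrative assumption and should be scaled to your data:

```python
import numpy as np

VAR_FLOOR = 1e-6  # illustrative floor; prevents sigma^2 -> 0

def m_step_variance(x, resp_k, mu_k):
    """Weighted variance update with a floor to prevent component collapse."""
    N_k = resp_k.sum()
    sigma2_k = (resp_k * (x - mu_k) ** 2).sum() / N_k
    return max(sigma2_k, VAR_FLOOR)

# A component that collapses onto one point would report variance 0,
# sending its density (and the likelihood) to infinity; the floor blocks this.
x = np.array([1.0, 2.0, 3.5, 5.0, 6.0])
resp_k = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # all responsibility on x = 3.5
print(m_step_variance(x, resp_k, 3.5))  # VAR_FLOOR instead of 0
```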
GMM vs K-Means
| | K-Means | GMM |
|---|---|---|
| Assignments | Hard (0 or 1) | Soft (probabilities) |
| Cluster shape | Spherical | Elliptical (with full covariance) |
| Objective | WCSS | Log-likelihood |
| Algorithm | Lloyd’s algorithm | EM |
| Output | Cluster labels | Probabilities + density model |
GMMs are more flexible but have more parameters. K-Means can be seen as a limiting special case of EM for GMMs: fix all covariances to $\epsilon I$ and let $\epsilon \to 0$, and the responsibilities are forced to 0 or 1.
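You can watch soft assignments harden into K-Means-style hard labels as the shared variance shrinks. A small sketch with one point sitting between two means:

```python
import numpy as np

x, mu = 3.0, np.array([2.0, 5.0])   # point closer to the first mean

# Equal mixing weights cancel; shrink the shared variance and watch
# the responsibility of the nearer component approach 1 (a hard label).
for sigma2 in [4.0, 1.0, 0.1, 0.01]:
    dens = np.exp(-0.5 * (x - mu) ** 2 / sigma2)
    resp = dens / dens.sum()
    print(sigma2, resp[0])
```

With a large variance the point is genuinely shared; by $\sigma^2 = 0.01$ it belongs almost entirely to the nearest component, which is exactly K-Means behavior.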
K-Means hard assignment vs GMM soft assignment:
graph TD
P["Point sits between<br/>two cluster centers"] --> KM["K-Means:<br/>100% to nearest cluster"]
P --> GMM["GMM:<br/>70% cluster A,<br/>30% cluster B"]
KM --> KR["Hard boundary:<br/>point belongs to exactly<br/>one cluster"]
GMM --> GR["Soft boundary:<br/>point has partial membership<br/>in multiple clusters"]
Higher dimensions
In $d$ dimensions, each component has a $d$-dimensional mean $\boldsymbol{\mu}_k$ and a $d \times d$ covariance matrix $\Sigma_k$. The density becomes:

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^\top \Sigma_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right)$$

The M-step for the covariance is:

$$\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{n} r_{ik} \, (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top$$
With a full covariance matrix, each component can model an elliptical cluster oriented in any direction. You can also restrict $\Sigma_k$ to be diagonal (axis-aligned ellipses) or spherical ($\Sigma_k = \sigma_k^2 I$) to reduce the number of parameters.
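A NumPy sketch of this covariance update, with toy 2-D data and made-up responsibilities:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.5], [5.0, 6.0], [6.0, 5.5]])  # toy 2-D data
r_k = np.array([0.9, 0.8, 0.1, 0.05])        # hypothetical responsibilities

N_k = r_k.sum()
mu_k = (r_k[:, None] * X).sum(axis=0) / N_k  # weighted mean vector
diff = X - mu_k
# Weighted sum of outer products: the full-covariance M-step
Sigma_k = (r_k[:, None, None] * (diff[:, :, None] * diff[:, None, :])).sum(axis=0) / N_k

print(Sigma_k.shape)                    # (2, 2)
print(np.allclose(Sigma_k, Sigma_k.T))  # symmetric, as a covariance must be
```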
Implementation
import numpy as np
from scipy.stats import norm
def gmm_em(X, K, max_iters=50, tol=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns (mu, sigma2, pi, resp)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Initialize: K distinct data points as means, unit variances, uniform weights
    mu = np.random.choice(X, K, replace=False)
    sigma2 = np.ones(K)
    pi = np.ones(K) / K
    for iteration in range(max_iters):
        # E-step: resp[i, k] proportional to pi_k * N(x_i | mu_k, sigma2_k)
        resp = np.zeros((n, K))
        for k in range(K):
            resp[:, k] = pi[k] * norm.pdf(X, mu[k], np.sqrt(sigma2[k]))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted mean, variance, and mixing weight per component
        Nk = resp.sum(axis=0)
        mu_new = (resp * X[:, None]).sum(axis=0) / Nk
        sigma2_new = np.array([
            np.sum(resp[:, k] * (X - mu_new[k])**2) / Nk[k]
            for k in range(K)
        ])
        pi_new = Nk / n
        # Check convergence on the means, then keep the updated parameters
        converged = np.allclose(mu, mu_new, atol=tol)
        mu, sigma2, pi = mu_new, sigma2_new, pi_new
        if converged:
            break
    return mu, sigma2, pi, resp
Summary
| Concept | Key idea |
|---|---|
| GMM | Model data as a mixture of Gaussians |
| Responsibility | Soft assignment: probability that point $i$ came from component $k$ |
| E-step | Compute responsibilities using current parameters |
| M-step | Update parameters using weighted statistics |
| Log-likelihood | EM increases it every iteration; converges to local max |
| vs K-Means | GMM gives soft assignments and models elliptical clusters |
What comes next
Both K-Means and GMMs require choosing the number of clusters. How do you pick the right model complexity, not just for clustering but for any machine learning model? In the next post on model selection and cross-validation, you will learn systematic ways to compare models and tune hyperparameters.