Information theory: entropy, KL divergence, cross-entropy
In this series (15 parts)
- Why Maths Matters for ML: A Practical Overview
- Scalars, Vectors, and Vector Spaces
- Matrices and Matrix Operations
- Matrix Inverses and Systems of Linear Equations
- Eigenvalues and Eigenvectors
- Matrix Decompositions: LU, QR, SVD
- Norms, Distances, and Similarity
- Calculus Review: Derivatives and the Chain Rule
- Partial Derivatives and Gradients
- The Jacobian and Hessian Matrices
- Taylor series and local approximations
- Probability fundamentals
- Random variables and distributions
- Bayes theorem and its role in ML
- Information theory: entropy, KL divergence, cross-entropy
Cross-entropy loss is the default loss function for classification in ML. But where does it come from? The answer lies in information theory, a field that Claude Shannon created in 1948 to study communication. Its core ideas (entropy, KL divergence, and cross-entropy) turn out to be exactly what you need to measure how well a model's predicted distribution matches the true distribution of the data.
Prerequisites
You should be familiar with random variables, distributions, and expectation before reading this.
Information content (surprise)
Start with a single event. If an event has probability $p(x)$, its information content (or "surprise") is:

$$I(x) = -\log_2 p(x)$$

Why the negative log? Three reasons:
- Certain events ($p(x) = 1$) carry zero information: $-\log_2 1 = 0$
- Rare events carry more information: as $p(x) \to 0$, $-\log_2 p(x) \to \infty$
- Independent events have additive information: $-\log_2 [p(x)\,p(y)] = I(x) + I(y)$
Example: A fair coin flip has $p = 0.5$, so $I = -\log_2 0.5 = 1$ bit. That is one bit of information, which makes intuitive sense because you need exactly one binary digit to encode heads or tails.
If the coin is biased with $p(\text{heads}) = 0.9$, then heads carries $-\log_2 0.9 \approx 0.15$ bits (not very surprising) and tails carries $-\log_2 0.1 \approx 3.32$ bits (very surprising).
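These values are easy to check numerically. A quick sketch, assuming a biased coin with $p(\text{heads}) = 0.9$ as in the example above:

```python
import math

def info_bits(p):
    """Information content (surprise) of an event with probability p, in bits."""
    return -math.log2(p)

print(info_bits(0.5))  # fair coin: 1 bit per flip
print(info_bits(0.9))  # biased heads: ~0.152 bits
print(info_bits(0.1))  # biased tails: ~3.322 bits
```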
Shannon entropy
Entropy is the expected information content, the average surprise:

$$H(P) = \mathbb{E}[I(x)] = -\sum_x p(x) \log_2 p(x)$$
For continuous distributions, replace the sum with an integral (called “differential entropy”). In ML, we often use natural log () instead of , which gives entropy in “nats” instead of “bits.” The math is identical up to a constant factor.
Entropy measures the inherent uncertainty in a distribution. High entropy means the distribution is spread out (hard to predict). Low entropy means the distribution is concentrated (easy to predict).
Example 1: Computing entropy
Distribution 1: A fair die, $p(i) = 1/6$ for $i = 1, \dots, 6$.
Step by step:

$$H = -\sum_{i=1}^{6} \frac{1}{6} \log_2 \frac{1}{6} = \log_2 6 \approx 2.585 \text{ bits}$$

Distribution 2: A loaded die with $p(6) = 0.5$ and $p(i) = 0.1$ for $i = 1, \dots, 5$.

$$H = -(0.5 \log_2 0.5 + 5 \times 0.1 \log_2 0.1) = 0.5 + 1.661 \approx 2.161 \text{ bits}$$

Distribution 3: A degenerate die that always shows 6. $p(6) = 1$, all others $0$.

$$H = -1 \log_2 1 = 0 \text{ bits}$$

The fair die has the highest entropy (most uncertain), and the degenerate die has zero entropy (no uncertainty at all). This is a general principle: among all distributions on $n$ outcomes, the uniform distribution has maximum entropy $\log_2 n$.
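The three entropies can be reproduced with a few lines of standard-library Python, assuming the loaded die puts probability 0.5 on six and 0.1 on each other face:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair = [1 / 6] * 6
loaded = [0.1] * 5 + [0.5]
degenerate = [0, 0, 0, 0, 0, 1]

print(f"{entropy_bits(fair):.3f}")        # ~2.585 bits
print(f"{entropy_bits(loaded):.3f}")      # ~2.161 bits
print(f"{entropy_bits(degenerate):.3f}")  # 0.000 bits
```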
Comparing entropy levels across distributions:
```mermaid
graph LR
    A["Certain: 0 bits"] -->|"more uncertain"| B["Biased coin: 0.47 bits"]
    B -->|"more uncertain"| C["Loaded die: 2.16 bits"]
    C -->|"more uncertain"| D["Fair die: 2.58 bits"]
```
KL divergence
The Kullback-Leibler divergence measures how different one distribution $P$ is from a reference distribution $Q$:

$$D_{KL}(P \parallel Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

Think of it as: "How many extra bits do I need if I use distribution $Q$ to encode data that actually follows distribution $P$?"
Key properties:
- $D_{KL}(P \parallel Q) \geq 0$ always (Gibbs' inequality)
- $D_{KL}(P \parallel Q) = 0$ if and only if $P = Q$
- Not symmetric: $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$ in general
Because it is not symmetric, KL divergence is not a true “distance.” But it is still incredibly useful as a measure of distributional mismatch.
KL divergence is not symmetric:
```mermaid
graph LR
    P["Distribution P"] -->|"D_KL = 0.0253 nats"| Q["Distribution Q"]
    Q -->|"D_KL = 0.0258 nats"| P
```
Example 2: Computing KL divergence
Let $P$ and $Q$ be distributions over three outcomes:

| Outcome | $P$ | $Q$ |
|---|---|---|
| A | 0.5 | 0.4 |
| B | 0.3 | 0.4 |
| C | 0.2 | 0.2 |
Compute $D_{KL}(P \parallel Q)$:
We will use natural log (nats) here, as ML typically does.
Compute each term:

$$0.5 \ln \frac{0.5}{0.4} = 0.5 \times 0.2231 = 0.1116$$
$$0.3 \ln \frac{0.3}{0.4} = 0.3 \times (-0.2877) = -0.0863$$
$$0.2 \ln \frac{0.2}{0.2} = 0$$
$$D_{KL}(P \parallel Q) = 0.1116 - 0.0863 + 0 \approx 0.0253 \text{ nats}$$

Now compute $D_{KL}(Q \parallel P)$ to see the asymmetry:

$$D_{KL}(Q \parallel P) = 0.4 \ln \frac{0.4}{0.5} + 0.4 \ln \frac{0.4}{0.3} + 0.2 \ln \frac{0.2}{0.2} = -0.0893 + 0.1151 + 0 \approx 0.0258 \text{ nats}$$

The two directions give different values: 0.0253 vs 0.0258. They are close in this case, but they can differ substantially for distributions that are further apart.
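Both directions can be checked in a few lines:

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
print(f"{kl(P, Q):.4f}")  # 0.0253
print(f"{kl(Q, P):.4f}")  # 0.0258
```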
Cross-entropy
Cross-entropy combines entropy and KL divergence:

$$H(P, Q) = -\sum_x p(x) \log q(x)$$

The relationship is:

$$H(P, Q) = H(P) + D_{KL}(P \parallel Q)$$

Since $H(P)$ is fixed (it depends only on the true distribution), minimizing the cross-entropy with respect to $Q$ is the same as minimizing the KL divergence. This is why cross-entropy is the loss function of choice: minimizing cross-entropy loss makes your model's distribution $Q$ as close as possible to the true distribution $P$.
Cross-entropy in the ML training pipeline:
```mermaid
graph LR
    A["True labels P"] --> D["Cross-entropy loss H(P,Q)"]
    B["Model predictions Q"] --> D
    D --> E["Loss value"]
    E --> F["Backpropagate gradients"]
    F --> G["Update model weights"]
    G --> B
```
Example 3: Cross-entropy is at least as large as entropy
Using the distributions from Example 2:
Entropy of $P$:

$$H(P) = -(0.5 \ln 0.5 + 0.3 \ln 0.3 + 0.2 \ln 0.2) \approx 0.3466 + 0.3612 + 0.3219 = 1.0297 \text{ nats}$$

Cross-entropy $H(P, Q)$:

$$H(P, Q) = -(0.5 \ln 0.4 + 0.3 \ln 0.4 + 0.2 \ln 0.2) \approx 0.4581 + 0.2749 + 0.3219 = 1.0549 \text{ nats}$$

Verify the relationship:

$$H(P) + D_{KL}(P \parallel Q) \approx 1.0297 + 0.0253 = 1.0550 \approx H(P, Q)$$

(Tiny difference due to rounding.)
And since $D_{KL}(P \parallel Q) \geq 0$, we always have $H(P, Q) \geq H(P)$. Cross-entropy is minimized when $Q = P$, at which point cross-entropy equals entropy. This is the theoretical lower bound.
Cross-entropy loss in classification
For a classifier with $K$ classes, the true label is a one-hot vector $\mathbf{y}$ (all zeros except a 1 for the correct class) and the model outputs probabilities $\mathbf{q}$:

$$L = -\sum_{k=1}^{K} y_k \log q_k$$

Since $\mathbf{y}$ is one-hot (say class $c$ is correct, so $y_c = 1$ and all others are 0):

$$L = -\log q_c$$

This is just the negative log of the model's predicted probability for the correct class. If the model is confident and correct ($q_c \approx 1$), the loss is near 0. If the model assigns low probability to the correct class ($q_c \approx 0$), the loss blows up toward infinity.
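As a sanity check, here is the one-hot reduction in code (the class probabilities are made up for illustration):

```python
import math

def categorical_ce(y_onehot, q):
    """Cross-entropy between a one-hot label and predicted probabilities, in nats."""
    return -sum(y * math.log(qk) for y, qk in zip(y_onehot, q) if y > 0)

q = [0.1, 0.7, 0.2]  # model's predicted class probabilities
y = [0, 1, 0]        # true class is index 1
print(f"{categorical_ce(y, q):.4f}")  # equals -ln(0.7), about 0.3567
```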
For binary classification with $y \in \{0, 1\}$ and predicted probability $\hat{q} = P(y = 1)$:

$$L = -\left[ y \log \hat{q} + (1 - y) \log (1 - \hat{q}) \right]$$

This is the binary cross-entropy (log loss). It is the standard loss for logistic regression.
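A minimal implementation, with two illustrative predictions:

```python
import math

def bce(y, q):
    """Binary cross-entropy (log loss) in nats, for true label y in {0, 1}."""
    return -(y * math.log(q) + (1 - y) * math.log(1 - q))

print(f"{bce(1, 0.99):.4f}")  # confident and correct: ~0.0101
print(f"{bce(1, 0.01):.4f}")  # confident and wrong: ~4.6052
```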
Connection to maximum likelihood
Cross-entropy loss and maximum likelihood estimation (MLE) are the same thing in disguise.
Given data points $(x_1, y_1), \dots, (x_n, y_n)$ and a model with parameters $\theta$ that outputs $q_\theta(y \mid x)$, the log-likelihood is:

$$\log L(\theta) = \sum_{i=1}^{n} \log q_\theta(y_i \mid x_i)$$

Maximizing the log-likelihood is equivalent to minimizing:

$$-\frac{1}{n} \sum_{i=1}^{n} \log q_\theta(y_i \mid x_i)$$

That is exactly the average cross-entropy between the empirical data distribution and the model's predictions. So when you minimize cross-entropy loss, you are doing MLE.
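The equivalence is easy to see numerically. A sketch with a made-up binary dataset (labels and predictions are arbitrary):

```python
import math

# Toy binary dataset: observed labels and the model's predicted P(y=1 | x_i).
labels = [1, 0, 1, 1]
preds = [0.8, 0.3, 0.6, 0.9]

# Average negative log-likelihood of the observed labels under the model:
nll = -sum(math.log(q if y == 1 else 1 - q)
           for y, q in zip(labels, preds)) / len(labels)

# Average binary cross-entropy loss over the same data:
avg_ce = -sum(y * math.log(q) + (1 - y) * math.log(1 - q)
              for y, q in zip(labels, preds)) / len(labels)

print(f"NLL = {nll:.4f}, avg cross-entropy = {avg_ce:.4f}")  # identical
```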
Add a prior and you get MAP estimation, which corresponds to cross-entropy plus regularization.
When to use which loss
| Loss | Formula | When to use |
|---|---|---|
| Cross-entropy | $-\sum_k y_k \log q_k$ | Classification (softmax output) |
| Binary cross-entropy | $-[y \log \hat{q} + (1-y) \log(1-\hat{q})]$ | Binary classification |
| Mean squared error | $\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$ | Regression (Gaussian likelihood) |
MSE corresponds to MLE under Gaussian noise assumptions. Cross-entropy corresponds to MLE for categorical distributions.
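A quick numeric check of the Gaussian claim, with fixed noise scale and hypothetical data:

```python
import math

# Under y_i ~ N(mu_i, sigma^2), the negative log-likelihood of the data is
#   n/2 * ln(2*pi*sigma^2) + SSE / (2*sigma^2),
# so for fixed sigma it differs from the sum of squared errors (SSE)
# only by a positive scale and an additive constant.
ys = [1.2, 0.7, 2.1]   # hypothetical targets
mus = [1.0, 1.0, 2.0]  # hypothetical model predictions
sigma = 1.0

sse = sum((y - m) ** 2 for y, m in zip(ys, mus))
nll = sum(0.5 * math.log(2 * math.pi * sigma ** 2) + (y - m) ** 2 / (2 * sigma ** 2)
          for y, m in zip(ys, mus))
const = len(ys) * 0.5 * math.log(2 * math.pi * sigma ** 2)

print(math.isclose(nll, const + sse / (2 * sigma ** 2)))  # True
```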
Entropy in ML beyond loss functions
Entropy shows up in several other places:
- Decision trees use entropy (or the related Gini impurity) to choose splits. A split is good if it reduces the entropy of the label distribution in each child node.
- Mutual information measures how much knowing one variable reduces uncertainty about another. It is used in feature selection.
- The Gaussian has maximum entropy among all distributions with a given mean and variance. This justifies assuming Gaussian noise when you have no other information; it is the “least committed” assumption.
- Variational autoencoders use KL divergence to regularize the latent space, pushing the encoder’s distribution toward a standard Gaussian.
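To make the decision-tree bullet concrete, here is an information-gain calculation for a hypothetical split (the class counts are made up):

```python
import math

def entropy_bits(probs):
    """Shannon entropy of a label distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Parent node: 10 positives, 10 negatives -> 1 bit of label entropy.
parent = entropy_bits([0.5, 0.5])

# A candidate split sends 8 pos + 2 neg left, 2 pos + 8 neg right.
left = entropy_bits([0.8, 0.2])
right = entropy_bits([0.2, 0.8])
children = 0.5 * left + 0.5 * right  # weighted by node sizes (10/20 each)

print(f"information gain = {parent - children:.4f} bits")  # ~0.2781
```

A split with higher information gain leaves the child nodes' label distributions more concentrated, which is exactly what the entropy criterion rewards.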
Python: computing entropy and cross-entropy
```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats."""
    p = np.array(p)
    # Avoid log(0) by filtering out zero-probability outcomes
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats."""
    p, q = np.array(p), np.array(q)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    """KL divergence D_KL(P || Q) in nats."""
    return cross_entropy(p, q) - entropy(p)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

print(f"H(P) = {entropy(P):.4f} nats")
print(f"H(P, Q) = {cross_entropy(P, Q):.4f} nats")
print(f"D_KL(P || Q) = {kl_divergence(P, Q):.4f} nats")
print(f"H(P) + D_KL = {entropy(P) + kl_divergence(P, Q):.4f} nats")
print(f"Matches H(P,Q)? {np.isclose(cross_entropy(P, Q), entropy(P) + kl_divergence(P, Q))}")
```
Example 4: Binary cross-entropy loss calculation
A binary classifier predicts $\hat{q} = 0.9$ for a sample where the true label is $y = 1$. The loss is $-\ln 0.9 \approx 0.105$ nats: a small penalty for a confident, correct prediction.
Now suppose the model predicts $\hat{q} = 0.9$ for a sample with $y = 0$. The loss is $-\ln(1 - 0.9) = -\ln 0.1 \approx 2.303$ nats, more than twenty times larger.
And for an uncertain prediction $\hat{q} = 0.5$, the loss is $-\ln 0.5 \approx 0.693$ nats regardless of the label.
The loss penalizes confident wrong predictions much more severely than uncertain ones. This is a desirable property because it forces the model to be honest about what it does not know.
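The blow-up is easy to see by tabulating the loss for a correct label (y = 1) as the model's probability for that label shrinks:

```python
import math

# Binary cross-entropy for y = 1 reduces to -ln(q); watch it grow as
# the predicted probability of the true class approaches zero.
for q in [0.9, 0.5, 0.1, 0.01, 0.001]:
    print(f"q = {q:>5}: loss = {-math.log(q):.4f} nats")
```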
Summary
| Concept | Formula | Intuition |
|---|---|---|
| Information | $I(x) = -\log p(x)$ | Surprise of an event |
| Entropy | $H(P) = -\sum_x p(x) \log p(x)$ | Average surprise |
| KL divergence | $D_{KL}(P \parallel Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ | Extra cost of using $Q$ instead of $P$ |
| Cross-entropy | $H(P, Q) = -\sum_x p(x) \log q(x)$ | Total cost of using $Q$ to encode $P$ |
| Key relation | $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$ | Minimizing cross-entropy = minimizing KL |
The thread connecting everything: minimizing cross-entropy loss trains your model to match the true data distribution as closely as possible. That is why it is the default loss for classification.
What comes next
This concludes the probability and information theory section of the Maths for ML series. The next part shifts to optimization, starting with the question of what optimization is, where we formalize the problem of finding the best parameters for a model.