Maths for ML · Part 15

Information theory: entropy, KL divergence, cross-entropy

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor series and local approximations
  12. Probability fundamentals
  13. Random variables and distributions
  14. Bayes theorem and its role in ML
  15. Information theory: entropy, KL divergence, cross-entropy

Cross-entropy loss is the default loss function for classification in ML. But where does it come from? The answer lies in information theory, a field that Claude Shannon created in 1948 to study communication. The core ideas of entropy, KL divergence, and cross-entropy turn out to be exactly what you need to measure how well a model’s predicted distribution matches the true distribution of the data.

Prerequisites

You should be familiar with random variables, distributions, and expectation before reading this.

Information content (surprise)

Start with a single event. If an event has probability $p$, its information content (or “surprise”) is:

I(p) = -\log_2(p) \quad \text{(in bits)}

Why negative log? Three reasons:

  1. Certain events ($p = 1$) carry zero information: $-\log_2(1) = 0$
  2. Rare events carry more information: as $p \to 0$, $I(p) \to \infty$
  3. Independent events have additive information: $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$

Example: A fair coin flip has $p = 0.5$, so $I = -\log_2(0.5) = 1$ bit. That is one bit of information, which makes intuitive sense because you need exactly one binary digit to encode heads or tails.

If the coin is biased with $p(\text{heads}) = 0.9$, then heads carries $-\log_2(0.9) = 0.152$ bits (not very surprising) and tails carries $-\log_2(0.1) = 3.32$ bits (very surprising).
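These coin numbers are easy to check numerically. A minimal sketch of the surprise function, using only the standard library:

```python
import math

def surprise_bits(p):
    """Information content -log2(p) of an event with probability p."""
    return -math.log2(p)

print(f"{surprise_bits(0.5):.3f}")  # fair coin flip: 1.000 bit
print(f"{surprise_bits(0.9):.3f}")  # likely heads: 0.152 bits
print(f"{surprise_bits(0.1):.3f}")  # rare tails: 3.322 bits
```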

Shannon entropy

Entropy is the expected information content, the average surprise:

H(X) = -\sum_{x} p(x) \log_2 p(x)

For continuous distributions, replace the sum with an integral (called “differential entropy”). In ML, we often use natural log ($\ln$) instead of $\log_2$, which gives entropy in “nats” instead of “bits.” The math is identical up to a constant factor.

Entropy measures the inherent uncertainty in a distribution. High entropy means the distribution is spread out (hard to predict). Low entropy means the distribution is concentrated (easy to predict).

Example 1: Computing entropy

Distribution 1: A fair die, $p(x) = 1/6$ for $x = 1, \ldots, 6$.

H(X) = -\sum_{x=1}^{6} \frac{1}{6} \log_2 \frac{1}{6} = -6 \times \frac{1}{6} \times (-2.585) = 2.585 \text{ bits}

Step by step:

\log_2(1/6) = \log_2(1) - \log_2(6) = 0 - 2.585 = -2.585
H(X) = -6 \times \frac{1}{6} \times (-2.585) = 2.585

Distribution 2: A loaded die with $p(6) = 0.5$ and $p(x) = 0.1$ for $x = 1, \ldots, 5$.

H(X) = -\left[5 \times 0.1 \times \log_2(0.1) + 0.5 \times \log_2(0.5)\right]
= -\left[5 \times 0.1 \times (-3.322) + 0.5 \times (-1.0)\right]
= -\left[-1.661 + (-0.5)\right]
= -(-2.161) = 2.161 \text{ bits}

Distribution 3: A degenerate die that always shows 6: $p(6) = 1$, all others $= 0$.

H(X) = -[1 \times \log_2(1)] = 0 \text{ bits}

The fair die has the highest entropy (most uncertain), and the degenerate die has zero entropy (no uncertainty at all). This is a general principle: among all distributions on $n$ outcomes, the uniform distribution has maximum entropy $\log_2(n)$.
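The three dice can be checked with a few lines of Python. This is a standalone sketch in bits (the fuller NumPy version later in the article works in nats):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_die = [1/6] * 6
loaded_die = [0.1] * 5 + [0.5]
degenerate_die = [0, 0, 0, 0, 0, 1]

print(f"fair:       {entropy_bits(fair_die):.3f} bits")        # 2.585
print(f"loaded:     {entropy_bits(loaded_die):.3f} bits")      # 2.161
print(f"degenerate: {entropy_bits(degenerate_die):.3f} bits")  # 0.000
```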

Comparing entropy levels across distributions:

graph LR
  A["Certain: 0 bits"] -->|"more uncertain"| B["Biased coin: 0.47 bits"]
  B -->|"more uncertain"| C["Loaded die: 2.16 bits"]
  C -->|"more uncertain"| D["Fair die: 2.58 bits"]


KL divergence

The Kullback–Leibler divergence measures how different one distribution $Q$ is from a reference distribution $P$:

D_{\text{KL}}(P \| Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}

Think of it as: “How many extra bits do I need if I use distribution $Q$ to encode data that actually follows distribution $P$?”

Key properties:

  • $D_{\text{KL}}(P \| Q) \geq 0$ always (Gibbs’ inequality)
  • $D_{\text{KL}}(P \| Q) = 0$ if and only if $P = Q$
  • Not symmetric: $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$ in general

Because it is not symmetric, KL divergence is not a true “distance.” But it is still incredibly useful as a measure of distributional mismatch.

KL divergence is not symmetric:

graph LR
  P["Distribution P"] -->|"D_KL = 0.0253 nats"| Q["Distribution Q"]
  Q -->|"D_KL = 0.0258 nats"| P

Example 2: Computing KL divergence

Let $P$ and $Q$ be distributions over three outcomes:

| $x$ | $p(x)$ | $q(x)$ |
|-----|--------|--------|
| A   | 0.5    | 0.4    |
| B   | 0.3    | 0.4    |
| C   | 0.2    | 0.2    |

Compute $D_{\text{KL}}(P \| Q)$, using natural log (nats) as ML typically does:

D_{\text{KL}}(P \| Q) = \sum_x p(x) \ln \frac{p(x)}{q(x)}

= 0.5 \ln\frac{0.5}{0.4} + 0.3 \ln\frac{0.3}{0.4} + 0.2 \ln\frac{0.2}{0.2}

Compute each term:

0.5 \times \ln(1.25) = 0.5 \times 0.2231 = 0.1116
0.3 \times \ln(0.75) = 0.3 \times (-0.2877) = -0.0863
0.2 \times \ln(1.0) = 0.2 \times 0 = 0
D_{\text{KL}}(P \| Q) = 0.1116 + (-0.0863) + 0 = 0.0253 \text{ nats}

Now compute $D_{\text{KL}}(Q \| P)$ to see the asymmetry:

= 0.4 \ln\frac{0.4}{0.5} + 0.4 \ln\frac{0.4}{0.3} + 0.2 \ln\frac{0.2}{0.2}
= 0.4 \times (-0.2231) + 0.4 \times 0.2877 + 0
= -0.0893 + 0.1151 + 0 = 0.0258 \text{ nats}

The two directions give different values: 0.0253 vs 0.0258. Close in this case, but they can differ substantially for more dissimilar distributions.
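Both directions are easy to verify numerically. A minimal sketch, assuming $q(x) > 0$ wherever $p(x) > 0$:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

print(f"D_KL(P || Q) = {kl_divergence(P, Q):.4f} nats")  # 0.0253
print(f"D_KL(Q || P) = {kl_divergence(Q, P):.4f} nats")  # 0.0258
```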

Cross-entropy

Cross-entropy combines entropy and KL divergence:

H(P, Q) = -\sum_{x} p(x) \log q(x)

The relationship is:

H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)

Since $H(P)$ is fixed (it depends only on the true distribution), minimizing the cross-entropy $H(P, Q)$ with respect to $Q$ is the same as minimizing the KL divergence. This is why cross-entropy is the loss function of choice: minimizing cross-entropy loss makes your model’s distribution $Q$ as close as possible to the true distribution $P$.

Cross-entropy in the ML training pipeline:

graph LR
  A["True labels P"] --> D["Cross-entropy loss H(P,Q)"]
  B["Model predictions Q"] --> D
  D --> E["Loss value"]
  E --> F["Backpropagate gradients"]
  F --> G["Update model weights"]
  G --> B

Example 3: Cross-entropy is at least as large as entropy

Using the distributions from Example 2:

Entropy of $P$:

H(P) = -[0.5 \ln(0.5) + 0.3 \ln(0.3) + 0.2 \ln(0.2)]
= -[0.5 \times (-0.6931) + 0.3 \times (-1.2040) + 0.2 \times (-1.6094)]
= -[-0.3466 + (-0.3612) + (-0.3219)]
= -(-1.0297) = 1.0297 \text{ nats}

Cross-entropy $H(P, Q)$:

H(P, Q) = -[0.5 \ln(0.4) + 0.3 \ln(0.4) + 0.2 \ln(0.2)]
= -[0.5 \times (-0.9163) + 0.3 \times (-0.9163) + 0.2 \times (-1.6094)]
= -[-0.4581 + (-0.2749) + (-0.3219)]
= -(-1.0549) = 1.0549 \text{ nats}

Verify the relationship:

H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)
1.0549 \approx 1.0297 + 0.0253 = 1.0550 \quad \checkmark

(Tiny difference due to rounding.)

And since $D_{\text{KL}} \geq 0$, we always have $H(P, Q) \geq H(P)$. Cross-entropy is minimized when $Q = P$, at which point cross-entropy equals entropy. This is the theoretical lower bound.

Cross-entropy loss in classification

For a classifier with $K$ classes, the true label is a one-hot vector $\mathbf{y}$ (all zeros except a 1 for the correct class) and the model outputs probabilities $\hat{\mathbf{y}}$:

L = -\sum_{k=1}^{K} y_k \log \hat{y}_k

Since $\mathbf{y}$ is one-hot (say class $c$ is correct, so $y_c = 1$ and all others are 0):

L = -\log \hat{y}_c

This is just the negative log of the model’s predicted probability for the correct class. If the model is confident and correct ($\hat{y}_c \approx 1$), the loss is near 0. If the model assigns low probability to the correct class ($\hat{y}_c \approx 0$), the loss blows up toward infinity.
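To see the blow-up concretely, here is a small sketch of softmax cross-entropy for a single sample; the logit values are made up for illustration:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector z."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(logits, true_class):
    """-log of the predicted probability for the correct class."""
    probs = softmax(np.asarray(logits, dtype=float))
    return -np.log(probs[true_class])

logits = [4.0, 0.5, 0.1]              # hypothetical model outputs
print(cross_entropy_loss(logits, 0))  # correct class dominant: small loss
print(cross_entropy_loss(logits, 2))  # low-probability class: large loss
```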

For binary classification with $y \in \{0, 1\}$ and predicted probability $\hat{p}$:

L = -[y \log \hat{p} + (1 - y) \log(1 - \hat{p})]

This is the binary cross-entropy (log loss). It is the standard loss for logistic regression.

Connection to maximum likelihood

Cross-entropy loss and maximum likelihood estimation (MLE) are the same thing in disguise.

Given data points $(x_1, y_1), \ldots, (x_n, y_n)$ and a model with parameters $\theta$ that outputs $P(y \mid x; \theta)$, the log-likelihood is:

\log L(\theta) = \sum_{i=1}^{n} \log P(y_i \mid x_i; \theta)

Maximizing log-likelihood is equivalent to minimizing:

-\frac{1}{n} \sum_{i=1}^{n} \log P(y_i \mid x_i; \theta)

That is exactly the average cross-entropy between the empirical data distribution and the model’s predictions. So when you minimize cross-entropy loss, you are doing MLE.
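A quick numerical check of this equivalence, using made-up binary labels and predicted probabilities:

```python
import numpy as np

# Hypothetical labels and model-predicted probabilities P(y=1 | x)
y = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Log-likelihood of the data under the model
log_lik = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Average cross-entropy (binary cross-entropy) loss
bce = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Minimizing the loss is maximizing the likelihood: bce == -log_lik / n
print(np.isclose(bce, -log_lik / len(y)))  # True
```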

Add a prior $P(\theta)$ and you get MAP estimation, which corresponds to cross-entropy plus regularization.

When to use which loss

| Loss | Formula | When to use |
|------|---------|-------------|
| Cross-entropy | $-\sum y_k \log \hat{y}_k$ | Classification (softmax output) |
| Binary cross-entropy | $-[y \log \hat{p} + (1-y)\log(1-\hat{p})]$ | Binary classification |
| Mean squared error | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ | Regression (Gaussian likelihood) |

MSE corresponds to MLE under Gaussian noise assumptions. Cross-entropy corresponds to MLE for categorical distributions.
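The MSE claim follows in one line: for a fixed noise variance $\sigma^2$, the negative log-likelihood of one sample under a Gaussian centered at the prediction $\hat{y}$ is

```latex
-\log \mathcal{N}(y \mid \hat{y}, \sigma^2)
  = -\log\!\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left( -\frac{(y - \hat{y})^2}{2\sigma^2} \right) \right]
  = \frac{(y - \hat{y})^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)
```

The second term is constant in $\hat{y}$, so minimizing the negative log-likelihood over predictions is exactly minimizing squared error.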

Entropy in ML beyond loss functions

Entropy shows up in several other places:

  • Decision trees use entropy (or the related Gini impurity) to choose splits. A split is good if it reduces the entropy of the label distribution in each child node.
  • Mutual information $I(X; Y) = H(X) - H(X \mid Y)$ measures how much knowing $Y$ reduces uncertainty about $X$. It is used in feature selection.
  • The Gaussian has maximum entropy among all distributions with a given mean and variance. This justifies assuming Gaussian noise when you have no other information; it is the “least committed” assumption.
  • Variational autoencoders use KL divergence to regularize the latent space, pushing the encoder’s distribution toward a standard Gaussian.
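As a small illustration of the mutual information bullet, here is a sketch computing $I(X; Y)$ directly from its definition, using a made-up joint distribution over two binary variables:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary variables
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

px = joint.sum(axis=1)  # marginal p(x)
py = joint.sum(axis=0)  # marginal p(y)

# I(X; Y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
mi = sum(
    joint[i, j] * np.log(joint[i, j] / (px[i] * py[j]))
    for i in range(2)
    for j in range(2)
    if joint[i, j] > 0
)
print(f"I(X;Y) = {mi:.4f} nats")  # 0.0863
```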

Python: computing entropy and cross-entropy

import numpy as np

def entropy(p):
    """Shannon entropy in nats."""
    p = np.array(p)
    # Avoid log(0) by filtering
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats."""
    p, q = np.array(p), np.array(q)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    """KL divergence D_KL(P || Q) in nats."""
    return cross_entropy(p, q) - entropy(p)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

print(f"H(P)         = {entropy(P):.4f} nats")
print(f"H(P, Q)      = {cross_entropy(P, Q):.4f} nats")
print(f"D_KL(P || Q) = {kl_divergence(P, Q):.4f} nats")
print(f"H(P) + D_KL  = {entropy(P) + kl_divergence(P, Q):.4f} nats")
print(f"Matches H(P,Q)? {np.isclose(cross_entropy(P, Q), entropy(P) + kl_divergence(P, Q))}")

Example 4: Binary cross-entropy loss calculation

A binary classifier predicts $\hat{p} = 0.8$ for a sample where the true label is $y = 1$.

L = -[1 \times \ln(0.8) + 0 \times \ln(0.2)] = -\ln(0.8) = -(-0.2231) = 0.2231

Now suppose the model predicts $\hat{p} = 0.3$ for a sample with $y = 1$.

L = -\ln(0.3) = -(-1.2040) = 1.2040

And $\hat{p} = 0.95$ for $y = 0$:

L = -[0 \times \ln(0.95) + 1 \times \ln(0.05)] = -\ln(0.05) = 2.9957

The loss penalizes confident wrong predictions much more severely than uncertain ones. This is a desirable property because it forces the model to be honest about what it does not know.
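The three hand computations above can be reproduced with a small helper. The clipping epsilon is a common practical guard against taking log(0):

```python
import math

def binary_cross_entropy(y, p_hat, eps=1e-12):
    """Binary cross-entropy for one sample, clipping p_hat away from 0 and 1."""
    p_hat = min(max(p_hat, eps), 1 - eps)
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

print(f"{binary_cross_entropy(1, 0.8):.4f}")   # 0.2231 (confident, correct)
print(f"{binary_cross_entropy(1, 0.3):.4f}")   # 1.2040 (wrong side of 0.5)
print(f"{binary_cross_entropy(0, 0.95):.4f}")  # 2.9957 (confident, wrong)
```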

Summary

| Concept | Formula | Intuition |
|---------|---------|-----------|
| Information | $-\log p(x)$ | Surprise of an event |
| Entropy | $-\sum p(x) \log p(x)$ | Average surprise |
| KL divergence | $\sum p(x) \log \frac{p(x)}{q(x)}$ | Extra cost of using $Q$ instead of $P$ |
| Cross-entropy | $-\sum p(x) \log q(x)$ | Total cost of using $Q$ to encode $P$ |
| Key relation | $H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)$ | Minimizing cross-entropy = minimizing KL |

The thread connecting everything: minimizing cross-entropy loss trains your model to match the true data distribution as closely as possible. That is why it is the default loss for classification.

What comes next

This concludes the probability and information theory section of the Maths for ML series. The next part of the series shifts to optimization, starting with the question of what optimization is, where we formalize the problem of finding the best parameters for a model.
