Maths for ML · Part 15

Information theory: entropy, KL divergence, cross-entropy

In this series (15 parts)
  1. Why Maths Matters for ML: A Practical Overview
  2. Scalars, Vectors, and Vector Spaces
  3. Matrices and Matrix Operations
  4. Matrix Inverses and Systems of Linear Equations
  5. Eigenvalues and Eigenvectors
  6. Matrix Decompositions: LU, QR, SVD
  7. Norms, Distances, and Similarity
  8. Calculus Review: Derivatives and the Chain Rule
  9. Partial Derivatives and Gradients
  10. The Jacobian and Hessian Matrices
  11. Taylor series and local approximations
  12. Probability fundamentals
  13. Random variables and distributions
  14. Bayes theorem and its role in ML
  15. Information theory: entropy, KL divergence, cross-entropy

Cross-entropy loss is the default loss function for classification in ML. But where does it come from? The answer lies in information theory, a field that Claude Shannon created in 1948 to study communication. The core ideas of entropy, KL divergence, and cross-entropy turn out to be exactly what you need to measure how well a model’s predicted distribution matches the true distribution of the data.

Prerequisites

You should be familiar with random variables, distributions, and expectation before reading this.

Information content (surprise)

Start with a single event. If an event has probability $p$, its information content (or “surprise”) is:

I(p) = -\log_2(p) \quad \text{(in bits)}

Why negative log? Three reasons:

  1. Certain events ($p = 1$) carry zero information: $-\log_2(1) = 0$
  2. Rare events carry more information: as $p \to 0$, $I(p) \to \infty$
  3. Independent events have additive information: $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$

Example: A fair coin flip has $p = 0.5$, so $I = -\log_2(0.5) = 1$ bit. That is one bit of information, which makes intuitive sense because you need exactly one binary digit to encode heads or tails.

If the coin is biased with $p(\text{heads}) = 0.9$, then heads carries $-\log_2(0.9) = 0.152$ bits (not very surprising) and tails carries $-\log_2(0.1) = 3.32$ bits (very surprising).
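These coin numbers are easy to check numerically. A minimal sketch of the surprise function, using only the standard library:

```python
import math

def surprise_bits(p):
    """Information content -log2(p) of an event with probability p."""
    return -math.log2(p)

print(f"{surprise_bits(0.5):.3f}")  # fair coin flip: 1.000 bit
print(f"{surprise_bits(0.9):.3f}")  # likely heads: 0.152 bits
print(f"{surprise_bits(0.1):.3f}")  # rare tails: 3.322 bits
```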

Shannon entropy

Entropy is the expected information content, the average surprise:

H(X) = -\sum_{x} p(x) \log_2 p(x)

For continuous distributions, replace the sum with an integral (called “differential entropy”). In ML, we often use natural log ($\ln$) instead of $\log_2$, which gives entropy in “nats” instead of “bits.” The math is identical up to a constant factor.

Entropy measures the inherent uncertainty in a distribution. High entropy means the distribution is spread out (hard to predict). Low entropy means the distribution is concentrated (easy to predict).

Example 1: Computing entropy

Distribution 1: A fair die, $p(x) = 1/6$ for $x = 1, \ldots, 6$.

H(X) = -\sum_{x=1}^{6} \frac{1}{6} \log_2 \frac{1}{6} = -6 \times \frac{1}{6} \times (-2.585) = 2.585 \text{ bits}

Step by step:

\log_2(1/6) = \log_2(1) - \log_2(6) = 0 - 2.585 = -2.585
H(X) = -6 \times \frac{1}{6} \times (-2.585) = 2.585

Distribution 2: A loaded die with $p(6) = 0.5$ and $p(x) = 0.1$ for $x = 1, \ldots, 5$.

H(X) = -\left[5 \times 0.1 \times \log_2(0.1) + 0.5 \times \log_2(0.5)\right]
= -\left[5 \times 0.1 \times (-3.322) + 0.5 \times (-1.0)\right]
= -\left[-1.661 + (-0.5)\right]
= -(-2.161) = 2.161 \text{ bits}

Distribution 3: A degenerate die that always shows 6: $p(6) = 1$, all others $= 0$.

H(X) = -[1 \times \log_2(1)] = 0 \text{ bits}

The fair die has the highest entropy (most uncertain), and the degenerate die has zero entropy (no uncertainty at all). This is a general principle: among all distributions on $n$ outcomes, the uniform distribution has maximum entropy $\log_2(n)$.
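The three dice can be checked with a few lines of Python. This is a standalone sketch in bits (the fuller NumPy version later in the article works in nats):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_die = [1/6] * 6
loaded_die = [0.1] * 5 + [0.5]
degenerate_die = [0, 0, 0, 0, 0, 1]

print(f"fair:       {entropy_bits(fair_die):.3f} bits")        # 2.585
print(f"loaded:     {entropy_bits(loaded_die):.3f} bits")      # 2.161
print(f"degenerate: {entropy_bits(degenerate_die):.3f} bits")  # 0.000
```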

Comparing entropy levels across distributions:

graph LR
  A["Certain: 0 bits"] -->|"more uncertain"| B["Biased coin: 0.47 bits"]
  B -->|"more uncertain"| C["Loaded die: 2.16 bits"]
  C -->|"more uncertain"| D["Fair die: 2.58 bits"]


KL divergence

The Kullback–Leibler divergence measures how different one distribution $Q$ is from a reference distribution $P$:

D_{\text{KL}}(P \| Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}

Think of it as: “How many extra bits do I need if I use distribution $Q$ to encode data that actually follows distribution $P$?”

Key properties:

  • $D_{\text{KL}}(P \| Q) \geq 0$ always (Gibbs’ inequality)
  • $D_{\text{KL}}(P \| Q) = 0$ if and only if $P = Q$
  • Not symmetric: $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$ in general

Because it is not symmetric, KL divergence is not a true “distance.” But it is still incredibly useful as a measure of distributional mismatch.

KL divergence is not symmetric:

graph LR
  P["Distribution P"] -->|"D_KL = 0.0253 nats"| Q["Distribution Q"]
  Q -->|"D_KL = 0.0258 nats"| P

Example 2: Computing KL divergence

Let $P$ and $Q$ be distributions over three outcomes:

| $x$ | $p(x)$ | $q(x)$ |
|-----|--------|--------|
| A   | 0.5    | 0.4    |
| B   | 0.3    | 0.4    |
| C   | 0.2    | 0.2    |

Compute $D_{\text{KL}}(P \| Q)$, using natural log (nats) as ML typically does:

D_{\text{KL}}(P \| Q) = \sum_x p(x) \ln \frac{p(x)}{q(x)}

= 0.5 \ln\frac{0.5}{0.4} + 0.3 \ln\frac{0.3}{0.4} + 0.2 \ln\frac{0.2}{0.2}

Compute each term:

0.5 \times \ln(1.25) = 0.5 \times 0.2231 = 0.1116
0.3 \times \ln(0.75) = 0.3 \times (-0.2877) = -0.0863
0.2 \times \ln(1.0) = 0.2 \times 0 = 0
D_{\text{KL}}(P \| Q) = 0.1116 + (-0.0863) + 0 = 0.0253 \text{ nats}

Now compute $D_{\text{KL}}(Q \| P)$ to see the asymmetry:

= 0.4 \ln\frac{0.4}{0.5} + 0.4 \ln\frac{0.4}{0.3} + 0.2 \ln\frac{0.2}{0.2}
= 0.4 \times (-0.2231) + 0.4 \times 0.2877 + 0
= -0.0893 + 0.1151 + 0 = 0.0258 \text{ nats}

The two directions give different values: 0.0253 vs 0.0258. Close in this case, but they can differ substantially for more dissimilar distributions.
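Both directions are easy to verify numerically. A minimal sketch, assuming $q(x) > 0$ wherever $p(x) > 0$:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

print(f"D_KL(P || Q) = {kl_divergence(P, Q):.4f} nats")  # 0.0253
print(f"D_KL(Q || P) = {kl_divergence(Q, P):.4f} nats")  # 0.0258
```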

Cross-entropy

Cross-entropy combines entropy and KL divergence:

H(P, Q) = -\sum_{x} p(x) \log q(x)

The relationship is:

H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)

Since $H(P)$ is fixed (it depends only on the true distribution), minimizing the cross-entropy $H(P, Q)$ with respect to $Q$ is the same as minimizing the KL divergence. This is why cross-entropy is the loss function of choice: minimizing cross-entropy loss makes your model’s distribution $Q$ as close as possible to the true distribution $P$.

Cross-entropy in the ML training pipeline:

graph LR
  A["True labels P"] --> D["Cross-entropy loss H(P,Q)"]
  B["Model predictions Q"] --> D
  D --> E["Loss value"]
  E --> F["Backpropagate gradients"]
  F --> G["Update model weights"]
  G --> B

Example 3: Cross-entropy is at least as large as entropy

Using the distributions from Example 2:

Entropy of $P$:

H(P) = -[0.5 \ln(0.5) + 0.3 \ln(0.3) + 0.2 \ln(0.2)]
= -[0.5 \times (-0.6931) + 0.3 \times (-1.2040) + 0.2 \times (-1.6094)]
= -[-0.3466 + (-0.3612) + (-0.3219)]
= -(-1.0297) = 1.0297 \text{ nats}

Cross-entropy $H(P, Q)$:

H(P, Q) = -[0.5 \ln(0.4) + 0.3 \ln(0.4) + 0.2 \ln(0.2)]
= -[0.5 \times (-0.9163) + 0.3 \times (-0.9163) + 0.2 \times (-1.6094)]
= -[-0.4581 + (-0.2749) + (-0.3219)]
= -(-1.0549) = 1.0549 \text{ nats}

Verify the relationship:

H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)
1.0549 \approx 1.0297 + 0.0253 = 1.0550 \quad \checkmark

(Tiny difference due to rounding.)

And since $D_{\text{KL}} \geq 0$, we always have $H(P, Q) \geq H(P)$. Cross-entropy is minimized when $Q = P$, at which point cross-entropy equals entropy. This is the theoretical lower bound.

Cross-entropy loss in classification

For a classifier with $K$ classes, the true label is a one-hot vector $\mathbf{y}$ (all zeros except a 1 for the correct class) and the model outputs probabilities $\hat{\mathbf{y}}$:

L = -\sum_{k=1}^{K} y_k \log \hat{y}_k

Since $\mathbf{y}$ is one-hot (say class $c$ is correct, so $y_c = 1$ and all others are 0):

L = -\log \hat{y}_c

This is just the negative log of the model’s predicted probability for the correct class. If the model is confident and correct ($\hat{y}_c \approx 1$), the loss is near 0. If the model assigns low probability to the correct class ($\hat{y}_c \approx 0$), the loss blows up toward infinity.
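To see the blow-up concretely, here is a small sketch of softmax cross-entropy for a single sample; the logit values are made up for illustration:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector z."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(logits, true_class):
    """-log of the predicted probability for the correct class."""
    probs = softmax(np.asarray(logits, dtype=float))
    return -np.log(probs[true_class])

logits = [4.0, 0.5, 0.1]              # hypothetical model outputs
print(cross_entropy_loss(logits, 0))  # correct class dominant: small loss
print(cross_entropy_loss(logits, 2))  # low-probability class: large loss
```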

For binary classification with $y \in \{0, 1\}$ and predicted probability $\hat{p}$:

L = -[y \log \hat{p} + (1 - y) \log(1 - \hat{p})]

This is the binary cross-entropy (log loss). It is the standard loss for logistic regression.

Connection to maximum likelihood

Cross-entropy loss and maximum likelihood estimation (MLE) are the same thing in disguise.

Given data points $(x_1, y_1), \ldots, (x_n, y_n)$ and a model with parameters $\theta$ that outputs $P(y \mid x; \theta)$, the log-likelihood is:

\log L(\theta) = \sum_{i=1}^{n} \log P(y_i \mid x_i; \theta)

Maximizing log-likelihood is equivalent to minimizing:

-\frac{1}{n} \sum_{i=1}^{n} \log P(y_i \mid x_i; \theta)

That is exactly the average cross-entropy between the empirical data distribution and the model’s predictions. So when you minimize cross-entropy loss, you are doing MLE.
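A quick numerical check of this equivalence, using made-up binary labels and predicted probabilities:

```python
import numpy as np

# Hypothetical labels and model-predicted probabilities P(y=1 | x)
y = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Log-likelihood of the data under the model
log_lik = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Average cross-entropy (binary cross-entropy) loss
bce = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Minimizing the loss is maximizing the likelihood: bce == -log_lik / n
print(np.isclose(bce, -log_lik / len(y)))  # True
```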

Add a prior $P(\theta)$ and you get MAP estimation, which corresponds to cross-entropy plus regularization.

When to use which loss

| Loss | Formula | When to use |
|------|---------|-------------|
| Cross-entropy | $-\sum y_k \log \hat{y}_k$ | Classification (softmax output) |
| Binary cross-entropy | $-[y \log \hat{p} + (1-y)\log(1-\hat{p})]$ | Binary classification |
| Mean squared error | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ | Regression (Gaussian likelihood) |

MSE corresponds to MLE under Gaussian noise assumptions. Cross-entropy corresponds to MLE for categorical distributions.
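The MSE claim follows in one line: for a fixed noise variance $\sigma^2$, the negative log-likelihood of one sample under a Gaussian centered at the prediction $\hat{y}$ is

```latex
-\log \mathcal{N}(y \mid \hat{y}, \sigma^2)
  = -\log\!\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left( -\frac{(y - \hat{y})^2}{2\sigma^2} \right) \right]
  = \frac{(y - \hat{y})^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)
```

The second term is constant in $\hat{y}$, so minimizing the negative log-likelihood over predictions is exactly minimizing squared error.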

Entropy in ML beyond loss functions

Entropy shows up in several other places:

  • Decision trees use entropy (or the related Gini impurity) to choose splits. A split is good if it reduces the entropy of the label distribution in each child node.
  • Mutual information $I(X; Y) = H(X) - H(X \mid Y)$ measures how much knowing $Y$ reduces uncertainty about $X$. It is used in feature selection.
  • The Gaussian has maximum entropy among all distributions with a given mean and variance. This justifies assuming Gaussian noise when you have no other information; it is the “least committed” assumption.
  • Variational autoencoders use KL divergence to regularize the latent space, pushing the encoder’s distribution toward a standard Gaussian.
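As a small illustration of the mutual information bullet, here is a sketch computing $I(X; Y)$ directly from its definition, using a made-up joint distribution over two binary variables:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary variables
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

px = joint.sum(axis=1)  # marginal p(x)
py = joint.sum(axis=0)  # marginal p(y)

# I(X; Y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
mi = sum(
    joint[i, j] * np.log(joint[i, j] / (px[i] * py[j]))
    for i in range(2)
    for j in range(2)
    if joint[i, j] > 0
)
print(f"I(X;Y) = {mi:.4f} nats")  # 0.0863
```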

Python: computing entropy and cross-entropy

import numpy as np

def entropy(p):
    """Shannon entropy in nats."""
    p = np.array(p)
    # Avoid log(0) by filtering
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats."""
    p, q = np.array(p), np.array(q)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    """KL divergence D_KL(P || Q) in nats."""
    return cross_entropy(p, q) - entropy(p)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

print(f"H(P)         = {entropy(P):.4f} nats")
print(f"H(P, Q)      = {cross_entropy(P, Q):.4f} nats")
print(f"D_KL(P || Q) = {kl_divergence(P, Q):.4f} nats")
print(f"H(P) + D_KL  = {entropy(P) + kl_divergence(P, Q):.4f} nats")
print(f"Matches H(P,Q)? {np.isclose(cross_entropy(P, Q), entropy(P) + kl_divergence(P, Q))}")

Example 4: Binary cross-entropy loss calculation

A binary classifier predicts $\hat{p} = 0.8$ for a sample where the true label is $y = 1$.

L = -[1 \times \ln(0.8) + 0 \times \ln(0.2)] = -\ln(0.8) = -(-0.2231) = 0.2231

Now suppose the model predicts $\hat{p} = 0.3$ for a sample with $y = 1$.

L = -\ln(0.3) = -(-1.2040) = 1.2040

And $\hat{p} = 0.95$ for $y = 0$:

L = -[0 \times \ln(0.95) + 1 \times \ln(0.05)] = -\ln(0.05) = 2.9957

The loss penalizes confident wrong predictions much more severely than uncertain ones. This is a desirable property because it forces the model to be honest about what it does not know.
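The three hand computations above can be reproduced with a small helper. The clipping epsilon is a common practical guard against taking log(0):

```python
import math

def binary_cross_entropy(y, p_hat, eps=1e-12):
    """Binary cross-entropy for one sample, clipping p_hat away from 0 and 1."""
    p_hat = min(max(p_hat, eps), 1 - eps)
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

print(f"{binary_cross_entropy(1, 0.8):.4f}")   # 0.2231 (confident, correct)
print(f"{binary_cross_entropy(1, 0.3):.4f}")   # 1.2040 (wrong side of 0.5)
print(f"{binary_cross_entropy(0, 0.95):.4f}")  # 2.9957 (confident, wrong)
```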

Summary

| Concept | Formula | Intuition |
|---------|---------|-----------|
| Information | $-\log p(x)$ | Surprise of an event |
| Entropy | $-\sum p(x) \log p(x)$ | Average surprise |
| KL divergence | $\sum p(x) \log \frac{p(x)}{q(x)}$ | Extra cost of using $Q$ instead of $P$ |
| Cross-entropy | $-\sum p(x) \log q(x)$ | Total cost of using $Q$ to encode $P$ |
| Key relation | $H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)$ | Minimizing cross-entropy = minimizing KL |

The thread connecting everything: minimizing cross-entropy loss trains your model to match the true data distribution as closely as possible. That is why it is the default loss for classification.

What comes next

This concludes the probability and information theory section of the Maths for ML series. The next part of the series shifts to optimization, starting with the question of what optimization is, where we formalize the problem of finding the best parameters for a model.
