Information Theory for Machine Learning

From bits to loss functions: why cross-entropy works

Every ML practitioner uses cross-entropy loss. Few understand where it comes from. This tutorial builds it from first principles: surprise, entropy, and the cost of being wrong.

Surprise: The Atomic Unit

Start with a single event. If something unlikely happens, you're surprised. If something certain happens, you're not. We quantify this:

Surprise (Self-Information) $$I(x) = -\log_2 P(x)$$
The negative log of probability. Rare events (low P) give high surprise. Certain events (P=1) give zero surprise.
surprise.py
import math

def surprise(p):
    return -math.log2(p)

surprise(1.0)   # 0.0 bits (certain)
surprise(0.5)   # 1.0 bit  (coin flip)
surprise(0.125) # 3.0 bits (1 in 8)
[Interactive demo: Surprise vs Probability. At P(x) = 0.50, surprise is 1.00 bit.]

Entropy: Average Surprise

Entropy is the expected surprise across all outcomes. Weight each outcome's surprise by its probability:

Entropy $$H(P) = \mathbb{E}[I(x)] = -\sum_x P(x) \log P(x)$$
High entropy = high uncertainty. Low entropy = concentrated probability.
entropy.py
def entropy(probs):
    H = 0
    for p in probs:
        if p > 0: H += p * surprise(p)
    return H

entropy([0.5, 0.5])  # 1.0 bit  (max for 2)
entropy([0.9, 0.1])  # 0.47 bits (lower)
[Interactive demo: Entropy of a Distribution, comparing the entropy of the chosen outcomes to the maximum possible.]
Key Insight

Entropy is maximized when all outcomes are equally likely. Any deviation from uniform reduces entropy.
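
To see this numerically, here is a self-contained sketch (a compact restatement of the entropy function above; the four-outcome distributions are just illustrative) showing entropy fall as probability concentrates:

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

entropy([0.25, 0.25, 0.25, 0.25])  # 2.00 bits (uniform: the maximum for 4 outcomes)
entropy([0.70, 0.10, 0.10, 0.10])  # ~1.36 bits
entropy([0.97, 0.01, 0.01, 0.01])  # ~0.24 bits
entropy([1.00, 0.00, 0.00, 0.00])  # 0.00 bits (one certain outcome)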

Cross-Entropy: Wrong Model Surprise

We don't know the true distribution P. We have a model Q. Cross-entropy measures the average surprise when the world follows P but we use Q:

Cross-Entropy $$H(P, Q) = -\sum_x P(x) \log Q(x)$$
Sample from P (reality), measure surprise using Q (model). If Q = P, cross-entropy equals entropy.
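
As a minimal sketch (the helper name cross_entropy and the example distributions are assumptions for illustration), the formula differs from entropy only in that surprise is measured under Q:

import math

def cross_entropy(P, Q):
    # average surprise: outcomes drawn from P, surprise measured with Q
    return -sum(p * math.log2(q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.25, 0.25]              # reality
cross_entropy(P, P)                # 1.5 bits: perfect model, H(P,Q) = H(P)
cross_entropy(P, [1/3, 1/3, 1/3])  # ~1.58 bits: worse model, more surprise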

Classification: One-Hot Labels

In classification, P is one-hot: all probability on true class k. Cross-entropy simplifies:

Classification Loss $$H(P, Q) = -\log Q(k)$$
Since P(k)=1 and P(j≠k)=0, only the true class term survives.
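
A quick check of this simplification, using an assumed 3-class example with arbitrary model probabilities:

import math

Q = [0.1, 0.2, 0.7]   # model probabilities over 3 classes
k = 2                 # true class index, so P = [0, 0, 1]

full_sum   = -sum(p * math.log2(q) for p, q in zip([0.0, 0.0, 1.0], Q))
simplified = -math.log2(Q[k])   # only the true-class term survives

print(full_sum, simplified)     # both ≈ 0.515 bits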
[Interactive demo: Cross-Entropy Loss as a function of Q(correct).]

KL Divergence: Cost of Being Wrong

KL divergence is cross-entropy minus entropy. It measures extra surprise from model error:

KL Divergence $$D_{KL}(P \| Q) = H(P, Q) - H(P) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
If Q = P, KL = 0. Any mismatch makes KL positive. Never negative.
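
A minimal sketch in the same style as the helpers above (the name kl_divergence and the binary example distributions are illustrative):

import math

def kl_divergence(P, Q):
    # extra bits paid for modeling P with Q
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

kl_divergence([0.5, 0.5], [0.5, 0.5])  # 0.0   (perfect model)
kl_divergence([0.5, 0.5], [0.9, 0.1])  # ~0.74 (mismatch costs extra bits)
kl_divergence([0.9, 0.1], [0.5, 0.5])  # ~0.53 (note: not symmetric)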
[Interactive demo: KL Divergence (Binary), showing H(P), H(P,Q), and their gap for adjustable true P and model Q.]
Why Minimize Cross-Entropy?

In classification, H(P) is fixed (0 for a one-hot label), so minimizing H(P,Q) is the same as minimizing KL. Either way, we push Q toward P.
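
To make that concrete, here is a small illustrative check: with a one-hot P, the loss H(P,Q) and the KL divergence are numerically identical, and both fall as Q improves (the three candidate Qs are arbitrary):

import math

def H(P):           # entropy
    return -sum(p * math.log2(p) for p in P if p > 0)

def H_cross(P, Q):  # cross-entropy
    return -sum(p * math.log2(q) for p, q in zip(P, Q) if p > 0)

P = [0.0, 1.0, 0.0]  # one-hot label, so H(P) = 0
for Q in ([0.4, 0.3, 0.3], [0.2, 0.6, 0.2], [0.05, 0.9, 0.05]):
    print(H_cross(P, Q), H_cross(P, Q) - H(P))  # loss and KL: ~1.74, ~0.74, ~0.15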

Binary Cross-Entropy

For binary classification with output p = σ(z):

Binary Cross-Entropy $$L = -[y \log(p) + (1-y) \log(1-p)]$$
If y=1, want p high. If y=0, want 1-p high. Penalizes confident wrong predictions.
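
A hedged single-example sketch (the function name bce is illustrative; natural log is used here, matching what most frameworks do):

import math

def bce(y, p):
    # binary cross-entropy for one example with label y and prediction p
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

bce(1, 0.9)   # ~0.105 (confident and right: small loss)
bce(1, 0.1)   # ~2.303 (confident and wrong: large loss)
bce(0, 0.1)   # ~0.105 (predicting "no" when the label is 0)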
[Interactive demo: BCE Loss as a function of the logit z, showing σ(z) and the resulting loss.]

The Takeaway

summary.py
import math

P = [0.7, 0.2, 0.1]   # true distribution
Q = [0.6, 0.3, 0.1]   # model distribution
p = 0.5               # probability of one event

I    = -math.log(p)                                  # surprise
H_P  = sum(p * -math.log(p) for p in P)              # entropy
H_PQ = sum(p * -math.log(q) for p, q in zip(P, Q))   # cross-entropy
KL   = H_PQ - H_P                                    # divergence

# Classification: P is one-hot, H_P = 0
# Loss = H_PQ = -log(q_correct) = KL

Cross-entropy loss measures how surprised we are by true labels using our model's predictions. Minimizing it pushes our model toward reality.
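
For a final sanity check, assuming PyTorch is available (the example logits are arbitrary): torch.nn.functional.cross_entropy computes the negative natural log of the softmax probability assigned to the true class, averaged over the batch.

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # unnormalized scores for 3 classes
target = torch.tensor([0])                 # true class index

framework_loss = F.cross_entropy(logits, target)  # built-in loss

q = torch.softmax(logits, dim=1)                  # model distribution Q
manual_loss = -torch.log(q[0, target[0]])         # -log Q(correct)

print(framework_loss.item(), manual_loss.item())  # both ≈ 0.24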