Information Theory for Machine Learning

From bits to loss functions: why cross-entropy works

Every ML practitioner uses cross-entropy loss. Few understand where it comes from. This tutorial builds it from first principles: surprise, entropy, and the cost of being wrong.

Surprise: The Atomic Unit

Start with a single event. If something unlikely happens, you're surprised. If something certain happens, you're not. We quantify this:

Surprise (Self-Information) $$I(x) = -\log_2 P(x)$$
The negative log of probability. Rare events (low P) give high surprise. Certain events (P=1) give zero surprise.
surprise.py
import math

def surprise(p):
    return -math.log2(p)

surprise(1.0)   # 0.0 bits (certain)
surprise(0.5)   # 1.0 bit  (coin flip)
surprise(0.125) # 3.0 bits (1 in 8)
[Interactive demo: Surprise vs Probability. At P(x) = 0.50, surprise is 1.00 bit.]

Entropy: Average Surprise

Entropy is the expected surprise across all outcomes. Weight each outcome's surprise by its probability:

Entropy $$H(P) = \mathbb{E}[I(x)] = -\sum_x P(x) \log P(x)$$
High entropy = high uncertainty. Low entropy = concentrated probability.
entropy.py
def entropy(probs):
    H = 0
    for p in probs:
        if p > 0: H += p * surprise(p)
    return H

entropy([0.5, 0.5])  # 1.0 bit  (max for 2)
entropy([0.9, 0.1])  # 0.47 bits (lower)
[Interactive demo: Entropy of a Distribution, comparing the entropy of the chosen outcomes to the maximum possible.]
Key Insight

Entropy is maximized when all outcomes are equally likely. Any deviation from uniform reduces entropy.
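
To see this numerically, here is a self-contained sketch (a compact restatement of the entropy function above; the four-outcome distributions are just illustrative) showing entropy fall as probability concentrates:

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

entropy([0.25, 0.25, 0.25, 0.25])  # 2.00 bits (uniform: the maximum for 4 outcomes)
entropy([0.70, 0.10, 0.10, 0.10])  # ~1.36 bits
entropy([0.97, 0.01, 0.01, 0.01])  # ~0.24 bits
entropy([1.00, 0.00, 0.00, 0.00])  # 0.00 bits (one certain outcome)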

Cross-Entropy: Wrong Model Surprise

We don't know the true distribution P. We have a model Q. Cross-entropy measures the average surprise when the world follows P but we use Q:

Cross-Entropy $$H(P, Q) = -\sum_x P(x) \log Q(x)$$
Sample from P (reality), measure surprise using Q (model). If Q = P, cross-entropy equals entropy.
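
As a minimal sketch (the helper name cross_entropy and the example distributions are assumptions for illustration), the formula differs from entropy only in that surprise is measured under Q:

import math

def cross_entropy(P, Q):
    # average surprise: outcomes drawn from P, surprise measured with Q
    return -sum(p * math.log2(q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.25, 0.25]              # reality
cross_entropy(P, P)                # 1.5 bits: perfect model, H(P,Q) = H(P)
cross_entropy(P, [1/3, 1/3, 1/3])  # ~1.58 bits: worse model, more surprise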

Classification: One-Hot Labels

In classification, P is one-hot: all probability on true class k. Cross-entropy simplifies:

Classification Loss $$H(P, Q) = -\log Q(k)$$
Since P(k)=1 and P(j≠k)=0, only the true class term survives.
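
A quick check of this simplification, using an assumed 3-class example with arbitrary model probabilities:

import math

Q = [0.1, 0.2, 0.7]   # model probabilities over 3 classes
k = 2                 # true class index, so P = [0, 0, 1]

full_sum   = -sum(p * math.log2(q) for p, q in zip([0.0, 0.0, 1.0], Q))
simplified = -math.log2(Q[k])   # only the true-class term survives

print(full_sum, simplified)     # both ≈ 0.515 bits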
[Interactive demo: Cross-Entropy Loss as a function of Q(correct).]

KL Divergence: Cost of Being Wrong

KL divergence is cross-entropy minus entropy. It measures extra surprise from model error:

KL Divergence $$D_{KL}(P \| Q) = H(P, Q) - H(P) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
If Q = P, KL = 0. Any mismatch makes KL positive. Never negative.
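
A minimal sketch in the same style as the helpers above (the name kl_divergence and the binary example distributions are illustrative):

import math

def kl_divergence(P, Q):
    # extra bits paid for modeling P with Q
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

kl_divergence([0.5, 0.5], [0.5, 0.5])  # 0.0   (perfect model)
kl_divergence([0.5, 0.5], [0.9, 0.1])  # ~0.74 (mismatch costs extra bits)
kl_divergence([0.9, 0.1], [0.5, 0.5])  # ~0.53 (note: not symmetric)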
[Interactive demo: KL Divergence (Binary), showing H(P), H(P,Q), and their gap for adjustable true P and model Q.]
Why Minimize Cross-Entropy?

In classification, H(P) is fixed (0 for a one-hot label), so minimizing H(P,Q) is the same as minimizing KL. Either way, we push Q toward P.
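
To make that concrete, here is a small illustrative check: with a one-hot P, the loss H(P,Q) and the KL divergence are numerically identical, and both fall as Q improves (the three candidate Qs are arbitrary):

import math

def H(P):           # entropy
    return -sum(p * math.log2(p) for p in P if p > 0)

def H_cross(P, Q):  # cross-entropy
    return -sum(p * math.log2(q) for p, q in zip(P, Q) if p > 0)

P = [0.0, 1.0, 0.0]  # one-hot label, so H(P) = 0
for Q in ([0.4, 0.3, 0.3], [0.2, 0.6, 0.2], [0.05, 0.9, 0.05]):
    print(H_cross(P, Q), H_cross(P, Q) - H(P))  # loss and KL: ~1.74, ~0.74, ~0.15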

Binary Cross-Entropy

For binary classification with output p = σ(z):

Binary Cross-Entropy $$L = -[y \log(p) + (1-y) \log(1-p)]$$
If y=1, want p high. If y=0, want 1-p high. Penalizes confident wrong predictions.
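
A hedged single-example sketch (the function name bce is illustrative; natural log is used here, matching what most frameworks do):

import math

def bce(y, p):
    # binary cross-entropy for one example with label y and prediction p
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

bce(1, 0.9)   # ~0.105 (confident and right: small loss)
bce(1, 0.1)   # ~2.303 (confident and wrong: large loss)
bce(0, 0.1)   # ~0.105 (predicting "no" when the label is 0)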
[Interactive demo: BCE Loss as a function of the logit z, showing σ(z) and the resulting loss.]

The Takeaway

summary.py
import math

P = [0.7, 0.2, 0.1]   # true distribution
Q = [0.6, 0.3, 0.1]   # model distribution
p = 0.5               # probability of one event

I    = -math.log(p)                                  # surprise
H_P  = sum(p * -math.log(p) for p in P)              # entropy
H_PQ = sum(p * -math.log(q) for p, q in zip(P, Q))   # cross-entropy
KL   = H_PQ - H_P                                    # divergence

# Classification: P is one-hot, H_P = 0
# Loss = H_PQ = -log(q_correct) = KL

Cross-entropy loss measures how surprised we are by true labels using our model's predictions. Minimizing it pushes our model toward reality.
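
For a final sanity check, assuming PyTorch is available (the example logits are arbitrary): torch.nn.functional.cross_entropy computes the negative natural log of the softmax probability assigned to the true class, averaged over the batch.

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # unnormalized scores for 3 classes
target = torch.tensor([0])                 # true class index

framework_loss = F.cross_entropy(logits, target)  # built-in loss

q = torch.softmax(logits, dim=1)                  # model distribution Q
manual_loss = -torch.log(q[0, target[0]])         # -log Q(correct)

print(framework_loss.item(), manual_loss.item())  # both ≈ 0.24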