Every ML practitioner uses cross-entropy loss. Few understand where it comes from. This tutorial builds it from first principles: surprise, entropy, and the cost of being wrong.
Surprise: The Atomic Unit
Building block
Start with a single event. If something unlikely happens, you're surprised. If something certain happens, you're not. We quantify this:
import math

def surprise(p):
    return -math.log2(p)

surprise(1.0)    # 0.0 bits (certain)
surprise(0.5)    # 1.0 bit (coin flip)
surprise(0.125)  # 3.0 bits (1 in 8)
Entropy: Average Surprise
Uncertainty of a distribution
Entropy is the expected surprise across all outcomes. Weight each outcome's surprise by its probability:
def entropy(probs):
    H = 0
    for p in probs:
        if p > 0:
            H += p * surprise(p)
    return H

entropy([0.5, 0.5])  # 1.0 bit (max for 2)
entropy([0.9, 0.1])  # 0.47 bits (lower)
Entropy is maximized when all outcomes are equally likely. Any deviation from uniform reduces entropy.
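A quick numeric check with the entropy function above, this time over four outcomes (values rounded):

entropy([0.25, 0.25, 0.25, 0.25])  # 2.0 bits: uniform over 4 outcomes is the maximum
entropy([0.7, 0.1, 0.1, 0.1])      # ~1.36 bits: skewing toward one outcome lowers it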
Cross-Entropy: Wrong Model Surprise
Where loss functions come from
We don't know the true distribution P. We have a model Q. Cross-entropy measures the average surprise when the world follows P but we use Q:
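In symbols, H(P, Q) = -sum over x of P(x) log Q(x): outcomes arrive with probability P(x), but we pay -log Q(x) bits for each one. A minimal sketch reusing surprise() from above (the cross_entropy name and list-based inputs are just for illustration):

def cross_entropy(P, Q):
    # average surprise under model Q when outcomes actually follow P
    H = 0
    for p, q in zip(P, Q):
        if p > 0:
            H += p * surprise(q)
    return H

cross_entropy([0.5, 0.5], [0.5, 0.5])  # 1.0 bit (Q matches P: just the entropy)
cross_entropy([0.5, 0.5], [0.9, 0.1])  # ~1.74 bits (wrong model, extra surprise)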
Classification: One-Hot Labels
In classification, P is one-hot: all probability on true class k. Cross-entropy simplifies:
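With all of P's mass on class k, every term but one vanishes and the loss is just -log Q(k): the surprise at the probability the model gave the true class. A minimal sketch (the cross_entropy_loss name is illustrative; it keeps log2/bits, whereas frameworks typically use the natural log, which only changes the units):

def cross_entropy_loss(q_probs, true_class):
    # one-hot P: only the true-class term survives
    return surprise(q_probs[true_class])  # = -log2(q_true)

cross_entropy_loss([0.7, 0.2, 0.1], 0)  # ~0.51 bits (confident and right)
cross_entropy_loss([0.7, 0.2, 0.1], 2)  # ~3.32 bits (confident and wrong)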
KL Divergence: Cost of Being Wrong
Extra bits from model error
KL divergence is cross-entropy minus entropy. It measures extra surprise from model error:
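In symbols, KL(P || Q) = H(P, Q) - H(P) = sum over x of P(x) log(P(x)/Q(x)); it is never negative and is zero exactly when Q = P. A minimal sketch on top of the entropy and cross_entropy functions above:

def kl_divergence(P, Q):
    # extra bits paid for modeling P with Q
    return cross_entropy(P, Q) - entropy(P)

kl_divergence([0.5, 0.5], [0.5, 0.5])  # 0.0 bits (perfect model)
kl_divergence([0.5, 0.5], [0.9, 0.1])  # ~0.74 bits of avoidable surprise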
In classification, H(P) is fixed (0 for a one-hot label), so minimizing H(P,Q) is the same as minimizing the KL divergence: training pushes Q toward P.
Binary Cross-Entropy
Sigmoid output
For binary classification with output p = σ(z):
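For a label y in {0, 1}, writing the one-hot cross-entropy out over the two classes q1 = p and q0 = 1 - p gives the familiar formula, loss = -[y log p + (1 - y) log(1 - p)]. A minimal sketch in bits to stay consistent with the rest of this tutorial; frameworks typically use the natural log, which only rescales the loss:

def binary_cross_entropy(y, p):
    # y in {0, 1} is the true label, p = sigmoid(z) is the predicted P(y = 1)
    return -(y * math.log2(p) + (1 - y) * math.log2(1 - p))

binary_cross_entropy(1, 0.9)  # ~0.15 bits (confident and right)
binary_cross_entropy(0, 0.9)  # ~3.32 bits (confident and wrong)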
The Takeaway
I = -log(p)                                    # surprise
H_P  = sum(p * -log(p) for p in P)             # entropy
H_PQ = sum(p * -log(q) for p, q in zip(P, Q))  # cross-entropy
KL = H_PQ - H_P                                # divergence

# Classification: P is one-hot, H_P = 0
# Loss = H_PQ = -log(q_correct) = KL
Cross-entropy loss measures how surprised we are by the true labels when we score them with our model's predictions. Minimizing it pushes our model toward reality.