Week 10 — Information Theory: Entropy, Compression, Channels, and Maximum Entropy
Overview
Information theory quantifies uncertainty and the limits of compression and communication, and it supplies the loss functions and regularizers used throughout machine learning. Cross-entropy is the standard training objective for classifiers and language models; KL divergence measures distribution mismatch; mutual information measures dependence; and the maximum-entropy principle explains why the Gaussian and the softmax/exponential family are the “default” distributions. This week ties the probability of Weeks 6–7 to the optimization of Weeks 8–9.
Readings
- C&T Ch 2 — Entropy, relative entropy, mutual information. Chain rules, Jensen’s inequality, the log-sum inequality.
- C&T Ch 3 — Asymptotic equipartition property. Typical sets, source-coding intuition.
- C&T Ch 4 — Entropy rates of a stochastic process. Markov chains, entropy rate.
- C&T Ch 5 — Data compression. Kraft inequality, Huffman coding, the source-coding theorem.
- C&T Ch 8 — Differential entropy. Continuous entropy, the Gaussian, maximum entropy.
- C&T Ch 10–12 — Rate distortion, information theory & statistics, maximum entropy.
Key Concepts
Entropy, relative entropy, mutual information
The entropy of a discrete distribution measures its uncertainty in bits:
\[H(X) = -\sum_x p(x)\log p(x).\]
The relative entropy (KL divergence) measures the cost of using \(q\) when the truth is \(p\):
\[D(p\,\|\,q) = \sum_x p(x)\log\frac{p(x)}{q(x)} \ge 0,\]
with equality iff \(p=q\) (Gibbs’ inequality, a consequence of Jensen). Cross-entropy \(H(p,q) = H(p) + D(p\|q)\) is the quantity minimized when training a classifier or language model. Mutual information \(I(X;Y) = H(X) - H(X\mid Y) = D(p_{XY}\|p_X p_Y)\) measures how much one variable tells you about another; it is zero iff \(X,Y\) are independent. Chain rules decompose joint entropy: \(H(X,Y) = H(X) + H(Y\mid X)\).
AEP, typical sets, entropy rate
The asymptotic equipartition property says that for \(n\) i.i.d. draws the probability of a typical sequence is \(\approx 2^{-nH(X)}\), so there are about \(2^{nH}\) typical sequences — this is why \(H\) is the fundamental compression rate. For a stationary process the entropy rate \(H(\mathcal{X}) = \lim \frac1n H(X_1,\dots,X_n)\) generalizes this (computable from the stationary distribution of a Markov chain, Week 7), and underlies the perplexity metric for language models.
Source coding and compression
The Kraft inequality characterizes which codeword lengths are achievable with a prefix code; the source-coding theorem says the minimum expected codeword length lies between \(H(X)\) and \(H(X)+1\). Huffman coding achieves the optimum for known symbol probabilities. These are the theoretical limits behind every lossless compressor.
Differential entropy and the Gaussian
For continuous variables, \(h(X) = -\int f(x)\log f(x)\,dx\). Among all distributions with a given variance, the Gaussian maximizes differential entropy — the precise sense in which it is the “least-assuming” continuous distribution and a justification for Gaussian noise models (Week 7).
Rate distortion and maximum entropy
Rate–distortion theory gives the minimum bits needed to represent a source within an allowed distortion \(D\) — the lossy-compression analogue of source coding. The maximum-entropy principle: subject to moment constraints (e.g. fixed mean/variance), the entropy-maximizing distribution is in the exponential family \(p(x)\propto \exp(\sum_i \lambda_i f_i(x))\). This is a convex optimization (Week 9) whose Lagrange multipliers are the natural parameters — and softmax is exactly the max-entropy distribution under a linear constraint. Information theory and statistics (method of types, Fisher information) connects entropy to hypothesis testing and estimation (Week 7).
Connections
- Forward / across courses: Cross-entropy and KL are the training and evaluation objectives for classifiers and LLMs (Course 3); entropy/mutual-information appear in feature selection and uncertainty (Course 1); compression and coding connect to data formats and DSP (Course 4).
Further Reading
- Cover & Thomas, Elements of Information Theory, 2nd ed., Chapters 2–5, 8, 10–12.
- MacKay, Information Theory, Inference, and Learning Algorithms, for the ML-flavored treatment.