The Probability Simplex
A probability distribution over q symbols is a point in the (q−1)-dimensional simplex: the set of all vectors (p₁, ..., p_q) with pᵢ ≥ 0 and Σ pᵢ = 1.
For q = 2: the simplex is a line segment [0,1], parameterized by a single probability p. For q = 3: the simplex is an equilateral triangle in ℝ². Every corner is a deterministic distribution (all probability on one symbol); the center is the uniform distribution.
Entropy H(p) assigns a real number to each point of the simplex. The function's geometry determines many fundamental results.
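These simplex landmarks can be checked numerically. A minimal Python sketch (the `entropy` helper is my own name, not from any library): corners of the simplex give zero entropy, the center gives the maximum log₂(q).

```python
import math

def entropy(p):
    """Shannon entropy in bits of a distribution given as a list of probabilities."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

# Corner of the q = 3 simplex: deterministic distribution, zero entropy.
print(entropy([1.0, 0.0, 0.0]))
# Center of the simplex: uniform distribution, maximum entropy log2(3) ~ 1.585 bits.
print(entropy([1/3, 1/3, 1/3]))
```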
Concavity
H is concave on the simplex: for any two distributions p and q and any λ ∈ [0,1]:
H(λp + (1−λ)q) ≥ λH(p) + (1−λ)H(q)
A mixture of two distributions has entropy at least as large as the weighted average of their individual entropies. Intuition: mixing two sources increases uncertainty.
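The inequality can be spot-checked numerically. A small sketch for two binary distributions, with a hand-rolled `entropy` helper (my own name), sweeping the mixing weight λ:

```python
import math

def entropy(p):
    """Shannon entropy in bits."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

p = [0.9, 0.1]
q = [0.2, 0.8]
for lam in [0.0, 0.25, 0.5, 0.75, 1.0]:
    mix = [lam * a + (1 - lam) * b for a, b in zip(p, q)]
    lhs = entropy(mix)                                # H(lam*p + (1-lam)*q)
    rhs = lam * entropy(p) + (1 - lam) * entropy(q)   # lam*H(p) + (1-lam)*H(q)
    assert lhs >= rhs - 1e-12   # concavity: mixture entropy dominates the average
```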
Verifying Concavity
For binary entropy H(p), concavity is visible in the graph: the curve bows upward, never falling below any chord connecting two points.
Formal test for concavity: the second derivative H''(p) ≤ 0 everywhere.
H(p) = −p log₂(p) − (1−p) log₂(1−p)
H'(p) = −log₂(p) − 1/ln(2) + log₂(1−p) + 1/ln(2) = log₂((1−p)/p)
H''(p) = −1/(p ln(2)) − 1/((1−p) ln(2)) = −1/(p(1−p) ln(2)) < 0 for all p ∈ (0,1)
The second derivative is strictly negative everywhere in the interior: H is strictly concave.
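The closed-form second derivative can be confirmed without calculus, via a central finite-difference estimate; a quick sketch (helper names `H` and `H2` are mine):

```python
import math

def H(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def H2(p, h=1e-5):
    """Central finite-difference estimate of H''(p)."""
    return (H(p + h) - 2 * H(p) + H(p - h)) / h**2

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    exact = -1 / (p * (1 - p) * math.log(2))   # the formula derived above
    assert abs(H2(p) - exact) < 1e-3           # estimate matches the closed form
    assert H2(p) < 0                           # strictly concave in the interior
```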
The Capacity-Achieving Distribution
Channel capacity is defined as the maximum mutual information over all input distributions p(x):
C = max_{p(x)} I(X; Y)
where I(X; Y) = H(Y) − H(Y|X) = H(X) − H(X|Y) = H(X) + H(Y) − H(X,Y).
For the binary symmetric channel with error probability Q: the capacity-achieving input distribution is the uniform distribution p(0) = p(1) = 0.5.
Why: H(Y) is maximized by the uniform output distribution. With a BSC, a uniform input gives a uniform output. Any other input distribution makes H(Y) smaller, reducing I(X;Y).
Geometrically: the mutual information I(X;Y) is a concave function of the input distribution p(x) on the simplex, so every local maximum is a global maximum and the set of maximizers is convex. For a symmetric channel, symmetry forces a maximizer at the center of the simplex: the uniform input distribution.
KL Divergence
The Kullback-Leibler divergence (relative entropy) from distribution q to distribution p:
D(p || q) = Σᵢ pᵢ log₂(pᵢ/qᵢ)
D(p || q) ≥ 0 always (Gibbs' inequality). D(p || q) = 0 if and only if p = q.
D is not a true distance: it is asymmetric (D(p||q) ≠ D(q||p) in general) and does not satisfy the triangle inequality. But it acts as a measure of how 'far' p is from q in probability space.
KL divergence appears throughout information theory:
- Mutual information: I(X;Y) = D(p(x,y) || p(x)p(y)). The mutual information is the KL divergence between the joint distribution and the product of marginals — how far the joint is from independence.
- Gibbs' inequality: the lower bound in the noiseless coding theorem (expected code length ≥ H) follows directly from D(p || q) ≥ 0.
- Channel capacity: C = max_{p(x)} I(X;Y) = max_{p(x)} D(p(x,y) || p(x)p(y)).
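The first identity in the list can be checked directly. A sketch, using the joint distribution of a BSC with Q = 0.1 and uniform input (the `kl` helper is my own name):

```python
import math

def kl(p, q):
    """D(p || q) in bits, for distributions given as equal-length lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Joint distribution P(x, y) = P(x) P(y|x) for a BSC with Q = 0.1 and uniform
# input, flattened in the order (0,0), (0,1), (1,0), (1,1).
Q = 0.1
joint = [0.5 * (1 - Q), 0.5 * Q, 0.5 * Q, 0.5 * (1 - Q)]
# Both marginals are uniform, so the product of marginals is 0.25 everywhere.
product = [0.25] * 4
I = kl(joint, product)   # I(X;Y) = D(p(x,y) || p(x)p(y))
# Agrees with the BSC capacity formula 1 - H(Q):
HQ = -Q * math.log2(Q) - (1 - Q) * math.log2(1 - Q)
assert abs(I - (1 - HQ)) < 1e-12
```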
Computing KL Divergence
Example: p = (0.5, 0.5) uniform binary, q = (0.8, 0.2) biased binary.
D(p || q) = 0.5 log₂(0.5/0.8) + 0.5 log₂(0.5/0.2)
= 0.5 log₂(0.625) + 0.5 log₂(2.5)
≈ 0.5 × (−0.678) + 0.5 × 1.322 ≈ −0.339 + 0.661 ≈ 0.322 bits
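The same arithmetic in code, which also exhibits the asymmetry noted earlier (the `kl` helper is my own name):

```python
import math

def kl(p, q):
    """D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.8, 0.2]
print(round(kl(p, q), 3))   # 0.322, matching the hand computation above
print(round(kl(q, p), 3))   # 0.278 -- a different value: D is asymmetric
```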
Channel Capacity as Geometric Distance
Channel capacity has a geometric interpretation in the space of probability distributions.
For a channel p(y|x) with capacity-achieving input distribution p*(x), the capacity satisfies:
C = min_r max_x D(p(y|x) || r(y)) = max_x D(p(y|x) || p*(y))
where p*(y) = Σₓ p*(x) p(y|x) is the output distribution under the optimal input. The minimizing r(y) is exactly p*(y): the minimum-information output distribution, the point in output probability space closest (in KL divergence) to all conditional output distributions simultaneously. Moreover, D(p(y|x) || p*(y)) = C for every input x with p*(x) > 0 — the conditionals are equidistant from p*(y).
This is the information-geometric view: channel capacity is the radius of the smallest KL-divergence ball in output distribution space that contains every conditional output distribution p(y|x).
For the BSC: p(y|x=0) = (1−Q, Q) and p(y|x=1) = (Q, 1−Q). By symmetry, the minimum-information output r(y) = (0.5, 0.5). Capacity = D((1−Q, Q) || (0.5, 0.5)) = 1 − H(Q). The formula recovers the standard result from geometry.
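A sketch verifying the identity D((1−Q, Q) || (0.5, 0.5)) = 1 − H(Q) across several crossover probabilities (helper names `kl` and `H` are mine):

```python
import math

def kl(p, q):
    """D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def H(x):
    """Binary entropy in bits."""
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

for Q in [0.01, 0.1, 0.25, 0.4]:
    geometric = kl([1 - Q, Q], [0.5, 0.5])   # D(p(y|x=0) || r(y))
    standard = 1 - H(Q)                      # textbook BSC capacity formula
    assert abs(geometric - standard) < 1e-9  # the two expressions coincide
```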
Capacity from KL Divergence
Verify the geometric formula: C = D(p(y|x=0) || r(y)) for a BSC with Q = 0.1, r(y) = (0.5, 0.5).
p(y|x=0) = (0.9, 0.1) (send 0, receive 0 with prob 0.9, receive 1 with prob 0.1).
D((0.9, 0.1) || (0.5, 0.5)) = 0.9 log₂(0.9/0.5) + 0.1 log₂(0.1/0.5)
= 0.9 log₂(1.8) + 0.1 log₂(0.2)
log₂(1.8) ≈ 0.848, log₂(0.2) ≈ −2.322
= 0.9×0.848 + 0.1×(−2.322) ≈ 0.763 − 0.232 ≈ 0.531 bits
Check: C = 1 − H(0.1) ≈ 1 − 0.469 = 0.531 bits ✓
Rate-Distortion & the Limits of Compression
Rate-distortion theory extends information theory to lossy compression. Instead of asking 'what is the minimum number of bits needed to represent a source exactly?' it asks: 'if we tolerate an average distortion of at most D, what is the minimum rate R(D), in bits per symbol?'
The rate-distortion function R(D) is convex and nonincreasing in D: more distortion tolerance allows lower rates. At D = 0 (lossless): R(0) = H(source). As D increases toward the maximum tolerable distortion, R(D) decreases to 0.
Geometrically: R(D) traces a curve on the (rate, distortion) plane. Every achievable (R, D) pair lies on or above this curve. Points below the curve are impossible — you cannot compress below the fundamental limit at any distortion level.
The rate-distortion theorem (Shannon, 1959): for any R > R(D), codes exist achieving expected distortion at most D. For R < R(D), no code achieves expected distortion at most D. The curve is a geometric frontier in (rate, distortion) space.
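For one source where R(D) has a closed form — a Bernoulli(1/2) source under Hamming distortion, where R(D) = 1 − H(D) for 0 ≤ D ≤ 1/2 and 0 beyond — the frontier can be computed directly; a sketch (helper names are mine):

```python
import math

def H(x):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def R(D):
    """Rate-distortion function of a fair-coin source under Hamming distortion."""
    return max(0.0, 1 - H(min(D, 0.5)))

assert R(0.0) == 1.0   # lossless: R(0) = H(source) = 1 bit per symbol
assert R(0.5) == 0.0   # distortion 1/2 is achievable at zero rate (guess randomly)
# Between the endpoints the curve is convex and decreasing:
assert 0.0 < R(0.25) < R(0.1) < 1.0
```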