The Probability Simplex
A probability distribution over q symbols is a point in the (q−1)-dimensional simplex: the set of all vectors (p₁, ..., p_q) with pᵢ ≥ 0 and Σ pᵢ = 1.
For q = 2: the simplex is a line segment [0,1], parameterized by a single probability p. For q = 3: the simplex is an equilateral triangle in ℝ². Every corner is a deterministic distribution (all probability on one symbol); the center is the uniform distribution.
Entropy H(p) assigns a real number to each point of the simplex. The function's geometry determines many fundamental results.
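These simplex landmarks can be checked numerically. A minimal Python sketch (the `entropy` helper is my own name, not from any library): corners of the simplex give zero entropy, the center gives the maximum log₂(q).

```python
import math

def entropy(p):
    """Shannon entropy in bits of a distribution given as a list of probabilities."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

# Corner of the q = 3 simplex: deterministic distribution, zero entropy.
print(entropy([1.0, 0.0, 0.0]))
# Center of the simplex: uniform distribution, maximum entropy log2(3) ~ 1.585 bits.
print(entropy([1/3, 1/3, 1/3]))
```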
Concavity
H is concave on the simplex: for any two distributions p and q and any λ ∈ [0,1]:
H(λp + (1−λ)q) ≥ λH(p) + (1−λ)H(q)
A mixture of two distributions has entropy at least as large as the weighted average of their individual entropies. Intuition: mixing two sources increases uncertainty.
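The inequality can be spot-checked numerically. A small sketch for two binary distributions, with a hand-rolled `entropy` helper (my own name), sweeping the mixing weight λ:

```python
import math

def entropy(p):
    """Shannon entropy in bits."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

p = [0.9, 0.1]
q = [0.2, 0.8]
for lam in [0.0, 0.25, 0.5, 0.75, 1.0]:
    mix = [lam * a + (1 - lam) * b for a, b in zip(p, q)]
    lhs = entropy(mix)                                # H(lam*p + (1-lam)*q)
    rhs = lam * entropy(p) + (1 - lam) * entropy(q)   # lam*H(p) + (1-lam)*H(q)
    assert lhs >= rhs - 1e-12   # concavity: mixture entropy dominates the average
```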
Verifying Concavity
For binary entropy H(p), concavity is visible in the graph: the curve bows upward, never falling below any chord connecting two points.
Formal test for concavity: the second derivative H''(p) ≤ 0 everywhere.
H(p) = −p log₂(p) − (1−p) log₂(1−p)
H'(p) = −log₂(p) − 1/ln(2) + log₂(1−p) + 1/ln(2) = log₂((1−p)/p)
H''(p) = −1/(p ln(2)) − 1/((1−p) ln(2)) = −1/(p(1−p) ln(2)) < 0 for all p ∈ (0,1)
The second derivative is strictly negative everywhere in the interior: H is strictly concave.
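The closed-form second derivative can be confirmed without calculus, via a central finite-difference estimate; a quick sketch (helper names `H` and `H2` are mine):

```python
import math

def H(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def H2(p, h=1e-5):
    """Central finite-difference estimate of H''(p)."""
    return (H(p + h) - 2 * H(p) + H(p - h)) / h**2

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    exact = -1 / (p * (1 - p) * math.log(2))   # the formula derived above
    assert abs(H2(p) - exact) < 1e-3           # estimate matches the closed form
    assert H2(p) < 0                           # strictly concave in the interior
```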
The Capacity-Achieving Distribution
Channel capacity is defined as the maximum mutual information over all input distributions p(x):
C = max_{p(x)} I(X; Y)
where I(X; Y) = H(Y) − H(Y|X) = H(X) − H(X|Y) = H(X) + H(Y) − H(X,Y).
For the binary symmetric channel with error probability Q: the capacity-achieving input distribution is the uniform distribution p(0) = p(1) = 0.5.
Why: H(Y) is maximized by the uniform output distribution. With a BSC, a uniform input gives a uniform output. Any other input distribution makes H(Y) smaller, reducing I(X;Y).
Geometrically: the mutual information I(X;Y) is a concave function of the input distribution p(x) on the simplex, so every local maximum is a global maximum and the set of maximizers is convex. For a symmetric channel, symmetry forces a maximizer at the center of the simplex: the uniform input distribution.
KL Divergence
The Kullback-Leibler divergence (relative entropy) from distribution q to distribution p:
D(p || q) = Σᵢ pᵢ log₂(pᵢ/qᵢ)
D(p || q) ≥ 0 always (Gibbs' inequality). D(p || q) = 0 if and only if p = q.
D is not a true distance: it is asymmetric (D(p||q) ≠ D(q||p) in general) and does not satisfy the triangle inequality. But it acts as a measure of how 'far' p is from q in probability space.
KL divergence appears throughout information theory:
- Mutual information: I(X;Y) = D(p(x,y) || p(x)p(y)). The mutual information is the KL divergence between the joint distribution and the product of marginals — how far the joint is from independence.
- Gibbs' inequality: the lower bound in the noiseless coding theorem (expected code length ≥ H) follows directly from D(p || q) ≥ 0.
- Channel capacity: C = max_{p(x)} I(X;Y) = max_{p(x)} D(p(x,y) || p(x)p(y)).
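The first identity in the list can be checked directly. A sketch, using the joint distribution of a BSC with Q = 0.1 and uniform input (the `kl` helper is my own name):

```python
import math

def kl(p, q):
    """D(p || q) in bits, for distributions given as equal-length lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Joint distribution P(x, y) = P(x) P(y|x) for a BSC with Q = 0.1 and uniform
# input, flattened in the order (0,0), (0,1), (1,0), (1,1).
Q = 0.1
joint = [0.5 * (1 - Q), 0.5 * Q, 0.5 * Q, 0.5 * (1 - Q)]
# Both marginals are uniform, so the product of marginals is 0.25 everywhere.
product = [0.25] * 4
I = kl(joint, product)   # I(X;Y) = D(p(x,y) || p(x)p(y))
# Agrees with the BSC capacity formula 1 - H(Q):
HQ = -Q * math.log2(Q) - (1 - Q) * math.log2(1 - Q)
assert abs(I - (1 - HQ)) < 1e-12
```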
Computing KL Divergence
Example: p = (0.5, 0.5) uniform binary, q = (0.8, 0.2) biased binary.
D(p || q) = 0.5 log₂(0.5/0.8) + 0.5 log₂(0.5/0.2)
= 0.5 log₂(0.625) + 0.5 log₂(2.5)
≈ 0.5 × (−0.678) + 0.5 × 1.322 ≈ −0.339 + 0.661 ≈ 0.322 bits
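The same arithmetic in code, which also exhibits the asymmetry noted earlier (the `kl` helper is my own name):

```python
import math

def kl(p, q):
    """D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.8, 0.2]
print(round(kl(p, q), 3))   # 0.322, matching the hand computation above
print(round(kl(q, p), 3))   # 0.278 -- a different value: D is asymmetric
```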
Channel Capacity as Geometric Distance
Channel capacity has a geometric interpretation in the space of probability distributions.
For a channel p(y|x) with capacity-achieving input distribution p*(x), the capacity satisfies:
C = min_r max_x D(p(y|x) || r(y)) = max_x D(p(y|x) || p*(y))
where p*(y) = Σₓ p*(x) p(y|x) is the output distribution under the optimal input. The minimizing r(y) is exactly p*(y): the minimum-information output distribution, the point in output probability space closest (in KL divergence) to all conditional output distributions simultaneously. Moreover, D(p(y|x) || p*(y)) = C for every input x with p*(x) > 0 — the conditionals are equidistant from p*(y).
This is the information-geometric view: channel capacity is the radius of the smallest KL-divergence ball in output distribution space that contains every conditional output distribution p(y|x).
For the BSC: p(y|x=0) = (1−Q, Q) and p(y|x=1) = (Q, 1−Q). By symmetry, the minimum-information output r(y) = (0.5, 0.5). Capacity = D((1−Q, Q) || (0.5, 0.5)) = 1 − H(Q). The formula recovers the standard result from geometry.
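A sketch verifying the identity D((1−Q, Q) || (0.5, 0.5)) = 1 − H(Q) across several crossover probabilities (helper names `kl` and `H` are mine):

```python
import math

def kl(p, q):
    """D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def H(x):
    """Binary entropy in bits."""
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

for Q in [0.01, 0.1, 0.25, 0.4]:
    geometric = kl([1 - Q, Q], [0.5, 0.5])   # D(p(y|x=0) || r(y))
    standard = 1 - H(Q)                      # textbook BSC capacity formula
    assert abs(geometric - standard) < 1e-9  # the two expressions coincide
```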
Capacity from KL Divergence
Verify the geometric formula: C = D(p(y|x=0) || r(y)) for a BSC with Q = 0.1, r(y) = (0.5, 0.5).
p(y|x=0) = (0.9, 0.1) (send 0, receive 0 with prob 0.9, receive 1 with prob 0.1).
D((0.9, 0.1) || (0.5, 0.5)) = 0.9 log₂(0.9/0.5) + 0.1 log₂(0.1/0.5)
= 0.9 log₂(1.8) + 0.1 log₂(0.2)
log₂(1.8) ≈ 0.848, log₂(0.2) ≈ −2.322
= 0.9×0.848 + 0.1×(−2.322) ≈ 0.763 − 0.232 ≈ 0.531 bits
Check: C = 1 − H(0.1) ≈ 1 − 0.469 = 0.531 bits ✓
Rate-Distortion & the Limits of Compression
Rate-distortion theory extends information theory to lossy compression. Instead of asking 'what is the minimum number of bits needed to represent a source exactly?' it asks: 'if we tolerate an average distortion of at most D, what is the minimum rate R(D), in bits per symbol?'
The rate-distortion function R(D) is convex and nonincreasing in D: more distortion tolerance allows lower rates. At D = 0 (lossless): R(0) = H(source). As D increases toward the maximum tolerable distortion, R(D) decreases to 0.
Geometrically: R(D) traces a curve on the (rate, distortion) plane. Every achievable (R, D) pair lies on or above this curve. Points below the curve are impossible — you cannot compress below the fundamental limit at any distortion level.
The rate-distortion theorem (Shannon, 1959): for any R > R(D), codes exist achieving expected distortion at most D. For R < R(D), no code achieves expected distortion at most D. The curve is a geometric frontier in (rate, distortion) space.
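For one source where R(D) has a closed form — a Bernoulli(1/2) source under Hamming distortion, where R(D) = 1 − H(D) for 0 ≤ D ≤ 1/2 and 0 beyond — the frontier can be computed directly; a sketch (helper names are mine):

```python
import math

def H(x):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def R(D):
    """Rate-distortion function of a fair-coin source under Hamming distortion."""
    return max(0.0, 1 - H(min(D, 0.5)))

assert R(0.0) == 1.0   # lossless: R(0) = H(source) = 1 bit per symbol
assert R(0.5) == 0.0   # distortion 1/2 is achievable at zero rate (guess randomly)
# Between the endpoints the curve is convex and decreasing:
assert 0.0 < R(0.25) < R(0.1) < 1.0
```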