What Shannon Called Information
Shannon defined information by measuring surprise. A message with probability p carries:
I = −log₂(p) bits
A certain event (p = 1) carries 0 bits — no surprise, no information. A rare event (p = 1/1024) carries 10 bits.
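Both endpoint cases can be checked directly. A minimal Python sketch (the function name `information_bits` is ours, chosen for illustration):

```python
import math

def information_bits(p: float) -> float:
    """Self-information I = -log2(p) of an event with probability p, in bits."""
    return -math.log2(p)

print(information_bits(1.0))       # certain event: 0 bits of surprise
print(information_bits(1 / 1024))  # rare event: 10 bits
```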
Hamming immediately flags the problem: this is a formula for measuring a quantity, not a definition of the concept. It measures machine-like surprise, not human meaning. A student who already knows the answer to a question receives 0 bits from the answer — regardless of how important the answer is to others.
The formula applies well to telephone systems, radio, computers. It applies poorly to human communication, biology, or meaning. Hamming's preferred name: 'Communication Theory', not 'Information Theory.'
Entropy
For an alphabet of q symbols with probabilities p₁, p₂, ..., p_q, the average information per symbol is the entropy:
H = −Σᵢ pᵢ log₂(pᵢ)
H reaches its maximum when all probabilities are equal: H_max = log₂(q) bits. Any non-uniform distribution has lower entropy.
Computing Entropy
Binary entropy: a source with two symbols, P(0) = p, P(1) = 1−p.
H(p) = −p log₂(p) − (1−p) log₂(1−p)
H(p) = 0 at p = 0 or p = 1 (completely predictable). H(p) = 1 bit at p = 0.5 (completely unpredictable).
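The endpoint and midpoint values can be verified numerically. A small sketch (`binary_entropy` is our own name; the p = 0 and p = 1 cases are handled by the standard convention that x log x → 0):

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = -p log2(p) - (1-p) log2(1-p), with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0  # lim x->0 of x log2(x) is 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # completely unpredictable: 1.0 bit
print(binary_entropy(0.0))  # completely predictable: 0.0 bits
```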
Gibbs' Inequality & Noiseless Coding
Gibbs' inequality: for any two probability distributions p = {pᵢ} and q = {qᵢ}:
−Σ pᵢ log₂(pᵢ) ≤ −Σ pᵢ log₂(qᵢ)
with equality only when p = q. This rests on the elementary fact that ln(x) ≤ x − 1 for all x > 0, with equality at x = 1.
Consequence: entropy H(p) is maximized when all symbols are equally probable. For q symbols: H_max = log₂(q).
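Gibbs' inequality is easy to check numerically: the cross-entropy −Σ pᵢ log₂(qᵢ), the average cost of describing p with a code tuned to q, is never smaller than H(p). A sketch with illustrative distributions of our choosing:

```python
import math

def cross_entropy(p, q):
    """-sum p_i log2(q_i): cost of encoding source p with a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]  # the "wrong" distribution

print(cross_entropy(p, p))  # entropy H(p): 1.5 bits
print(cross_entropy(p, q))  # 1.75 bits >= H(p), as Gibbs' inequality demands
```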
Noiseless coding theorem: given a uniquely decodable code, the Kraft inequality requires Σ 2^(−lᵢ) ≤ 1 where lᵢ is the length of the code for symbol i. By Gibbs' inequality, the average code length L = Σ pᵢ lᵢ satisfies:
L ≥ H(p) = −Σ pᵢ log₂(pᵢ)
You cannot do better than entropy on average. Huffman coding, which is optimal among prefix codes, achieves H ≤ L < H + 1.
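When the probabilities are dyadic (powers of 1/2), the entropy bound is met exactly. A small worked check (the code {0, 10, 11} and probabilities are illustrative choices, not from the text):

```python
import math

# A prefix code for three symbols with probabilities 1/2, 1/4, 1/4:
# codewords 0, 10, 11, i.e. lengths 1, 2, 2.
probs = [0.5, 0.25, 0.25]
lengths = [1, 2, 2]

kraft = sum(2 ** -l for l in lengths)             # Kraft inequality sum
L = sum(p * l for p, l in zip(probs, lengths))    # average code length
H = -sum(p * math.log2(p) for p in probs)         # source entropy

print(kraft)  # 1.0: Kraft holds with equality
print(L, H)   # 1.5 1.5: L = H, the best any uniquely decodable code can do
```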
Channel Capacity
A binary symmetric channel (BSC) flips each bit independently with probability Q = 1 − P, where P is the probability a bit arrives unchanged. The capacity of the BSC, the maximum rate of reliable information transfer, is:
C = 1 + P log₂(P) + Q log₂(Q) = 1 − H(Q)
where H(Q) = −Q log₂(Q) − (1−Q) log₂(1−Q) is the binary entropy of the error rate.
At Q = 0 (no errors): C = 1 bit/transmission (perfect channel). At Q = 0.5 (random flipping): C = 0 (channel carries no information). At Q = 1 (all bits flip): C = 1 (you know exactly what the sender sent, just flip everything back).
C measures the maximum rate R at which you can transmit with arbitrarily small error probability. If R < C, such codes exist. If R > C, they do not exist — no code can beat the capacity.
Computing Channel Capacity
With P = 0.9 (10% error rate, Q = 0.1):
C = 1 + 0.9 log₂(0.9) + 0.1 log₂(0.1)
log₂(0.9) ≈ −0.152, log₂(0.1) ≈ −3.322
C ≈ 1 + 0.9×(−0.152) + 0.1×(−3.322) = 1 − 0.137 − 0.332 ≈ 0.531 bits/transmission
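The same arithmetic as a reusable function (a minimal sketch; `bsc_capacity` is our own name, and the Q = 0 and Q = 1 endpoints return capacity 1 as discussed above):

```python
import math

def bsc_capacity(Q: float) -> float:
    """Capacity C = 1 - H(Q) of a binary symmetric channel with error rate Q."""
    if Q in (0.0, 1.0):
        return 1.0  # perfect channel, or perfectly inverted channel
    H_Q = -Q * math.log2(Q) - (1 - Q) * math.log2(1 - Q)
    return 1.0 - H_Q

print(round(bsc_capacity(0.1), 3))  # 0.531 bits/transmission, matching the hand calculation
print(bsc_capacity(0.5))            # 0.0: pure noise carries nothing
```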
What the Theorem Proves
Shannon's fundamental theorem: for any rate R < C, there exist codes of block length n (with n → ∞) achieving error probability P_E → 0.
The proof uses a surprising argument: random codes. Instead of constructing a specific code, Shannon averaged over all possible random code books (coin-flip encoding). He showed the average error over all code books is small. If the average is small, at least one code achieves small error.
Sphere-based analysis: the sender picks message aᵢ → sphere of radius n(Q + ε₂) around aᵢ in n-dimensional binary space. For large n, the received word bⱼ lies inside this sphere with high probability. The receiver decodes to the codeword whose sphere contains bⱼ.
Four cases determine error probability P_E:
```
aᵢ in sphere?   other aⱼ in sphere?   outcome
yes             no                    correct (no error)
yes             yes                   ambiguous → error
no              yes                   wrong decode → error
no              no                    outside all spheres → error
```
The bound on P_E works out to: P_E ≤ d + M × 2^(n × (H(Q+ε₂) − C)) for suitably chosen d and ε₂. Choosing ε₂ so H(Q+ε₂) < C makes the exponent negative. For large n, the second term → 0.
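To see the second term vanish, one can plug in illustrative numbers (the values of ε₂, M, and n below are our assumptions for the sketch, not from the text; with Q = 0.1 and ε₂ = 0.01 the exponent H(Q+ε₂) − C is about −0.03):

```python
import math

def H2(p: float) -> float:
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

Q, eps2 = 0.1, 0.01          # error rate and slack (illustrative)
C = 1 - H2(Q)                # capacity of the BSC, ~0.531
exponent = H2(Q + eps2) - C  # ~-0.031: negative, so the bound decays with n
M = 1024                     # a fixed code book size, for illustration

for n in (100, 1000, 10000):
    print(n, M * 2 ** (n * exponent))  # second term of the P_E bound shrinks toward 0
```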
The Existential Nature of the Theorem
Hamming was precise about what the theorem does and does not provide.
What it proves: reliable communication at rate R < C is possible, in principle, for large enough n.
What it does not provide: explicit code construction. A random code of block length n large enough to approach capacity has a code book of M × n bits, and since M grows exponentially with n, the book becomes astronomically large. You cannot store or compute with it.
Error-correcting codes vs. Shannon: error-correcting codes (Hamming, Reed-Solomon, turbo, LDPC) provide explicit, computable constructions. They sacrifice some distance from capacity in exchange for practical encoders and decoders. As n grows and more errors are corrected per block, practical codes can approach capacity closely.
The space satellites example: Voyager and Pioneer used powerful error-correcting codes to communicate across billions of miles on 5–20 watts of power. Long block lengths allowed more errors per block to be corrected, pushing close to capacity despite the enormous noise from distance.
Critical Assessment
Hamming closed Chapter 13 with a broader critique of definitions in science. Shannon's information formula measures machine-surprise, not human meaning. The name 'Information Theory' over-promises. The fishing-net analogy: a fisherman who catches only fish larger than the net's mesh concludes there are no smaller fish. The tool's limitations become the world's apparent constraints.
The Problem with Definitions
Hamming used information theory to make a larger methodological point: initial definitions determine what you find, more than most people realize.
Shannon chose to define 'information' as surprise. That definition was productive for communication engineering. But it imported a specific scope — machine-like systems — into a word ('information') that suggests universal applicability.
The fishing net analogy: a net with 6-inch mesh catches only large fish. The fisherman concludes: minimum fish size is 6 inches. The conclusion reflects the tool, not the world.
IQ as a parallel: a test designed to measure 'intelligence', calibrated to produce a normal distribution, then used to define intelligence. The tool shapes the concept.
Hamming's recommendation: whenever you encounter a definition, ask (1) how much does it agree with your prior intuition? (2) how much does it distort? (3) under what conditions was it framed? (4) is it being applied now under different conditions?