What Shannon Called Information
Shannon defined information by measuring surprise. A message with probability p carries:
I = −log₂(p) bits
A certain event (p = 1) carries 0 bits — no surprise, no information. A rare event (p = 1/1024) carries 10 bits.
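Both endpoint cases can be checked directly. A minimal Python sketch (the function name `information_bits` is ours, chosen for illustration):

```python
import math

def information_bits(p: float) -> float:
    """Self-information I = -log2(p) of an event with probability p, in bits."""
    return -math.log2(p)

print(information_bits(1.0))       # certain event: 0 bits of surprise
print(information_bits(1 / 1024))  # rare event: 10 bits
```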
Hamming immediately flags the problem: this is a formula for measuring a quantity, not a definition of the concept. It measures machine-like surprise, not human meaning. A student who already knows the answer to a question receives 0 bits from the answer — regardless of how important the answer is to others.
The formula applies well to telephone systems, radio, computers. It applies poorly to human communication, biology, or meaning. Hamming's preferred name: 'Communication Theory', not 'Information Theory.'
Entropy
For an alphabet of q symbols with probabilities p₁, p₂, ..., p_q, the average information per symbol is the entropy:
H = −Σᵢ pᵢ log₂(pᵢ)
H reaches its maximum when all probabilities are equal: H_max = log₂(q) bits. Any non-uniform distribution has lower entropy.
Computing Entropy
Binary entropy: a source with two symbols, P(0) = p, P(1) = 1−p.
H(p) = −p log₂(p) − (1−p) log₂(1−p)
H(p) = 0 at p = 0 or p = 1 (completely predictable). H(p) = 1 bit at p = 0.5 (completely unpredictable).
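The endpoint and midpoint values can be verified numerically. A small sketch (`binary_entropy` is our own name; the p = 0 and p = 1 cases are handled by the standard convention that x log x → 0):

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = -p log2(p) - (1-p) log2(1-p), with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0  # lim x->0 of x log2(x) is 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # completely unpredictable: 1.0 bit
print(binary_entropy(0.0))  # completely predictable: 0.0 bits
```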
Gibbs' Inequality & Noiseless Coding
Gibbs' inequality: for any two probability distributions p = {pᵢ} and q = {qᵢ}:
−Σ pᵢ log₂(pᵢ) ≤ −Σ pᵢ log₂(qᵢ)
with equality only when p = q. This rests on the elementary fact that ln(x) ≤ x − 1 for all x > 0, with equality at x = 1.
Consequence: entropy H(p) is maximized when all symbols are equally probable. For q symbols: H_max = log₂(q).
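Gibbs' inequality is easy to check numerically: the cross-entropy −Σ pᵢ log₂(qᵢ), the average cost of describing p with a code tuned to q, is never smaller than H(p). A sketch with illustrative distributions of our choosing:

```python
import math

def cross_entropy(p, q):
    """-sum p_i log2(q_i): cost of encoding source p with a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]  # the "wrong" distribution

print(cross_entropy(p, p))  # entropy H(p): 1.5 bits
print(cross_entropy(p, q))  # 1.75 bits >= H(p), as Gibbs' inequality demands
```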
Noiseless coding theorem: given a uniquely decodable code, the Kraft inequality requires Σ 2^(−lᵢ) ≤ 1 where lᵢ is the length of the code for symbol i. By Gibbs' inequality, the average code length L = Σ pᵢ lᵢ satisfies:
L ≥ H(p) = −Σ pᵢ log₂(pᵢ)
You cannot do better than entropy on average. Huffman coding, which is optimal among prefix codes, achieves H ≤ L < H + 1.
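When the probabilities are dyadic (powers of 1/2), the entropy bound is met exactly. A small worked check (the code {0, 10, 11} and probabilities are illustrative choices, not from the text):

```python
import math

# A prefix code for three symbols with probabilities 1/2, 1/4, 1/4:
# codewords 0, 10, 11, i.e. lengths 1, 2, 2.
probs = [0.5, 0.25, 0.25]
lengths = [1, 2, 2]

kraft = sum(2 ** -l for l in lengths)             # Kraft inequality sum
L = sum(p * l for p, l in zip(probs, lengths))    # average code length
H = -sum(p * math.log2(p) for p in probs)         # source entropy

print(kraft)  # 1.0: Kraft holds with equality
print(L, H)   # 1.5 1.5: L = H, the best any uniquely decodable code can do
```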
Channel Capacity
A binary symmetric channel (BSC) flips each bit independently with probability Q = 1 − P, where P is the probability a bit arrives unchanged. The capacity of the BSC, the maximum rate of reliable information transfer, is:
C = 1 + P log₂(P) + Q log₂(Q) = 1 − H(Q)
where H(Q) = −Q log₂(Q) − (1−Q) log₂(1−Q) is the binary entropy of the error rate.
At Q = 0 (no errors): C = 1 bit/transmission (perfect channel). At Q = 0.5 (random flipping): C = 0 (channel carries no information). At Q = 1 (all bits flip): C = 1 (you know exactly what the sender sent, just flip everything back).
C measures the maximum rate R at which you can transmit with arbitrarily small error probability. If R < C, such codes exist. If R > C, they do not exist — no code can beat the capacity.
Computing Channel Capacity
With P = 0.9 (10% error rate, Q = 0.1):
C = 1 + 0.9 log₂(0.9) + 0.1 log₂(0.1)
log₂(0.9) ≈ −0.152, log₂(0.1) ≈ −3.322
C ≈ 1 + 0.9×(−0.152) + 0.1×(−3.322) = 1 − 0.137 − 0.332 ≈ 0.531 bits/transmission
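The same arithmetic as a reusable function (a minimal sketch; `bsc_capacity` is our own name, and the Q = 0 and Q = 1 endpoints return capacity 1 as discussed above):

```python
import math

def bsc_capacity(Q: float) -> float:
    """Capacity C = 1 - H(Q) of a binary symmetric channel with error rate Q."""
    if Q in (0.0, 1.0):
        return 1.0  # perfect channel, or perfectly inverted channel
    H_Q = -Q * math.log2(Q) - (1 - Q) * math.log2(1 - Q)
    return 1.0 - H_Q

print(round(bsc_capacity(0.1), 3))  # 0.531 bits/transmission, matching the hand calculation
print(bsc_capacity(0.5))            # 0.0: pure noise carries nothing
```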
What the Theorem Proves
Shannon's fundamental theorem: for any rate R < C, there exist codes of block length n (with n → ∞) achieving error probability P_E → 0.
The proof uses a surprising argument: random codes. Instead of constructing a specific code, Shannon averaged over all possible random code books (coin-flip encoding). He showed the average error over all code books is small. If the average is small, at least one code achieves small error.
Sphere-based analysis: the sender picks message aᵢ → sphere of radius n(Q + ε₂) around aᵢ in n-dimensional binary space. For large n, the received word bⱼ lies inside this sphere with high probability. The receiver decodes to the codeword whose sphere contains bⱼ.
Four cases determine error probability P_E:
```
aᵢ in sphere?   other aⱼ in sphere?   outcome
yes             no                    correct (no error)
yes             yes                   ambiguous → error
no              yes                   wrong decode → error
no              no                    outside all spheres → error
```
The bound on P_E works out to: P_E ≤ d + M × 2^(n × (H(Q+ε₂) − C)) for suitably chosen d and ε₂. Choosing ε₂ so H(Q+ε₂) < C makes the exponent negative. For large n, the second term → 0.
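To see the second term vanish, one can plug in illustrative numbers (the values of ε₂, M, and n below are our assumptions for the sketch, not from the text; with Q = 0.1 and ε₂ = 0.01 the exponent H(Q+ε₂) − C is about −0.03):

```python
import math

def H2(p: float) -> float:
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

Q, eps2 = 0.1, 0.01          # error rate and slack (illustrative)
C = 1 - H2(Q)                # capacity of the BSC, ~0.531
exponent = H2(Q + eps2) - C  # ~-0.031: negative, so the bound decays with n
M = 1024                     # a fixed code book size, for illustration

for n in (100, 1000, 10000):
    print(n, M * 2 ** (n * exponent))  # second term of the P_E bound shrinks toward 0
```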
The Existential Nature of the Theorem
Hamming was precise about what the theorem does and does not provide.
What it proves: reliable communication at rate R < C is possible, in principle, for large enough n.
What it does not provide: explicit code construction. A random code of block length n large enough to approach capacity has a code book of M × n bits, and since M grows exponentially with n, the book becomes astronomically large. You cannot store or compute with it.
Error-correcting codes vs. Shannon: error-correcting codes (Hamming, Reed-Solomon, turbo, LDPC) provide explicit, computable constructions. They sacrifice some distance from capacity in exchange for practical encoders and decoders. As n grows and more errors are corrected per block, practical codes can approach capacity closely.
The space satellites example: Voyager and Pioneer used powerful error-correcting codes to communicate across billions of miles on 5–20 watts of power. Long block lengths allowed more errors per block to be corrected, pushing close to capacity despite the enormous noise from distance.
Critical Assessment
Hamming closed Chapter 13 with a broader critique of definitions in science. Shannon's information formula measures machine-surprise, not human meaning. The name 'Information Theory' over-promises. The fishing-net analogy: a fisherman who catches only fish larger than the net's mesh concludes there are no smaller fish. The tool's limitations become the world's apparent constraints.
The Problem with Definitions
Hamming used information theory to make a larger methodological point: initial definitions determine what you find, more than most people realize.
Shannon chose to define 'information' as surprise. That definition was productive for communication engineering. But it imported a specific scope — machine-like systems — into a word ('information') that suggests universal applicability.
The fishing net analogy: a net with 6-inch mesh catches only large fish. The fisherman concludes: minimum fish size is 6 inches. The conclusion reflects the tool, not the world.
IQ as a parallel: a test designed to measure 'intelligence', calibrated to produce a normal distribution, then used to define intelligence. The tool shapes the concept.
Hamming's recommendation: whenever you encounter a definition, ask (1) how much does it agree with your prior intuition? (2) how much does it distort? (3) under what conditions was it framed? (4) is it being applied now under different conditions?