Code as a Tree
A prefix-free code maps to a binary tree: the root represents the start of decoding, left branches represent bit 0, right branches represent bit 1, and leaves represent complete codewords.
The geometric constraint: every codeword terminates a root-to-leaf path. No leaf can have a descendant (if it did, its codeword would be a prefix of the descendant's codeword, violating the prefix-free property).
This gives the Kraft inequality a geometric interpretation:
Each leaf at depth d 'occupies' a fraction 2^(−d) of the total tree capacity. The total capacity of a complete binary tree at depth D is 1. A prefix-free code uses leaves at various depths; the Kraft sum measures total occupancy.
Kraft sum = 1: complete code (every path ends at a leaf — perfect packing).
Kraft sum < 1: incomplete code (some tree capacity unused — more symbols could be added).
Kraft sum > 1: impossible for prefix-free codes (some paths would have to share a leaf).
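The three cases above can be checked mechanically. A minimal sketch (the function name `kraft_sum` is illustrative; exact rational arithmetic avoids floating-point surprises):

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Sum of 2^(-l) over all codeword lengths, computed exactly."""
    return sum(Fraction(1, 2**l) for l in lengths)

# Complete code: lengths {1, 2, 3, 3} pack the tree exactly.
assert kraft_sum([1, 2, 3, 3]) == 1        # complete
# Incomplete code: some tree capacity is left unused.
assert kraft_sum([2, 2, 3]) < 1            # incomplete
# Infeasible: no prefix-free code can have these lengths.
assert kraft_sum([1, 1, 2]) > 1            # impossible
```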
Deeper Leaves = Longer Codes = Less Capacity
A leaf at depth 1 uses half the tree capacity (2^(−1) = 0.5).
A leaf at depth 3 uses one-eighth (2^(−3) = 0.125).
Placing a short codeword high in the tree blocks all its descendants from being used — you trade one short code for many potential longer codes.
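The trade-off is easy to quantify: in a tree of depth D, a leaf placed at depth d forecloses 2^(D−d) potential depth-D leaves. A small sketch (the name `blocked_leaves` is illustrative):

```python
def blocked_leaves(d, D):
    """Number of depth-D leaf positions foreclosed by a leaf at depth d."""
    return 2 ** (D - d)

# In a depth-3 tree, a codeword at depth 1 blocks half of the eight depth-3 slots:
assert blocked_leaves(1, 3) == 4
assert blocked_leaves(3, 3) == 1  # a maximal-depth leaf blocks only itself
```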
Building a Prefix-Free Tree
Consider 5 symbols with lengths l = {1, 2, 3, 3, 3}. Kraft sum = 2^(−1) + 2^(−2) + 3·2^(−3) = 0.5 + 0.25 + 0.375 = 1.125 > 1.
This exceeds 1: these lengths are incompatible with a prefix-free code. At least one length must increase.
Fix: change l_1 from 1 to 2. New lengths {2, 2, 3, 3, 3}: Kraft = 2·2^(−2) + 3·2^(−3) = 0.5 + 0.375 = 0.875 < 1. Valid, but incomplete.
Completion: keep lengths {2, 2, 3, 3, 3} and add a sixth symbol with l_6 = 3 to use the remaining capacity. Before the addition, Kraft = 0.875; the remainder 1 − 0.875 = 0.125 = 2^(−3) is exactly room for one more depth-3 leaf. New Kraft sum = 1: complete.
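Once a set of lengths satisfies Kraft's inequality, codewords can be assigned greedily, shortest first, handing out the next free node at each depth (this is canonical coding; the function name `canonical_code` is an illustration, not a library API):

```python
from fractions import Fraction

def canonical_code(lengths):
    """Assign prefix-free codewords for the given lengths (canonical coding).
    Assumes the lengths satisfy Kraft's inequality."""
    assert sum(Fraction(1, 2**l) for l in lengths) <= 1, "Kraft violated"
    code, value, prev_len = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - prev_len)            # descend to the new depth
        code.append(format(value, f'0{l}b'))
        value += 1                          # next free node at this depth
        prev_len = l
    return code

print(canonical_code([2, 2, 3, 3, 3]))      # valid but incomplete (Kraft 0.875)
print(canonical_code([2, 2, 3, 3, 3, 3]))   # complete (Kraft sum = 1)
```

For {2, 2, 3, 3, 3} this yields 00, 01, 100, 101, 110; adding the sixth depth-3 length fills in 111 and perfectly packs the tree.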
Why Entropy Bounds Code Length
Kraft's inequality constrains code structure (lengths must fit in the tree). Shannon's entropy constrains code efficiency (average length cannot beat a theoretical floor).
Optimal code lengths. For a symbol with probability p_i, the ideal code length is log₂(1/p_i). Rare symbols deserve long codes; frequent symbols deserve short ones. This mirrors the Kraft equality: 2^(−l_i) = p_i when l_i = log₂(1/p_i).
But log₂(1/p_i) is rarely an integer. Rounding up to ⌈log₂(1/p_i)⌉ gives the Shannon code length, which satisfies H ≤ L < H + 1. (Huffman coding does at least as well, so its average length obeys the same bound.)
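The bound H ≤ L < H + 1 can be verified numerically. A minimal sketch using a hypothetical non-dyadic source (the names `shannon_lengths` and `entropy` are illustrative):

```python
import math

def shannon_lengths(probs):
    """Ceil(log2(1/p)) for each symbol: the Shannon code lengths."""
    return [math.ceil(math.log2(1 / p)) for p in probs]

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs)

p = [0.4, 0.3, 0.2, 0.1]                   # hypothetical source, non-dyadic
L = sum(pi * li for pi, li in zip(p, shannon_lengths(p)))
H = entropy(p)
assert H <= L < H + 1                      # rounding up costs < 1 bit/symbol
```

For a dyadic source the rounding is free: `shannon_lengths([0.5, 0.25, 0.125, 0.125])` returns `[1, 2, 3, 3]` and L equals H exactly.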
Geometric reading. Plot each symbol as a point on the binary tree at depth l_i. The Kraft sum measures total leaf coverage. Entropy measures the weighted average depth, weighted by probability. Shannon's theorem: probability-weighted average depth ≥ probability-weighted information content.
Entropy Bound Verification
The worked example: p = [1/2, 1/4, 1/8, 1/8] for symbols {A, B, C, D}.
Optimal lengths: l_A = log₂(2) = 1, l_B = log₂(4) = 2, l_C = log₂(8) = 3, l_D = log₂(8) = 3.
These are all integers — a perfect match. Huffman code: A=0, B=10, C=110, D=111.
L = 0.5·1 + 0.25·2 + 0.125·3 + 0.125·3 = 0.5 + 0.5 + 0.375 + 0.375 = 1.75.
H = 0.5·1 + 0.25·2 + 0.125·3 + 0.125·3 = 1.75. L = H exactly: optimal code.
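The arithmetic above can be checked directly against the code from the text:

```python
import math

p = {'A': 0.5, 'B': 0.25, 'C': 0.125, 'D': 0.125}
code = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}  # Huffman code from the text

H = sum(pi * math.log2(1 / pi) for pi in p.values())   # entropy
L = sum(p[s] * len(code[s]) for s in p)                # average code length
assert math.isclose(H, 1.75) and math.isclose(L, 1.75)  # L = H: optimal
```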
A Full Worked Example
Full pipeline: given probabilities, compute entropy, verify a code meets the bound.
Source: p(A)=0.5, p(B)=0.25, p(C)=0.125, p(D)=0.125.
H = 1.75 bits (computed above).
A naive block code: 4 symbols → 2 bits each → L = 2.0. Overhead over entropy: 2.0 − 1.75 = 0.25 bits/symbol = 12.5% waste.
The variable-length code saves 12.5% compared to fixed-length. For large messages, this is substantial.
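The saving shows up empirically on a long message. A sketch that draws a random 100,000-symbol message from the source and compares the two encodings (the ~12.5% figure is the expected saving; any one run fluctuates slightly around it):

```python
import random

random.seed(0)
symbols, probs = ['A', 'B', 'C', 'D'], [0.5, 0.25, 0.125, 0.125]
code = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}

msg = random.choices(symbols, weights=probs, k=100_000)
var_bits = sum(len(code[s]) for s in msg)   # variable-length encoding
fixed_bits = 2 * len(msg)                   # naive 2-bit block code
print(f"fixed: {fixed_bits}, variable: {var_bits}, "
      f"saving: {1 - var_bits / fixed_bits:.1%}")   # ~12.5% on average
```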
Entropy rate of English. Shannon estimated the entropy of written English at about 1.0 to 1.3 bits per character, far below the roughly 4.7 bits per character a fixed-length code over the 26-letter alphabet would need. That roughly 4:1 ratio reflects the massive redundancy in natural language — common letters cluster, words repeat, grammar constrains sequences.
Compression Cannot Beat Entropy
Coding efficiency: H / L, the entropy divided by the average code length. For the naive block code in our example: 1.75 / 2.0 = 0.875, i.e. 87.5% efficiency; the variable-length code reaches 1.75 / 1.75 = 100%.
Can you compress further? Only by using context: if you encode pairs or triples of symbols together, the joint entropy H(X,Y) may be less than 2·H(X) due to statistical dependencies. This is the extension of the noiseless coding theorem to n-grams.
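The pair-coding gain can be illustrated with a hypothetical dependent source: two symbols where each tends to repeat its predecessor. The joint distribution below is an assumption chosen for illustration, not taken from the text:

```python
import math

# Hypothetical sticky source: a symbol repeats with probability 0.9,
# and the marginal distribution is uniform over {'a', 'b'}.
p_x = {'a': 0.5, 'b': 0.5}
p_xy = {('a', 'a'): 0.45, ('a', 'b'): 0.05,
        ('b', 'a'): 0.05, ('b', 'b'): 0.45}

H_x = sum(p * math.log2(1 / p) for p in p_x.values())    # 1 bit/symbol
H_xy = sum(p * math.log2(1 / p) for p in p_xy.values())  # joint entropy of a pair
assert H_xy < 2 * H_x   # dependence pushes H(X,Y) below 2·H(X)
print(H_x, H_xy / 2)    # per-symbol entropy drops when coding pairs
```

Here H(X,Y)/2 is well under 1 bit per symbol, so a code over pairs can beat any single-symbol code — exactly the n-gram extension the paragraph describes.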