How Huffman Builds the Optimal Code

Hamming presents Huffman coding as a greedy algorithm that constructs a minimum-average-length prefix-free code. The key idea: build the tree bottom-up, always merging the two least probable symbols.

The Forward Pass (Build)

1. Sort all symbols by probability, highest to lowest.

2. Take the two lowest-probability symbols. Combine them into a new meta-symbol with probability = sum of the two.

3. Insert the meta-symbol in its sorted position.

4. Repeat until only two symbols remain.

5. Assign 0 and 1 to the two remaining symbols.

The Backward Pass (Assign Codes)

Undo the merges in reverse: at each split, the two symbols that were merged receive the same prefix bits as their combined parent, plus an additional 0 (one child) or 1 (other child). The 0/1 assignment is arbitrary — either is valid.
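
The two passes can be sketched in a few lines of Python (a minimal illustration, not Hamming's program; a heap stands in for the re-sorted list, and the 0/1 orientation at each split is one arbitrary choice among the valid ones):

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Forward pass: repeatedly merge the two least probable nodes.
    Backward pass: walk the merge tree, appending a bit at each split."""
    tb = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(p, next(tb), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)  # lowest probability
        p2, _, b = heapq.heappop(heap)  # second lowest
        heapq.heappush(heap, (p1 + p2, next(tb), (a, b)))  # meta-symbol
    codes = {}
    def assign(node, prefix):
        if isinstance(node, tuple):      # a merge: children get prefix + 0/1
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"  # lone-symbol source still gets a bit
    assign(heap[0][2], "")
    return codes

print(huffman_codes({"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}))
# → {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
```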

[Figure: Huffman build, merge steps and decode tree]

Why this is optimal: suppose some other prefix-free code had a smaller average length L'. Applying the same forward reduction to that code would eventually produce a two-symbol code with average length less than 1 bit, which is impossible. Contradiction.

Tracing the Algorithm

Example from Hamming: four symbols A, B, C, D with p(A)=0.5, p(B)=0.25, p(C)=0.125, p(D)=0.125.

Forward pass:

Step 1: Sorted: A(0.5), B(0.25), C(0.125), D(0.125). Lowest two: C, D.

Step 2: Merge CD with p=0.25. New list: A(0.5), B(0.25), CD(0.25).

Step 3: Lowest two: B(0.25), CD(0.25). Merge them into BCD with p=0.5.

Step 4: Two symbols remain: A(0.5), BCD(0.5). Assign A=0, BCD=1.

Backward pass:

BCD → B gets 10, CD gets 11. CD → C gets 110, D gets 111.

Final code: A=0 (l=1), B=10 (l=2), C=110 (l=3), D=111 (l=3).

L = 0.5·1 + 0.25·2 + 0.125·3 + 0.125·3 = 1.75 bits.
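
A quick check of this trace in Python. The entropy comparison is an added observation, not from the lesson text: because every probability here is a power of 1/2, each code length equals −log2 p, and Huffman meets the entropy bound exactly.

```python
from math import log2

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
lengths = {"A": 1, "B": 2, "C": 3, "D": 3}       # final code from the trace

L = sum(probs[s] * lengths[s] for s in probs)    # average code length
H = -sum(p * log2(p) for p in probs.values())    # source entropy

assert L == 1.75 and H == 1.75  # dyadic probabilities: L = H, zero coding loss
```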

Apply the Huffman algorithm to: p(X1)=0.4, p(X2)=0.3, p(X3)=0.2, p(X4)=0.1. Show all merge steps, the final code for each symbol, and compute L.

Multiple Valid Huffman Codes

Hamming notes two sources of non-uniqueness:

1. Arbitrary 0/1 assignment. At each split, you can give 0 to either child. Swapping 0 and 1 throughout gives a different code with identical L.

2. Tie-breaking. When two nodes have equal probability, either can be placed 'higher' (combined first). Different tie-breaking choices can produce structurally different trees — 'long' vs 'bushy' — with the same L but different code-length distributions.

Long vs Bushy

Long tree (skewed): one symbol at each depth level (Fibonacci-like structure). Code lengths vary widely: one symbol gets a short code, others cascade to longer codes.

Bushy tree (balanced): symbols spread evenly across depth levels. Code lengths cluster near the average. Lower variance.

[Figure: long vs bushy Huffman trees]

Hamming's recommendation: when given a choice, prefer the bushy tree. Same L, but smaller variance in code lengths → more uniform decoding time → smaller buffer requirements in real-time applications.

Computing Code-Length Variance

Variance of code lengths: Var = E[l²] − (E[l])²

For the code {A=0 l=1, B=10 l=2, C=110 l=3, D=111 l=3} with p=[0.5, 0.25, 0.125, 0.125]:

E[l] = L = 1.75

E[l²] = 0.5·1² + 0.25·2² + 0.125·3² + 0.125·3² = 0.5 + 1.0 + 1.125 + 1.125 = 3.75

Var = 3.75 − 1.75² = 3.75 − 3.0625 = 0.6875

By contrast, for four equal-probability symbols a bushy code uses all length-2 codewords: L=2, Var=0.
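
The variance computation above, as a runnable sketch:

```python
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]                # A=0, B=10, C=110, D=111

E_l = sum(p * l for p, l in zip(probs, lengths))        # E[l]  = 1.75
E_l2 = sum(p * l * l for p, l in zip(probs, lengths))   # E[l²] = 3.75
var = E_l2 - E_l ** 2                                   # 3.75 − 3.0625

assert var == 0.6875
```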

Consider 4 equiprobable symbols (p=0.25 each). Compute H. Then compare: (a) the bushy code {00, 01, 10, 11} with all lengths = 2, and (b) a long code with lengths {1, 2, 3, 3}. Compute L and Var for each. Which should you prefer, and why?

Compression Gains vs Symbol Distribution

Hamming's rule: Huffman coding pays off only when symbol probabilities differ substantially.

Equal probabilities. If all 2^k symbols have equal probability 1/2^k, Huffman produces a block code: every symbol gets length k. L = H = k. No gain over the simple block code.

Skewed probabilities. If one symbol dominates, with probability far above the uniform 1/2^k, Huffman assigns it a very short code (length 1 or 2), dramatically reducing L.

The comma code (Hamming's term). When each probability exceeds 2/3 of the remaining total, Huffman naturally produces s1→0, s2→10, s3→110, ..., down to two equal-length codewords at the end. This is a 'comma code': the trailing 0 after a run of 1s acts like a comma.
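
A sketch showing the comma-code shape emerge. The probabilities below are an illustrative choice satisfying the 2/3 condition, not from the text; the build function is the same heap-based merge loop as before.

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Heap-based Huffman build; returns a symbol -> codeword table."""
    tb = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(p, next(tb), s) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tb), (a, b)))
    codes = {}
    def assign(node, prefix):
        if isinstance(node, tuple):
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:
            codes[node] = prefix
    assign(heap[0][2], "")
    return codes

# Each probability exceeds 2/3 of the total that follows it.
codes = huffman_codes({"s1": 0.5, "s2": 0.25, "s3": 0.125,
                       "s4": 0.0625, "s5": 0.0625})
print(codes)
# → {'s1': '0', 's2': '10', 's3': '110', 's4': '1110', 's5': '1111'}
```

The trailing 0 terminates each codeword, exactly the comma behavior described above.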

Hamming notes: real-world data compression (backup, file archiving) can cut storage by more than half when the source has skewed probabilities — English text being a prime example.

Huffman vs Block Code: Numeric Comparison

Hamming's second example: p(s1)=1/3, p(s2)=1/5, p(s3)=1/6, p(s4)=1/10, p(s5)=1/12, p(s6)=1/20, p(s7)=1/30, p(s8)=1/30.

Block code: 8 symbols → 3 bits each → L_block = 3.

Working the Huffman construction on these probabilities gives L_Huffman = 159/60 = 2.65 bits.

Savings: (3 − 2.65)/3 ≈ 12% compression over block coding.

When symbol probabilities differ greatly (1/3 vs 1/30 here, a 10:1 ratio), variable-length coding pays substantially.
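
The comparison can be checked with exact fractions (a verification sketch; this merge loop tracks only code lengths, since the actual codewords are not needed for L):

```python
from fractions import Fraction as F
from math import log2

probs = [F(1, 3), F(1, 5), F(1, 6), F(1, 10),
         F(1, 12), F(1, 20), F(1, 30), F(1, 30)]
assert sum(probs) == 1

lengths = [0] * len(probs)
nodes = [(p, [i]) for i, p in enumerate(probs)]  # (probability, symbols underneath)
while len(nodes) > 1:
    nodes.sort()                       # ascending by probability
    (p1, a), (p2, b) = nodes[0], nodes[1]
    for i in a + b:                    # every symbol under the merged node
        lengths[i] += 1                # ...sinks one level deeper
    nodes = nodes[2:] + [(p1 + p2, a + b)]

L = sum(p * l for p, l in zip(probs, lengths))
H = -sum(float(p) * log2(float(p)) for p in probs)
# lengths = [2, 2, 3, 3, 3, 4, 5, 5]; L = 159/60 = 2.65 bits; H ≈ 2.60 bits
```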

The 8-symbol source above has H ≈ 2.60 bits (you do not need to verify this). The Huffman code achieves L = 2.65 bits. The block code uses L = 3 bits. Compute: (a) L/H for the Huffman code, (b) L/H for the block code, and (c) what these ratios tell you about the efficiency of each encoding relative to the theoretical minimum.

Self-Compiling Programs

Hamming ends Chapter 11 with a striking idea: a Huffman encoder can encode itself. The decoding tree (which tells the decoder how to read the compressed message) gets transmitted along with the compressed data. The receiver needs no prior knowledge of the code.

The encoder: samples the data, estimates probabilities, constructs the Huffman code, encodes the decoding tree as a header, then encodes the data.

The decoder: reads the header to reconstruct the tree, then decodes the data using it.

Hamming's point: this entire pipeline can run as a library function with no human involvement. You call it, it compresses and decompresses automatically. Modern archiving formats (ZIP, gzip, bzip2) implement exactly this pattern.
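
The whole pipeline fits in a short sketch (assumptions: a JSON code table stands in for the serialized decoding tree, and the output stays as a string of '0'/'1' characters rather than packed bytes; real formats do both more compactly):

```python
import heapq
import json
from collections import Counter
from itertools import count

def huffman_codes(probs):
    """Heap-based Huffman build; returns a symbol -> codeword table."""
    tb = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(p, next(tb), s) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tb), (a, b)))
    codes = {}
    def assign(node, prefix):
        if isinstance(node, tuple):
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"  # lone-symbol source still gets a bit
    assign(heap[0][2], "")
    return codes

def compress(text):
    """Sample the data, build the code, and prepend the code table as a
    header, so the receiver needs no prior knowledge of the code."""
    freq = Counter(text)
    codes = huffman_codes({s: n / len(text) for s, n in freq.items()})
    header = json.dumps(codes)              # the transmitted decoding table
    bits = "".join(codes[s] for s in text)
    return header, bits

def decompress(header, bits):
    """Read the header to rebuild the table, then decode bit by bit."""
    decode = {code: sym for sym, code in json.loads(header).items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode:       # prefix-free: the first match is the symbol
            out.append(decode[buf])
            buf = ""
    return "".join(out)

header, bits = compress("abracadabra")
assert decompress(header, bits) == "abracadabra"   # lossless round trip
```

No human chooses the code: the encoder measures the data and adapts, which is exactly the point Hamming is making.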

The deeper principle: the intelligence is in the algorithm, not in a fixed code table. The algorithm adapts to whatever data it receives. This is the pattern Hamming sees across all great engineering systems: adaptability built in, not bolted on.

Hamming says the self-compiling Huffman encoder 'requires no human interference or thought.' What is the engineering virtue of this property, and what does it require of the algorithm's design? Give one concrete example of a modern system that embodies this same principle.