Bell Labs, 1947
Richard Hamming joined Bell Telephone Laboratories in 1946. The relay computers there were attended only on weekdays, when technicians could restart them after errors. On weekends the machines ran unattended and halted whenever something went wrong, leaving jobs stranded until Monday.
Hamming was furious. 'If the machine can detect an error,' he thought, 'why can't it locate it and correct it?' This frustration, combined with deep familiarity with binary arithmetic and parity checks, set the stage.
First Insight: Rectangular Codes
Hamming's first idea: arrange message bits in an m×n rectangle, add a parity check to every row & every column. A single error produces exactly one failing row check & one failing column check. Their intersection names the error's position.
Redundancy ratio: (m+1)(n+1) / mn. Calculus tells you a square minimizes this for fixed message size. But as m & n grow, a double error becomes more likely — an engineering judgment with no universal answer.
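A minimal sketch of the row-and-column idea (function names like `rect_parities` are mine, not Hamming's; even parity is assumed):

```python
# Rectangular code sketch: even parity per row and per column.
# A single flipped message bit fails exactly one row check and one
# column check; their intersection names the bit.

def rect_parities(grid):
    """Even-parity check bits for each row and column of a 0/1 grid."""
    rows = [sum(r) % 2 for r in grid]
    cols = [sum(c) % 2 for c in zip(*grid)]
    return rows, cols

def locate_single_error(grid, rows0, cols0):
    """Compare stored parities against a possibly corrupted grid.

    Assumes the error (if any) hit a message bit, not a check bit.
    """
    rows1, cols1 = rect_parities(grid)
    bad_r = [i for i, (a, b) in enumerate(zip(rows0, rows1)) if a != b]
    bad_c = [j for j, (a, b) in enumerate(zip(cols0, cols1)) if a != b]
    if not bad_r and not bad_c:
        return None                      # no error detected
    return bad_r[0], bad_c[0]            # intersection locates the bit

msg = [[1, 0, 1, 1],
       [0, 1, 0, 0],
       [1, 1, 1, 0],
       [0, 0, 1, 1]]
r0, c0 = rect_parities(msg)
msg[2][1] ^= 1                           # flip one bit in transit
print(locate_single_error(msg, r0, c0))  # -> (2, 1)
```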
Minimizing Rectangular Redundancy
A 4×4 rectangle carries 16 message bits with 4 row checks & 4 column checks, plus 1 corner parity bit = 9 check bits for 16 message bits.
Redundancy ratio: (m+1)(n+1) / mn = 25/16 ≈ 1.56.
For a 10×10 rectangle: 100 message bits, 121 total bits, redundancy ≈ 1.21.
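The ratios above are easy to check, and the comparison makes the square's advantage concrete (this helper is illustrative, not from the source):

```python
# Redundancy ratio (m+1)(n+1)/(m*n) for a few rectangle shapes.
# For a fixed message size, the square shape minimizes the ratio,
# and the ratio falls toward 1 as the rectangle grows.
def ratio(m, n):
    return (m + 1) * (n + 1) / (m * n)

print(f"{ratio(2, 8):.4f}")    # -> 1.6875  (16 message bits, non-square)
print(f"{ratio(4, 4):.4f}")    # -> 1.5625  (same 16 bits, square wins)
print(f"{ratio(10, 10):.4f}")  # -> 1.2100  (larger block, less overhead)
```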
The Syndrome as a Binary Number
A few weeks after the rectangular code insight, Hamming was riding toward New York through New Jersey farmland, mentally reviewing his successes. The triangle code occurred to him — better redundancy. Then the cube. Then 4-dimensional, 5-dimensional...
Each extra dimension improved redundancy. A hypercube of side 2 in n dimensions uses only n+1 parity checks for 2^n vertices. But was this optimal?
The Counting Argument
n+1 parity bits produce a syndrome: an (n+1)-bit binary number. That syndrome needs to identify 2^n + 1 distinct outcomes: each of the 2^n error positions, plus the special 'no error' outcome.
But 2^(n+1) = 2·2^n syndrome values is nearly twice that. The factor-of-2 slack meant the hypercube code could not be optimal, yet Hamming saw no way to squeeze it out. He shelved the problem.
The Key Insight
Later, Hamming returned with a new idea: use the syndrome itself as a binary number naming the error position. Position 1 = binary 001, position 2 = binary 010, position 3 = binary 011, etc. Reserve all-zeros for 'no error.'
This transforms the syndrome from an output of parity checks into an address. The parity checks are designed to produce exactly the right address when any single bit flips.
Designing the (7,4) Code
For a 7-bit code (3 parity bits, 4 message bits), positions 1 through 7 in binary are: 001, 010, 011, 100, 101, 110, 111.
Parity check 1 covers positions where bit 0 = 1: positions 1, 3, 5, 7.
Parity check 2 covers positions where bit 1 = 1: positions 2, 3, 6, 7.
Parity check 3 covers positions where bit 2 = 1: positions 4, 5, 6, 7.
Parity bits occupy power-of-2 positions: 1, 2, 4. Message bits occupy the rest: 3, 5, 6, 7.
If bit 6 flips, parity checks 2 & 3 fail (110 in binary = 6). The syndrome reads 110 = 6. Flip position 6. Done.
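The whole scheme fits in a few lines. A sketch, following the position layout just described (even parity assumed; the function names are mine):

```python
# (7,4) Hamming code: parity bits at positions 1, 2, 4; the 3-bit
# syndrome, read as a binary number, names the flipped position.

def encode(d):
    """d: 4 message bits -> 7-bit codeword (positions 1..7)."""
    c = [0] * 8                      # c[0] unused; 1-indexed positions
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]        # covers positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]        # covers positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]        # covers positions with bit 2 set
    return c[1:]

def decode(word):
    """word: 7 received bits -> (corrected 4 message bits, syndrome)."""
    c = [0] + list(word)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]   # recompute each parity check
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s4 * 4 + s2 * 2 + s1  # read as a binary address
    if syndrome:
        c[syndrome] ^= 1             # flip the named position
    return [c[3], c[5], c[6], c[7]], syndrome

msg = [1, 0, 1, 1]
sent = encode(msg)
sent[5] ^= 1                         # corrupt position 6 (0-indexed 5)
fixed, syn = decode(sent)
print(syn)                           # -> 6
print(fixed == msg)                  # -> True
```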
Single Error Correct, Double Error Detect
The (7,4) Hamming code corrects single errors. But what if two bits flip? Without extra protection, the decoder applies the syndrome rule and 'corrects' the codeword to the wrong message — a silent miscorrection.
SECDED: One More Parity Bit
Add a single parity bit p₀ covering the entire codeword (all 7 bits). The syndrome now has four bits: the original three checks plus p₀.
```
old syndrome   p0   meaning
000            0    correct
000            1    error in p0 only
xxx            1    single error, old syndrome names it
xxx            0    double error: flag it
```
The four cases are exhaustive. A double error flips two bits: p₀ flips twice and returns to 0, while the old syndrome cannot read 000, because it equals the XOR of the two flipped positions' binary addresses, which is nonzero whenever the positions differ. The pattern of a nonzero syndrome with p₀ clear is unmistakable.
Why SECDED Works
The SECDED rule exploits parity's modular structure. With even parity, any single flip changes p₀. Any double flip leaves p₀ unchanged. So p₀ counts errors modulo 2.
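The four-case rule translates directly into a decoder. A sketch (the codeword layout matches the scheme in this section; `secded_decode` and its status strings are mine):

```python
# SECDED sketch: the (7,4) code plus an overall even-parity bit p0
# over all 7 code bits, decoded with the four-case table above.

def hamming_syndrome(c):
    """c: list indexed 1..7 (c[0] unused) -> 3-bit syndrome as int."""
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    return s4 * 4 + s2 * 2 + s1

def secded_decode(word8):
    """word8: 7 Hamming bits + p0. Returns (status, error position)."""
    c = [0] + list(word8[:7])
    syndrome = hamming_syndrome(c)
    overall = sum(word8) % 2          # p0 counts errors modulo 2
    if syndrome == 0 and overall == 0:
        return "ok", None
    if syndrome == 0 and overall == 1:
        return "p0 flipped", None
    if overall == 1:
        return "single error", syndrome   # correctable
    return "double error", None           # detect, do not correct

code = [0, 1, 1, 0, 0, 1, 1]          # a valid codeword under this scheme
word8 = code + [sum(code) % 2]        # append overall even-parity bit p0
word8[2] ^= 1                         # flip position 3
word8[5] ^= 1                         # flip position 6
print(secded_decode(word8))           # -> ('double error', None)
```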
The Geometric Picture
Hamming arrived at the same place from a different direction: analytic geometry. Represent each n-bit string as a vertex of an n-dimensional hypercube. A single bit flip moves a point one edge-length along one axis. Two flips: two edges. The metric is Hamming distance.
Define a Hamming ball of radius t around a codeword c: all points within t bit-flips of c. For single-error correction, t = 1.
Condition for single-error correction: balls of radius 1 around every pair of distinct codewords must not overlap. If they overlap, a received word in the overlap could belong to either codeword and the decoder fails.
This translates directly to minimum distance: two balls of radius 1 don't overlap if and only if the codewords are at least 3 apart (d_min ≥ 3).
The (7,4) code achieves d_min = 3. Hamming's bound: 2^7 / (1 + 7) = 16 = 2^4. Exactly 16 codewords. A perfect code: the 16 balls of radius 1 tile {0,1}^7 with no gaps or overlaps.
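Both claims are small enough to verify by brute force. A check under the same position scheme (the enumeration approach is mine, not from the source):

```python
# Verify the (7,4) code is perfect: enumerate all 16 codewords,
# confirm minimum pairwise distance 3, and confirm the 16 Hamming
# balls of radius 1 (8 words each) tile all 128 7-bit words.
from itertools import product

def encode(d):
    """Parity bits at positions 1, 2, 4, as in the text."""
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return tuple(c[1:])

def dist(a, b):
    """Hamming distance: number of positions where a and b differ."""
    return sum(x != y for x, y in zip(a, b))

codewords = [encode(list(d)) for d in product([0, 1], repeat=4)]
dmin = min(dist(a, b) for a in codewords for b in codewords if a != b)
print(dmin)                           # -> 3

covered = set()
for c in codewords:
    covered.add(c)
    for i in range(7):                # every word one bit-flip away
        w = list(c)
        w[i] ^= 1
        covered.add(tuple(w))
print(len(covered))                   # -> 128: balls tile {0,1}^7
```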
Engineering Judgments
Hamming was explicit: error-correcting codes involve engineering judgments, not pure mathematics.
Message length: longer messages allow more efficient coding (about log₂ n parity bits for n message bits). But longer messages also accumulate more noise, raising the risk of a double error slipping through as a false single correction.
Level of correction vs. detection: trading one error correction for two extra error detections. The optimal split depends on the channel's noise characteristics.
Field maintenance value: as equipment grows more complex, field technicians cannot diagnose every failure from first principles. A self-diagnosing system dramatically reduces the skill required for maintenance. Hamming called this one of the most important benefits, often more important than the raw reliability gain.
Style: Luck Favors the Prepared Mind
Hamming closed Chapter 12 with a direct challenge. He described the discovery as requiring three to six months of work, mostly at odd moments while carrying his main duties at Bell Labs.
He identified three things that made it possible:
1. Preparation: deep familiarity with parity checks, binary arithmetic, & group theory, before the problem appeared.
2. Reviewing successes: habitually replaying past solutions to internalize their style. The triangle code occurred to him while mentally reviewing the rectangular code on a commute.
3. Not settling for 'looks good': he burned his fingers once by accepting apparent optimality. He pushed until he could prove the code was best.