How IQ Gets Its Normal Distribution
Hamming opens Chapter 29 with a careful dissection of IQ testing.
The claim: intelligence follows a normal distribution in the population. The measurement: plot scores on a cumulative probability scale (probability paper). The scores fall on a straight line, indicating a normal distribution.
The problem Hamming identifies: this is not a discovery. It is a construction. The IQ test is calibrated by taking the raw scores and applying a monotone transformation that forces the cumulative distribution onto the normal probability scale. Then the resulting scores are declared to measure intelligence, which is defined as what the calibrated test measures.
Result: intelligence, defined as what this test measures, is normally distributed. Of course it is — it was designed to be. The normal distribution is not a property of intelligence in the world; it is a property of the calibration procedure.
Hamming's generalization: you get what you measure. The instrument, the calibration procedure, and the definition are not independent. They form a closed loop. What the instrument measures becomes the definition of what is real.
His calculus exam example: he can produce almost any distribution of grades he wants by choosing the difficulty distribution of questions. A uniformly hard exam produces a bimodal distribution (students either know it or do not). A mixed exam produces a bell curve. The distribution is an artifact of the test design, not a discovery about the students.
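The exam example can be simulated with a toy model. The assumptions here are mine, not Hamming's: each student answers every question whose difficulty falls at or below their ability, and abilities are drawn from a bell curve.

```python
import random
import statistics

random.seed(0)

def exam_scores(abilities, difficulties):
    """Score each student: fraction of questions whose difficulty
    falls at or below the student's ability."""
    return [sum(d <= a for d in difficulties) / len(difficulties)
            for a in abilities]

# Hypothetical student abilities, roughly bell-shaped around 0.5.
abilities = [min(1.0, max(0.0, random.gauss(0.5, 0.15))) for _ in range(1000)]

# A uniformly hard exam: every question sits at difficulty 0.5.
hard_exam = [0.5] * 20
# A mixed exam: difficulties spread across the whole range.
mixed_exam = [i / 20 for i in range(20)]

hard_scores = exam_scores(abilities, hard_exam)
mixed_scores = exam_scores(abilities, mixed_exam)

# The uniform exam is all-or-nothing: every score is 0 or 1 (bimodal).
# The mixed exam makes scores track ability, giving a bell shape.
print("hard exam:  stdev =", round(statistics.stdev(hard_scores), 3))
print("mixed exam: stdev =", round(statistics.stdev(mixed_scores), 3))
```

Same students both times; only the question-difficulty distribution changes, and with it the shape of the grade distribution.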
Finding the Circular Loop
Hamming's analysis reveals a three-step circular definition:
1. Design an instrument and calibration procedure.
2. Define the construct as 'what this instrument measures.'
3. Report that the construct has the distributional property designed into the calibration.
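The three-step loop can be made concrete with a small sketch. The calibration here is an assumed rank-based monotone transformation (the standard quantile-normalization trick) applied to a deliberately skewed raw-score distribution; the point is that the output is normal no matter what goes in.

```python
import random
import statistics
from statistics import NormalDist

random.seed(1)

# Step 1: an instrument producing raw scores with a skewed,
# decidedly non-normal distribution (exponential, for illustration).
raw = [random.expovariate(1.0) for _ in range(10000)]

# Steps 2-3: calibrate with a monotone (rank-based) transformation that
# maps each raw score's percentile onto a normal curve (mean 100, sd 15),
# then declare the calibrated score to be the construct.
ranks = {s: i for i, s in enumerate(sorted(raw))}
n = len(raw)
nd = NormalDist(mu=100, sigma=15)
iq = [nd.inv_cdf((ranks[s] + 0.5) / n) for s in raw]

# The "discovery": the calibrated scores are normal by construction.
print("raw mean/median:", round(statistics.mean(raw), 2),
      round(statistics.median(raw), 2))   # skewed: mean > median
print("IQ  mean/median:", round(statistics.mean(iq), 2),
      round(statistics.median(iq), 2))    # symmetric: mean ~ median
```

Any other raw distribution would give the same normal output, which is exactly Hamming's point: the normality belongs to the calibration, not to what is being measured.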
When a Measure Becomes a Target
Hamming's formulation of what is now called Goodhart's law: when you use a measure as a target, it ceases to be a valid measure. The act of targeting corrupts the metric.
The mechanism: before targeting, the metric correlates with the underlying value. After targeting, rational actors optimize the metric directly. The correlation breaks because the easiest way to improve the metric is often to decouple it from the underlying value.
Hamming's cases:
- Body count in Vietnam: used as a measure of military progress. Soldiers optimized body count by counting unverifiable objects. The metric rose; military progress did not.
- GNP growth: used as a measure of economic wellbeing. GNP growth can be achieved by producing things with negative value (pollution cleanup, military buildup, prison construction). The metric became divorced from wellbeing.
- Test scores: used as a measure of learning. Schools teach to the test. Scores rise; understanding of the underlying subject may not.
Hamming's solution: (1) change the metric regularly, before people fully optimize it; (2) use multiple metrics simultaneously — it is harder to optimize all of them at once; (3) never rely on a single metric for any important decision.
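The mechanism can be illustrated with a toy simulation. The noise levels and weights below are illustrative assumptions, not anything from Hamming: before targeting, the metric is an honest noisy readout of value; after targeting, agents pour effort into the gameable component, and the correlation collapses.

```python
import random

random.seed(2)

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# 500 agents, each with some underlying value produced and some
# capacity to game the metric; the two are independent.
value = [random.random() for _ in range(500)]
gaming = [random.random() for _ in range(500)]

# Before targeting: the metric is just a noisy readout of value.
before = [v + random.gauss(0, 0.1) for v in value]

# After targeting: effort shifts to whatever moves the metric,
# so the gameable component dominates the honest one.
after = [0.2 * v + 2.0 * g for v, g in zip(value, gaming)]

print("corr(metric, value) before targeting:", round(corr(before, value), 2))
print("corr(metric, value) after targeting: ", round(corr(after, value), 2))
```

The first correlation is high and the second near zero, even though nothing about the underlying value changed: only the incentive did.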
Identify the Corruption Mechanism
A software organization measures developer productivity by counting lines of code (LOC) written per week. Initially, LOC correlates with productivity — active developers write more code than inactive ones.
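A minimal sketch of the corruption mechanism (hypothetical functions, not from the text): once LOC becomes the target, the same behavior can be delivered at several times the line count, so the rational move is to pad.

```python
# Two functionally identical implementations of the same task.
# By the LOC metric, the second developer is several times as
# "productive" while shipping exactly the same behavior.

def total_concise(xs):
    return sum(x for x in xs if x > 0)

def total_padded(xs):
    # Padded version: same behavior, many more billable lines.
    total = 0
    index = 0
    length = len(xs)
    while index < length:
        current = xs[index]
        if current > 0:
            total = total + current
        index = index + 1
    return total

data = [3, -1, 4, -1, 5]
print(total_concise(data), total_padded(data))  # same answer either way
```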
The Dynamic Range Problem
Hamming raises a subtle measurement problem: rating scales have dynamic range, and most people do not use it.
Example: a 1-10 scale where 5 is average. Most raters use 4, 5, and 6, never venturing out to 1 or 10. The dynamic range of their ratings is effectively 3 grades (4 through 6), even though the scale provides 10.
The consequence: a rater who uses the full range has 3× the influence on an averaged rating as one who compresses to the middle. If you rate something you dislike as 2 (full range) while the other rater gives what they like a 6 (compressed range), the average is 4: your dislike outweighs their like, even though both ratings carry equal nominal weight in the average.
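The arithmetic of that example, made explicit:

```python
# Hypothetical ratings of the same item on a 1-10 scale (midpoint 5).
full_range_rater = 2   # dislikes it, and uses the whole scale
compressed_rater = 6   # likes it, but stays in the 4-6 comfort zone

average = (full_range_rater + compressed_rater) / 2
print(average)  # 4.0: below the midpoint, so the dislike wins

# Deviations from the midpoint show the 3x influence asymmetry.
print(abs(full_range_rater - 5), "vs", abs(compressed_rater - 5))  # 3 vs 1
```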
Hamming's information theory connection: the entropy (average surprise) of a distribution is maximized when the distribution is uniform. A rating scale where all grades are used equally communicates the maximum information. A scale where most ratings cluster at 5 communicates very little — the ratings carry nearly no information.
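The entropy claim can be checked directly. The 80/10/10 compressed distribution below is an illustrative assumption for a rater who clusters at 5:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Uniform use of a 10-point scale: maximum information per rating.
uniform = [0.1] * 10

# Compressed use: 80% of ratings are a 5, the rest split over 4 and 6.
compressed = [0, 0, 0, 0.1, 0.8, 0.1, 0, 0, 0, 0]

print(round(entropy(uniform), 2))     # log2(10) ~ 3.32 bits per rating
print(round(entropy(compressed), 2))  # ~ 0.92 bits per rating
```

The compressed rater conveys less than a third of the information per rating that a full-range rater does.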
His practical advice: use the entire dynamic range of any scale you are assigned. If you are given a scale from 1 to 10, do not compress it into a narrow middle band. Doing so reduces your influence and reduces the information content of your ratings.
Information and Dynamic Range
Two professors grade on a 0-100 scale. Professor A uses only the range 70-90 (compresses to 20 points). Professor B uses the full range 0-100 (uses 100 points). Assume each professor's grade distribution is uniform within their used range.
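A worked sketch of the comparison, under the added assumption that grades are integers used uniformly within each professor's range (so entropy is just the log of the number of distinct grades):

```python
from math import log2

# Professor A compresses to 70-90; Professor B uses the full 0-100.
grades_A = list(range(70, 91))    # 21 possible grades
grades_B = list(range(0, 101))    # 101 possible grades

H_A = log2(len(grades_A))
H_B = log2(len(grades_B))

print(round(H_A, 2))         # ~ 4.39 bits per grade
print(round(H_B, 2))         # ~ 6.66 bits per grade
print(round(H_B - H_A, 2))   # B conveys ~ 2.27 more bits per grade
```

By the same argument as the rating-scale case, Professor B's grades carry more than two extra bits of information per student.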