How IQ Gets Its Normal Distribution
Hamming opens Chapter 29 with a careful dissection of IQ testing.
The claim: intelligence follows a normal distribution in the population. The measurement: plot scores on a cumulative probability scale (probability paper). The scores fall on a straight line, indicating a normal distribution.
The problem Hamming identifies: this is not a discovery. It is a construction. The IQ test is calibrated by taking the raw scores and applying a monotone transformation that forces the cumulative distribution onto the normal probability scale. Then the resulting scores are declared to measure intelligence, which is defined as what the calibrated test measures.
Result: intelligence, defined as what this test measures, is normally distributed. Of course it is — it was designed to be. The normal distribution is not a property of intelligence in the world; it is a property of the calibration procedure.
Hamming's generalization: you get what you measure. The instrument, the calibration procedure, and the definition are not independent. They form a closed loop. What the instrument measures becomes the definition of what is real.
His calculus exam example: he can produce almost any distribution of grades he wants by choosing the difficulty distribution of questions. A uniformly hard exam produces a bimodal distribution (students either know it or do not). A mixed exam produces a bell curve. The distribution is an artifact of the test design, not a discovery about the students.
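The exam example can be simulated with a toy model. The assumptions here are mine, not Hamming's: each student answers every question whose difficulty falls at or below their ability, and abilities are drawn from a bell curve.

```python
import random
import statistics

random.seed(0)

def exam_scores(abilities, difficulties):
    """Score each student: fraction of questions whose difficulty
    falls at or below the student's ability."""
    return [sum(d <= a for d in difficulties) / len(difficulties)
            for a in abilities]

# Hypothetical student abilities, roughly bell-shaped around 0.5.
abilities = [min(1.0, max(0.0, random.gauss(0.5, 0.15))) for _ in range(1000)]

# A uniformly hard exam: every question sits at difficulty 0.5.
hard_exam = [0.5] * 20
# A mixed exam: difficulties spread across the whole range.
mixed_exam = [i / 20 for i in range(20)]

hard_scores = exam_scores(abilities, hard_exam)
mixed_scores = exam_scores(abilities, mixed_exam)

# The uniform exam is all-or-nothing: every score is 0 or 1 (bimodal).
# The mixed exam makes scores track ability, giving a bell shape.
print("hard exam:  stdev =", round(statistics.stdev(hard_scores), 3))
print("mixed exam: stdev =", round(statistics.stdev(mixed_scores), 3))
```

Same students both times; only the question-difficulty distribution changes, and with it the shape of the grade distribution.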
Finding the Circular Loop
Hamming's analysis reveals a three-step circular definition:
1. Design an instrument and calibration procedure.
2. Define the construct as 'what this instrument measures.'
3. Report that the construct has the distributional property designed into the calibration.
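The three-step loop can be made concrete with a small sketch. The calibration here is an assumed rank-based monotone transformation (the standard quantile-normalization trick) applied to a deliberately skewed raw-score distribution; the point is that the output is normal no matter what goes in.

```python
import random
import statistics
from statistics import NormalDist

random.seed(1)

# Step 1: an instrument producing raw scores with a skewed,
# decidedly non-normal distribution (exponential, for illustration).
raw = [random.expovariate(1.0) for _ in range(10000)]

# Steps 2-3: calibrate with a monotone (rank-based) transformation that
# maps each raw score's percentile onto a normal curve (mean 100, sd 15),
# then declare the calibrated score to be the construct.
ranks = {s: i for i, s in enumerate(sorted(raw))}
n = len(raw)
nd = NormalDist(mu=100, sigma=15)
iq = [nd.inv_cdf((ranks[s] + 0.5) / n) for s in raw]

# The "discovery": the calibrated scores are normal by construction.
print("raw mean/median:", round(statistics.mean(raw), 2),
      round(statistics.median(raw), 2))   # skewed: mean > median
print("IQ  mean/median:", round(statistics.mean(iq), 2),
      round(statistics.median(iq), 2))    # symmetric: mean ~ median
```

Any other raw distribution would give the same normal output, which is exactly Hamming's point: the normality belongs to the calibration, not to what is being measured.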
When a Measure Becomes a Target
Hamming's formulation of what is now called Goodhart's law: when you use a measure as a target, it ceases to be a valid measure. The act of targeting corrupts the metric.
The mechanism: before targeting, the metric correlates with the underlying value. After targeting, rational actors optimize the metric directly. The correlation breaks because the easiest way to improve the metric is often to decouple it from the underlying value.
Hamming's cases:
- Body count in Vietnam: used as a measure of military progress. Soldiers optimized body count by counting unverifiable objects. The metric rose; military progress did not.
- GNP growth: used as a measure of economic wellbeing. GNP growth can be achieved by producing things with negative value (pollution cleanup, military buildup, prison construction). The metric became divorced from wellbeing.
- Test scores: used as a measure of learning. Schools teach to the test. Scores rise; understanding of the underlying subject may not.
Hamming's solution: (1) change the metric regularly, before people fully optimize it; (2) use multiple metrics simultaneously — it is harder to optimize all of them at once; (3) never rely on a single metric for any important decision.
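The mechanism can be illustrated with a toy simulation. The noise levels and weights below are illustrative assumptions, not anything from Hamming: before targeting, the metric is an honest noisy readout of value; after targeting, agents pour effort into the gameable component, and the correlation collapses.

```python
import random

random.seed(2)

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# 500 agents, each with some underlying value produced and some
# capacity to game the metric; the two are independent.
value = [random.random() for _ in range(500)]
gaming = [random.random() for _ in range(500)]

# Before targeting: the metric is just a noisy readout of value.
before = [v + random.gauss(0, 0.1) for v in value]

# After targeting: effort shifts to whatever moves the metric,
# so the gameable component dominates the honest one.
after = [0.2 * v + 2.0 * g for v, g in zip(value, gaming)]

print("corr(metric, value) before targeting:", round(corr(before, value), 2))
print("corr(metric, value) after targeting: ", round(corr(after, value), 2))
```

The first correlation is high and the second near zero, even though nothing about the underlying value changed: only the incentive did.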
Identify the Corruption Mechanism
A software organization measures developer productivity by counting lines of code (LOC) written per week. Initially, LOC correlates with productivity — active developers write more code than inactive ones.
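A minimal sketch of the corruption mechanism (hypothetical functions, not from the text): once LOC becomes the target, the same behavior can be delivered at several times the line count, so the rational move is to pad.

```python
# Two functionally identical implementations of the same task.
# By the LOC metric, the second developer is several times as
# "productive" while shipping exactly the same behavior.

def total_concise(xs):
    return sum(x for x in xs if x > 0)

def total_padded(xs):
    # Padded version: same behavior, many more billable lines.
    total = 0
    index = 0
    length = len(xs)
    while index < length:
        current = xs[index]
        if current > 0:
            total = total + current
        index = index + 1
    return total

data = [3, -1, 4, -1, 5]
print(total_concise(data), total_padded(data))  # same answer either way
```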
The Dynamic Range Problem
Hamming raises a subtle measurement problem: rating scales have dynamic range, and most people do not use it.
Example: a 1-10 scale where 5 is average. Most raters use 4, 5, and 6, never venturing out to 1 or 10. The dynamic range of their ratings is effectively 3 grades (4 through 6), even though the scale provides 10.
The consequence: a rater who uses the full range has 3× the influence on an averaged rating as one who compresses to the middle. If you rate something you dislike as 2 (full range) while the other rater gives what they like a 6 (compressed range), the average is 4: your dislike outweighs their like, even though both ratings carry equal nominal weight in the average.
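The arithmetic of that example, made explicit:

```python
# Hypothetical ratings of the same item on a 1-10 scale (midpoint 5).
full_range_rater = 2   # dislikes it, and uses the whole scale
compressed_rater = 6   # likes it, but stays in the 4-6 comfort zone

average = (full_range_rater + compressed_rater) / 2
print(average)  # 4.0: below the midpoint, so the dislike wins

# Deviations from the midpoint show the 3x influence asymmetry.
print(abs(full_range_rater - 5), "vs", abs(compressed_rater - 5))  # 3 vs 1
```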
Hamming's information theory connection: the entropy (average surprise) of a distribution is maximized when the distribution is uniform. A rating scale where all grades are used equally communicates the maximum information. A scale where most ratings cluster at 5 communicates very little — the ratings carry nearly no information.
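The entropy claim can be checked directly. The 80/10/10 compressed distribution below is an illustrative assumption for a rater who clusters at 5:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Uniform use of a 10-point scale: maximum information per rating.
uniform = [0.1] * 10

# Compressed use: 80% of ratings are a 5, the rest split over 4 and 6.
compressed = [0, 0, 0, 0.1, 0.8, 0.1, 0, 0, 0, 0]

print(round(entropy(uniform), 2))     # log2(10) ~ 3.32 bits per rating
print(round(entropy(compressed), 2))  # ~ 0.92 bits per rating
```

The compressed rater conveys less than a third of the information per rating that a full-range rater does.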
His practical advice: use the entire dynamic range of any scale you are assigned. If you are given a scale from 1 to 10, do not compress it into a narrow middle band. Doing so reduces your influence and reduces the information content of your ratings.
Information and Dynamic Range
Two professors grade on a 0-100 scale. Professor A uses only the range 70-90 (compresses to 20 points). Professor B uses the full range 0-100 (uses 100 points). Assume each professor's grade distribution is uniform within their used range.
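A worked sketch of the comparison, under the added assumption that grades are integers used uniformly within each professor's range (so entropy is just the log of the number of distinct grades):

```python
from math import log2

# Professor A compresses to 70-90; Professor B uses the full 0-100.
grades_A = list(range(70, 91))    # 21 possible grades
grades_B = list(range(0, 101))    # 101 possible grades

H_A = log2(len(grades_A))
H_B = log2(len(grades_B))

print(round(H_A, 2))         # ~ 4.39 bits per grade
print(round(H_B, 2))         # ~ 6.66 bits per grade
print(round(H_B - H_A, 2))   # B conveys ~ 2.27 more bits per grade
```

By the same argument as the rating-scale case, Professor B's grades carry more than two extra bits of information per student.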