Mean, Variance, and Bias
Every measurement x_i of a true value μ can be written as: x_i = μ + β + ε_i, where β is the systematic error (bias, constant across measurements) and ε_i is the random error (different for each measurement, drawn from a distribution with mean 0).
Random error: E[ε_i] = 0, Var[ε_i] = σ². The sample mean x̄ = (1/n) Σ x_i has expected value μ + β and variance σ²/n. As n → ∞, x̄ → μ + β (not μ). The random error goes to zero; the bias does not.
Systematic error: β ≠ 0, constant. The mean of any number of measurements is μ + β. To remove bias, you need calibration (an independent measurement of β), not more repetitions.
Geometrically: imagine the distribution of measurements as a bell curve. Random error controls the width (variance). Systematic error controls the location of the center (the mean is shifted from the true value by β).
The stated uncertainty in a measurement is usually an estimate of σ (random error only). If β is large and undetected, the stated uncertainty is meaningless — it quantifies the noise in a biased instrument.
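A quick simulation makes the point concrete. This is an illustrative sketch, not from the source: the values mu = 10.0, beta = 0.3, sigma = 1.0, and the sample sizes are made up for the demonstration.

```python
import random

# Hypothetical numbers for illustration: x_i = mu + beta + eps_i,
# with eps_i ~ Normal(0, sigma).
mu, beta, sigma = 10.0, 0.3, 1.0
random.seed(42)

for n in (10, 1_000, 100_000):
    xs = [mu + beta + random.gauss(0.0, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    # As n grows, xbar converges to mu + beta = 10.3, not to mu = 10.0:
    # the random error averages away; the bias does not.
    print(f"n = {n:>7}: sample mean = {xbar:.3f}")
```

No matter how large n gets, the sample mean settles near 10.3, a tenth away from the true value, exactly as the x̄ → μ + β limit predicts.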
Bias vs Variance Calculation
A laboratory measures the local gravitational acceleration g. Their instrument has a systematic calibration error of β = +0.05 m/s². Their random measurement error has standard deviation σ = 0.02 m/s². They take n = 100 measurements.
True value: g = 9.80 m/s².
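A sketch of the calculation, using only the formulas from the previous section (E[x̄] = μ + β and Var[x̄] = σ²/n):

```python
import math

# Numbers from the exercise; the formulas are the standard ones above.
g_true = 9.80      # m/s^2, true value
beta = 0.05        # m/s^2, systematic (calibration) error
sigma = 0.02       # m/s^2, random error per measurement
n = 100

expected_mean = g_true + beta      # E[x-bar] = mu + beta = 9.85
sem = sigma / math.sqrt(n)         # standard error of the mean = 0.002

print(f"E[mean]  = {expected_mean:.3f} m/s^2")
print(f"SE[mean] = {sem:.4f} m/s^2")
# Averaging 100 measurements shrank the random error tenfold, but the
# 0.05 bias is untouched: the reported 9.850 +/- 0.002 is wrong by
# 25 standard errors.
```

This is the bias trap in miniature: the stated uncertainty (0.002) is honest about the noise and silent about the 0.05 offset.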
How Errors Move Through Calculations
When you compute a quantity z = f(x, y) from measured quantities x and y, their measurement errors propagate into z.
Error propagation formula (first-order Taylor expansion):
σ²_z ≈ (∂f/∂x)² σ²_x + (∂f/∂y)² σ²_y
(This assumes x and y errors are independent. If correlated, add 2 · (∂f/∂x)(∂f/∂y) · Cov(x,y).)
Key insight: the partial derivatives act as amplifiers. If ∂f/∂x is large, small errors in x produce large errors in z.
This means choosing a calculation method that minimizes the partial derivatives is a real engineering objective — not just algorithmic convenience. Hamming was acutely aware of this in his numerical analysis work.
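The formula is easy to mechanize. Below is a minimal sketch, not from the source: the helper names (grad2, propagate), the central-difference step h, and the example function z = x/y with x = 10 ± 0.1 and y = 2 ± 0.05 are all choices made for this illustration.

```python
# First-order error propagation for z = f(x, y), with the partial
# derivatives estimated by central differences.
def grad2(f, x, y, h=1e-6):
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

def propagate(f, x, y, sx, sy, cov=0.0):
    dfdx, dfdy = grad2(f, x, y)
    # var_z = (df/dx)^2 sx^2 + (df/dy)^2 sy^2 + 2 (df/dx)(df/dy) cov
    var_z = (dfdx * sx) ** 2 + (dfdy * sy) ** 2 + 2 * dfdx * dfdy * cov
    return var_z ** 0.5

# Hypothetical example: z = x / y.  Here df/dy = -x/y^2 = -2.5, so the
# small sy = 0.05 is amplified far more than sx = 0.1 (df/dx = 0.5).
sz = propagate(lambda x, y: x / y, 10.0, 2.0, 0.1, 0.05)
print(f"sigma_z = {sz:.4f}")
```

Note how the derivative, not the raw uncertainty, decides which input dominates: y's error is half the size of x's but contributes over six times the variance.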
Propagation Through a Product
You measure two lengths: L₁ = 10.0 m ± 0.1 m (σ₁ = 0.1) and L₂ = 5.0 m ± 0.2 m (σ₂ = 0.2). You compute area A = L₁ × L₂.
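A worked sketch of this exercise, applying the first-order formula with ∂A/∂L₁ = L₂ and ∂A/∂L₂ = L₁:

```python
import math

# Numbers from the exercise.
L1, s1 = 10.0, 0.1   # metres
L2, s2 = 5.0, 0.2    # metres

A = L1 * L2                               # 50.0 m^2
# dA/dL1 = L2, dA/dL2 = L1
var_A = (L2 * s1) ** 2 + (L1 * s2) ** 2   # 0.25 + 4.00 = 4.25
sigma_A = math.sqrt(var_A)                # about 2.06 m^2

# Equivalent rule for products: relative errors add in quadrature.
rel = math.hypot(s1 / L1, s2 / L2)        # about 4.1%

print(f"A = {A:.1f} +/- {sigma_A:.2f} m^2")
# The 0.2 m error on the shorter length dominates (4.00 of 4.25),
# because its partial derivative L1 = 10 is the larger amplifier.
```

The result, A = 50.0 ± 2.1 m², shows the amplifier effect from the previous section: the less precise measurement is also the one multiplied by the larger derivative.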
When Data Fits Too Well
Chi-squared goodness-of-fit test: given n observations O_i and model predictions E_i, compute:
χ² = Σ (O_i − E_i)² / E_i
If the model is correct and each observation has variance E_i (as for Poisson-distributed counts), the expected value of χ² is approximately ν = (number of data points) − (number of fitted parameters), called the degrees of freedom. When each point instead carries a stated measurement uncertainty σ_i, the same statistic is χ² = Σ (O_i − E_i)² / σ_i².
The reduced chi-squared χ²/ν should be approximately 1.0 if the data fits the model with the expected amount of scatter.
- χ²/ν >> 1: data varies more than expected — model is wrong, or uncertainties are underestimated.
- χ²/ν << 1: data varies less than expected — suspiciously clean.
The suspicious case: if your measurements have σ = 0.1 but the data all fall within ±0.01 of the model curve, someone has selectively kept the 'good' measurements. This is confirmation bias in action: discarding data that disagrees and retaining data that agrees.
Hamming cites Millikan's oil drop experiment: the Nobel Prize-winning measurement of the electron charge. Later analysis of Millikan's laboratory notebooks revealed he applied undocumented judgment to discard 'outlier' measurements — and the retained measurements fit suspiciously well.
Compute and Interpret Reduced Chi-Squared
A student fits a linear model y = ax + b to 10 data points, estimating 2 parameters (a and b). The stated measurement uncertainty for each point is σ = 0.5. The residuals (O_i − E_i) from the fit are: 0.08, −0.12, 0.05, −0.09, 0.11, −0.07, 0.04, −0.03, 0.10, −0.06.
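A sketch of the computation, assuming each point has the stated uncertainty σ = 0.5, so χ² = Σ rᵢ²/σ²:

```python
# Residuals and uncertainty from the exercise.
residuals = [0.08, -0.12, 0.05, -0.09, 0.11,
             -0.07, 0.04, -0.03, 0.10, -0.06]
sigma = 0.5
n_points, n_params = 10, 2

chi2 = sum(r * r for r in residuals) / sigma ** 2   # 0.0645 / 0.25
nu = n_points - n_params                            # 8 degrees of freedom
reduced = chi2 / nu

print(f"chi2 = {chi2:.3f}, chi2/nu = {reduced:.3f}")
# chi2/nu is about 0.03, far below 1: the residual scatter (~0.08) is
# roughly six times smaller than the stated sigma = 0.5 implies.
# Either the uncertainties are grossly overestimated or the data have
# been selectively trimmed.
```

This is exactly the χ²/ν << 1 case flagged above: the fit is not "excellent", it is too clean to be consistent with the claimed measurement error.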