Probably Approximately Correct
Valiant's Question (1984)
Leslie Valiant asked a deceptively simple question: what does it mean for a machine to learn? Not 'can it memorize?' Not 'can it predict perfectly?' Instead: can it learn approximately well, with high probability, from a finite sample, in polynomial time?
That framing won him a Turing Award in 2010 & founded computational learning theory.
Two Knobs
ε (epsilon) — error tolerance. Approximately correct means error ≤ ε. Smaller ε = stricter learning.
δ (delta) — confidence parameter. Probably means probability of success ≥ 1 − δ. Smaller δ = more confidence required.
Definition
A concept class C counts as PAC-learnable if some algorithm exists where, for any data distribution D & any target concept c ∈ C, given m samples drawn from D & labeled by c, our algorithm returns hypothesis h satisfying:
P[error(h) ≤ ε] ≥ 1 − δ
with m polynomial in 1/ε, 1/δ, & concept size.
In English: feed it enough examples & it gets close enough often enough, & 'enough' doesn't blow up exponentially.
Sample Complexity for Finite Hypothesis Spaces
If our hypothesis space H holds finitely many candidate hypotheses, classical analysis gives:
m ≥ (1/ε)(ln|H| + ln(1/δ))
Read that formula as a budget. Halving ε doubles required samples. Halving δ adds a constant. Doubling |H| adds ln(2) samples — log scaling, beautifully tame growth.
Sizing a Sample for a Hypothesis Class
A spam filter chooses among |H| = 1,000,000 candidate rule sets. We want error ε ≤ 0.05 with confidence 1 − δ = 0.99 (so δ = 0.01).
Shattering & VC Dimension
When Hypothesis Spaces Run Infinite
The bound m ≥ (1/ε)(ln|H| + ln(1/δ)) breaks when |H| = ∞. Most useful hypothesis classes (linear classifiers in ℝᵈ, decision trees, neural nets) hold infinitely many candidates. Vapnik & Chervonenkis solved this around 1971 with a richer measure of complexity: VC dimension.
Shattering
A hypothesis class H shatters a set of n points if H can produce every possible labeling of those n points (all 2ⁿ dichotomies). Linear classifiers in ℝ² shatter any 3 points in general position: for each of 8 possible +/− labelings of those 3 points, some line separates them correctly.
But linear classifiers in ℝ² cannot shatter 4 points arranged in an XOR configuration. No straight line separates the diagonal pair from the anti-diagonal pair.
VC Dimension
VC(H) = the largest n such that H shatters some n-point set. For 2D linear classifiers: VC = 3. For axis-aligned rectangles in 2D: VC = 4. For neural networks with W weights: VC ≤ O(W² log W) (Bartlett & Anthony 1999).
Sample Complexity with VC Dimension
Replace ln|H| in our PAC bound with VC dimension d:
m = O((d/ε) log(1/ε) + (1/ε) log(1/δ))
VC dimension acts as our 'effective complexity' of an infinite class. A hypothesis class with low VC dimension generalizes from few samples; a class with high VC dimension demands more data.
VC Dimension as Effective Hypothesis Count
A neural network has W = 10⁹ weights. By the Bartlett-Anthony bound, VC ≤ O(W² log W). Approximate VC ≈ W² log₂ W.
Dropping Realizability, Posteriors over Hypotheses
Two Important Generalizations
Classical PAC assumes our target concept c lives inside our hypothesis class H. Real data carries noise, mislabels, & concepts our class cannot represent. Two extensions handle this.
Agnostic PAC
Drop our assumption that c ∈ H. Now compete against our best-in-class hypothesis h* = argmin_{h ∈ H} risk(h). Bound shape changes: instead of getting close to perfect, we get close to best-possible-within-H.
Agnostic bound: P[risk(h) ≤ risk(h*) + ε] ≥ 1 − δ with sample complexity m = O(1/ε² × (VC(H) + log(1/δ))). Note ε² instead of ε in our denominator: agnostic learning needs more samples for our same precision.
PAC-Bayes (McAllester 1999)
Move from picking a single hypothesis to picking a distribution over hypotheses. Start with prior P. Observe data. Derive posterior Q. PAC-Bayes bounds expected risk under Q:
𝔼_{h~Q}[risk(h)] ≤ 𝔼_{h~Q}[empirical_risk(h)] + √[(KL(Q‖P) + ln(2√n/δ)) / 2n]
KL(Q‖P) measures how far our posterior moved from our prior. Two interpretations:
1. Compression view. A posterior close to its prior (small KL) generalizes well — small information cost = small generalization gap.
2. Bayesian view. Strong prior + weak data update = tight bound; weak prior + heavy data update = looser bound.
Why PAC-Bayes matters. Empirical PAC-Bayes bounds (Catoni 2007, Dziugaite & Roy 2017) give numerically meaningful generalization guarantees for real neural networks where classical PAC bounds run vacuous. They remain our tightest theoretical handle on overparameterized generalization.
Reading a PAC-Bayes Bound
Suppose we train a network on n = 50,000 labeled examples. After training, our posterior Q over weights has KL(Q‖P) = 200 nats relative to a Gaussian prior P. Empirical risk under Q runs at 0.10. Set δ = 0.05.
Overparameterization & Double Descent
Classical PAC's Predicted Disaster
Classical PAC predicts: more parameters than samples = catastrophic overfitting. Learn perfectly on training data, generalize randomly on test data. VC bound predicts memorization without learning.
Modern neural networks routinely have 10× to 1000× more parameters than training samples & still generalize. This shouldn't happen by classical theory. It happens anyway.
The U-Curve We Were Taught
Classical bias-variance: as model capacity grows, training error falls monotonically; test error first falls (less underfit), reaches a minimum, then rises (overfit). U-shape predicted by VC theory.
What Actually Happens — Double Descent
Belkin et al (2019) showed test error follows our classical U-curve up to interpolation threshold (capacity = #samples), then DROPS AGAIN past that threshold. Massively overparameterized models generalize better than just-large-enough models.
Why Classical PAC Misses This
1. Distribution-free assumption is too pessimistic. Real data has structure (low intrinsic dimension, manifold geometry). PAC bounds hold over worst-case distributions; real distributions exploit structure SGD also exploits.
2. Implicit regularization. SGD on overparameterized networks finds minimum-norm or minimum-complexity interpolating solutions, not arbitrary ones. Classical theory assumes worst-case empirical risk minimizer; real algorithms pick benign ones.
3. Hypothesis class definition is wrong. Effective hypothesis class explored by SGD is far smaller than nominal weight space. PAC-Bayes posteriors capture this; VC dimension does not.
Modern theoretical work (Bartlett, Long, Lugosi, Tsigler 2020 on benign overfitting; Nakkiran et al 2020 on double descent) is building post-PAC generalization theory that accounts for our overparameterized regime.
Diagnosing a Failure of Classical PAC
A research team trains a 1-billion-parameter network on 100,000 labeled examples. Classical PAC predicts vacuous bounds. Empirically, test accuracy reaches 95%.
Kaplan, Chinchilla, & Sizing Automated General Intelligence
From Bounds to Empirical Scaling Laws
Classical PAC predicts dataset size theoretically & runs vacuous. Empirical scaling laws predict dataset size from observation & actually work. They have replaced PAC as our practical sizing tool for large models.
Kaplan et al (2020)
Loss follows a power law in parameters N, dataset tokens D, & compute C:
L(N) ≈ (Nc/N)^αN L(D) ≈ (Dc/D)^αD L(C) ≈ (Cc/C)^αC
Doubling parameters drops loss by a fixed multiplicative factor. Doubling data drops loss by another fixed factor. These exponents (αN, αD, αC) measure & predict frontier behavior across many orders of magnitude.
Hoffmann et al 2022 (Chinchilla)
Chinchilla showed Kaplan's exponents over-weighted parameters relative to data. Compute-optimal training requires roughly 20 tokens per parameter:
D_opt ≈ 20 × N
GPT-3 (175B params) trained on ~300B tokens — vastly under-trained by Chinchilla math. A compute-optimal 175B-parameter model wants ~3.5 trillion tokens.
The Data Wall
Scaling our parameter count requires scaling our token count proportionally. Public web crawl yields ~10¹³ useful tokens. A hypothetical 10¹⁵-parameter automated general intelligence would want ~2×10¹⁶ tokens — three orders of magnitude beyond available web data.
This is our data wall. Scaling laws predict we hit a corpus shortage long before we hit a compute shortage. Synthetic data, multimodal corpora (video, audio, sensor streams), & RL-from-environment all aim to push past it.
Classical PAC told us (incorrectly) that we needed 10²¹ samples. Scaling laws tell us (calibrated to reality) that we need 2×10¹⁶. Both numbers exceed available text. Frontier work today wrestles with closing that gap.
Estimating an Automated General Intelligence Corpus
Suppose a frontier lab proposes a 10¹⁴-parameter model & claims it will reach automated general intelligence. We want to size its training corpus by Chinchilla's rule.
From Theoretical Bounds to Production Measurements
Stop Computing Bounds; Start Measuring Them
Classical PAC bounds run vacuous. PAC-Bayes bounds run tight under assumptions hard to verify. A practical alternative: measure ε empirically on your real distribution & update it as your system runs.
Idea 1 — Make Coverage as Empirical PAC
A make coverage target runs N held-out questions through your query system & measures two rates:
- cache_hit_rate — fraction your system found a known answer for
- i_dont_know_rate — fraction your system honestly deferred
Each measurement = a Bernoulli trial. From observed counts, compute a Wilson confidence interval on true rate. Example: N = 1000 queries, observed i_dont_know_rate = 0.20 → 95% CI ≈ [0.177, 0.226]. Upper bound 0.226 functions exactly like a PAC ε at δ = 0.05, derived from real data on real distribution rather than worst-case theoretical analysis.
Classical PAC vocabulary applies (ε, δ, confidence). Different machinery (binomial concentration vs. VC theory). Tighter result because real distributions carry exploitable structure.
Idea 2 — PAC-Bayes Audit via Falsification Events
Treat each falsification (system answer demonstrably wrong) as evidence updating posterior over true error rate ε:
1. Prior: ε ~ Beta(α₀, β₀). Pick weak prior, e.g., Beta(1, 1) = uniform.
2. Each query produces a Bernoulli outcome: falsified (1) or held (0).
3. Posterior after k falsifications in n queries: Beta(α₀ + k, β₀ + n − k).
4. Posterior mean: (α₀ + k) / (α₀ + β₀ + n) → empirical falsification rate as n → ∞.
5. 95% credible interval on ε tightens as 1/√n.
What This Buys Us
- Real-world ε₀ estimate from actual deployment, not worst-case theory
- Anytime-valid alarm: when posterior credible interval crosses contract threshold, page someone
- Monotone confidence: more queries observed → tighter CI → stronger guarantee
Cousin techniques: conformal prediction with online recalibration (Vovk), sequential probability ratio tests (Wald), online PAC-Bayes (Haddouche & Guedj 2022). Same family, different mathematical machinery.
Computing a Beta Posterior on Falsifications
Our team runs a coverage audit on a production query system. Prior on true error rate ε: Beta(1, 1) (uniform). After 200 held-out queries we observed 8 falsifications.