English· Español· Deutsch· Nederlands· Français· 日本語· ქართული· 繁體中文· 简体中文· Português· Русский· العربية· हिन्दी· Italiano· 한국어· Polski· Svenska· Türkçe· Українська· Tiếng Việt· Bahasa Indonesia

un

ضيف
1 / ?

Probably Approximately Correct

Valiant's Question (1984)

Leslie Valiant asked a deceptively simple question: what does it mean for a machine to learn? Not 'can it memorize?' Not 'can it predict perfectly?' Instead: can it learn approximately well, with high probability, from a finite sample, in polynomial time?


That framing won him a Turing Award in 2010 & founded computational learning theory.


PAC ε δ Budget Plane


Two Knobs


ε (epsilon) — error tolerance. Approximately correct means error ≤ ε. Smaller ε = stricter learning.


δ (delta) — confidence parameter. Probably means probability of success ≥ 1 − δ. Smaller δ = more confidence required.


Definition

A concept class C counts as PAC-learnable if some algorithm exists where, for any data distribution D & any target concept c ∈ C, given m samples drawn from D & labeled by c, our algorithm returns hypothesis h satisfying:


P[error(h) ≤ ε] ≥ 1 − δ


with m polynomial in 1/ε, 1/δ, & concept size.


In English: feed it enough examples & it gets close enough often enough, & 'enough' doesn't blow up exponentially.


Sample Complexity for Finite Hypothesis Spaces

If our hypothesis space H holds finitely many candidate hypotheses, classical analysis gives:


m ≥ (1/ε)(ln|H| + ln(1/δ))


Read that formula as a budget. Halving ε doubles required samples. Halving δ adds a constant. Doubling |H| adds ln(2) samples — log scaling, beautifully tame growth.

Sizing a Sample for a Hypothesis Class

A spam filter chooses among |H| = 1,000,000 candidate rule sets. We want error ε ≤ 0.05 with confidence 1 − δ = 0.99 (so δ = 0.01).

Apply the finite-class PAC sample-complexity bound m ≥ (1/ε)(ln|H| + ln(1/δ)) to compute a sufficient sample size m. Show the substitution of all three values (ε, |H|, δ) & approximate ln values to one decimal. Round m up to a whole number.

Shattering & VC Dimension

When Hypothesis Spaces Run Infinite

The bound m ≥ (1/ε)(ln|H| + ln(1/δ)) breaks when |H| = ∞. Most useful hypothesis classes (linear classifiers in ℝᵈ, decision trees, neural nets) hold infinitely many candidates. Vapnik & Chervonenkis solved this around 1971 with a richer measure of complexity: VC dimension.


VC Shattering Three Points


Shattering

A hypothesis class H shatters a set of n points if H can produce every possible labeling of those n points (all 2ⁿ dichotomies). Linear classifiers in ℝ² shatter any 3 points in general position: for each of 8 possible +/− labelings of those 3 points, some line separates them correctly.


But linear classifiers in ℝ² cannot shatter 4 points arranged in an XOR configuration. No straight line separates the diagonal pair from the anti-diagonal pair.


VC Dimension

VC(H) = the largest n such that H shatters some n-point set. For 2D linear classifiers: VC = 3. For axis-aligned rectangles in 2D: VC = 4. For neural networks with W weights: VC ≤ O(W² log W) (Bartlett & Anthony 1999).


Sample Complexity with VC Dimension

Replace ln|H| in our PAC bound with VC dimension d:


m = O((d/ε) log(1/ε) + (1/ε) log(1/δ))


VC dimension acts as our 'effective complexity' of an infinite class. A hypothesis class with low VC dimension generalizes from few samples; a class with high VC dimension demands more data.

VC Dimension as Effective Hypothesis Count

A neural network has W = 10⁹ weights. By the Bartlett-Anthony bound, VC ≤ O(W² log W). Approximate VC ≈ W² log₂ W.

Estimate VC for W = 10⁹. Then plug into the VC sample bound m ≈ (d/ε) ignoring log factors, with ε = 0.01. Compare m to the size of all text on our public internet (~10¹³ tokens). State whether classical PAC predicts our hypothesis class can be PAC-learned from internet-scale data.

Dropping Realizability, Posteriors over Hypotheses

Two Important Generalizations

Classical PAC assumes our target concept c lives inside our hypothesis class H. Real data carries noise, mislabels, & concepts our class cannot represent. Two extensions handle this.


PAC Bayes Posterior over Hypothesis Space


Agnostic PAC

Drop our assumption that c ∈ H. Now compete against our best-in-class hypothesis h* = argmin_{h ∈ H} risk(h). Bound shape changes: instead of getting close to perfect, we get close to best-possible-within-H.


Agnostic bound: P[risk(h) ≤ risk(h*) + ε] ≥ 1 − δ with sample complexity m = O(1/ε² × (VC(H) + log(1/δ))). Note ε² instead of ε in our denominator: agnostic learning needs more samples for our same precision.


PAC-Bayes (McAllester 1999)

Move from picking a single hypothesis to picking a distribution over hypotheses. Start with prior P. Observe data. Derive posterior Q. PAC-Bayes bounds expected risk under Q:


𝔼_{h~Q}[risk(h)] ≤ 𝔼_{h~Q}[empirical_risk(h)] + √[(KL(Q‖P) + ln(2√n/δ)) / 2n]


KL(Q‖P) measures how far our posterior moved from our prior. Two interpretations:


1. Compression view. A posterior close to its prior (small KL) generalizes well — small information cost = small generalization gap.

2. Bayesian view. Strong prior + weak data update = tight bound; weak prior + heavy data update = looser bound.


Why PAC-Bayes matters. Empirical PAC-Bayes bounds (Catoni 2007, Dziugaite & Roy 2017) give numerically meaningful generalization guarantees for real neural networks where classical PAC bounds run vacuous. They remain our tightest theoretical handle on overparameterized generalization.

Reading a PAC-Bayes Bound

Suppose we train a network on n = 50,000 labeled examples. After training, our posterior Q over weights has KL(Q‖P) = 200 nats relative to a Gaussian prior P. Empirical risk under Q runs at 0.10. Set δ = 0.05.

Compute the PAC-Bayes upper bound on expected risk. Show the substitution into √[(KL + ln(2√n/δ)) / 2n]. Round ln(2√n/δ) to a whole number. State whether our bound is meaningful (i.e., predicts true risk < 0.5).

Overparameterization & Double Descent

Classical PAC's Predicted Disaster

Classical PAC predicts: more parameters than samples = catastrophic overfitting. Learn perfectly on training data, generalize randomly on test data. VC bound predicts memorization without learning.


Modern neural networks routinely have 10× to 1000× more parameters than training samples & still generalize. This shouldn't happen by classical theory. It happens anyway.


Double Descent Curve


The U-Curve We Were Taught

Classical bias-variance: as model capacity grows, training error falls monotonically; test error first falls (less underfit), reaches a minimum, then rises (overfit). U-shape predicted by VC theory.


What Actually Happens — Double Descent

Belkin et al (2019) showed test error follows our classical U-curve up to interpolation threshold (capacity = #samples), then DROPS AGAIN past that threshold. Massively overparameterized models generalize better than just-large-enough models.


Why Classical PAC Misses This


1. Distribution-free assumption is too pessimistic. Real data has structure (low intrinsic dimension, manifold geometry). PAC bounds hold over worst-case distributions; real distributions exploit structure SGD also exploits.

2. Implicit regularization. SGD on overparameterized networks finds minimum-norm or minimum-complexity interpolating solutions, not arbitrary ones. Classical theory assumes worst-case empirical risk minimizer; real algorithms pick benign ones.

3. Hypothesis class definition is wrong. Effective hypothesis class explored by SGD is far smaller than nominal weight space. PAC-Bayes posteriors capture this; VC dimension does not.


Modern theoretical work (Bartlett, Long, Lugosi, Tsigler 2020 on benign overfitting; Nakkiran et al 2020 on double descent) is building post-PAC generalization theory that accounts for our overparameterized regime.

Diagnosing a Failure of Classical PAC

A research team trains a 1-billion-parameter network on 100,000 labeled examples. Classical PAC predicts vacuous bounds. Empirically, test accuracy reaches 95%.

Identify three concrete reasons classical PAC fails to predict this success. For each reason, name a phenomenon (distribution structure, implicit regularization, double descent, posterior concentration) & briefly explain why classical PAC misses it. 2-3 sentences per reason.

Kaplan, Chinchilla, & Sizing Automated General Intelligence

From Bounds to Empirical Scaling Laws

Classical PAC predicts dataset size theoretically & runs vacuous. Empirical scaling laws predict dataset size from observation & actually work. They have replaced PAC as our practical sizing tool for large models.


Compute Optimal Training Surface


Kaplan et al (2020)

Loss follows a power law in parameters N, dataset tokens D, & compute C:


L(N) ≈ (Nc/N)^αN L(D) ≈ (Dc/D)^αD L(C) ≈ (Cc/C)^αC


Doubling parameters drops loss by a fixed multiplicative factor. Doubling data drops loss by another fixed factor. These exponents (αN, αD, αC) measure & predict frontier behavior across many orders of magnitude.


Hoffmann et al 2022 (Chinchilla)

Chinchilla showed Kaplan's exponents over-weighted parameters relative to data. Compute-optimal training requires roughly 20 tokens per parameter:


D_opt ≈ 20 × N


GPT-3 (175B params) trained on ~300B tokens — vastly under-trained by Chinchilla math. A compute-optimal 175B-parameter model wants ~3.5 trillion tokens.


The Data Wall

Scaling our parameter count requires scaling our token count proportionally. Public web crawl yields ~10¹³ useful tokens. A hypothetical 10¹⁵-parameter automated general intelligence would want ~2×10¹⁶ tokens — three orders of magnitude beyond available web data.


This is our data wall. Scaling laws predict we hit a corpus shortage long before we hit a compute shortage. Synthetic data, multimodal corpora (video, audio, sensor streams), & RL-from-environment all aim to push past it.


Classical PAC told us (incorrectly) that we needed 10²¹ samples. Scaling laws tell us (calibrated to reality) that we need 2×10¹⁶. Both numbers exceed available text. Frontier work today wrestles with closing that gap.

Estimating an Automated General Intelligence Corpus

Suppose a frontier lab proposes a 10¹⁴-parameter model & claims it will reach automated general intelligence. We want to size its training corpus by Chinchilla's rule.

Apply D_opt ≈ 20 × N to estimate compute-optimal token count for N = 10¹⁴ parameters. Compare to public web (~10¹³ tokens). State by what factor we fall short, & name two strategies frontier labs use to close that gap.

From Theoretical Bounds to Production Measurements

Stop Computing Bounds; Start Measuring Them

Classical PAC bounds run vacuous. PAC-Bayes bounds run tight under assumptions hard to verify. A practical alternative: measure ε empirically on your real distribution & update it as your system runs.


Beta Posterior Tightening


Idea 1 — Make Coverage as Empirical PAC

A make coverage target runs N held-out questions through your query system & measures two rates:


- cache_hit_rate — fraction your system found a known answer for

- i_dont_know_rate — fraction your system honestly deferred


Each measurement = a Bernoulli trial. From observed counts, compute a Wilson confidence interval on true rate. Example: N = 1000 queries, observed i_dont_know_rate = 0.20 → 95% CI ≈ [0.177, 0.226]. Upper bound 0.226 functions exactly like a PAC ε at δ = 0.05, derived from real data on real distribution rather than worst-case theoretical analysis.


Classical PAC vocabulary applies (ε, δ, confidence). Different machinery (binomial concentration vs. VC theory). Tighter result because real distributions carry exploitable structure.


Idea 2 — PAC-Bayes Audit via Falsification Events

Treat each falsification (system answer demonstrably wrong) as evidence updating posterior over true error rate ε:


1. Prior: ε ~ Beta(α₀, β₀). Pick weak prior, e.g., Beta(1, 1) = uniform.

2. Each query produces a Bernoulli outcome: falsified (1) or held (0).

3. Posterior after k falsifications in n queries: Beta(α₀ + k, β₀ + n − k).

4. Posterior mean: (α₀ + k) / (α₀ + β₀ + n) → empirical falsification rate as n → ∞.

5. 95% credible interval on ε tightens as 1/√n.


What This Buys Us


- Real-world ε₀ estimate from actual deployment, not worst-case theory

- Anytime-valid alarm: when posterior credible interval crosses contract threshold, page someone

- Monotone confidence: more queries observed → tighter CI → stronger guarantee


Cousin techniques: conformal prediction with online recalibration (Vovk), sequential probability ratio tests (Wald), online PAC-Bayes (Haddouche & Guedj 2022). Same family, different mathematical machinery.

Computing a Beta Posterior on Falsifications

Our team runs a coverage audit on a production query system. Prior on true error rate ε: Beta(1, 1) (uniform). After 200 held-out queries we observed 8 falsifications.

Compute (a) posterior parameters Beta(α, β) after observing this data; (b) posterior mean of ε; (c) approximate 95% credible interval upper bound using μ + 2σ where σ = √(αβ/((α+β)²(α+β+1))). Then state whether you would ship this system to production if its contract requires ε ≤ 0.10.