un — Geometry of PAC Learning

un

guest

1 / ?

back to lessons

PAC as a Two-Axis Plane

Two Axes, One Sample-Count Surface

Plot ε on horizontal axis (error tolerance, range 0 to 1). Plot δ on vertical axis (probability of failure, range 0 to 1). Each point in this unit square corresponds to a (ε, δ) demand pair.

PAC ε δ Budget Plane

Above each point sits a sample-count value m(ε, δ) = (1/ε)(ln|H| + ln(1/δ)). Together, those m values trace a curved surface above our square. Stricter demands (smaller ε, smaller δ) pull our surface upward; looser demands flatten it.

Iso-Sample Contour Lines

Project our surface back down to our plane as iso-m contours. Every (ε, δ) pair on a single contour requires our same sample budget. Move along a contour to trade error tolerance for confidence at fixed cost.

Halving an Axis

Halving ε along our horizontal moves m up by factor 2 (linear in 1/ε). Halving δ along our vertical moves m up by ln(2) ≈ 0.69 (logarithmic in 1/δ). Geometry tells us: error tolerance carries a steeper cost than confidence.

Reading the Budget Surface

We sit at point (ε = 0.05, δ = 0.05) for hypothesis class |H| = 10⁶. Sample requirement m₀ = (1/0.05)(ln(10⁶) + ln(20)) = 20 × (13.8 + 3.0) = 336.

(a) Compute new sample requirement m₁ at (ε = 0.025, δ = 0.05) — half error, same confidence. (b) Compute m₂ at (ε = 0.05, δ = 0.025) — same error, half failure-probability. (c) Which axis carries the steeper sample cost, & by what ratio?

Dichotomies on Point Clouds

What Shattering Looks Like

Place n points in our plane. Pick a hypothesis class (linear classifiers = straight lines). Count how many distinct ways our class can label those n points (+/− on each side of a line). Call this count Π_H(n).

VC Shattering Three Points

If Π_H(n) = 2ⁿ, our class shatters that point set — it can produce every possible labeling. If Π_H(n) < 2ⁿ, some labelings cannot occur.

Three Points in General Position

Linear classifiers in ℝ² shatter any 3 non-collinear points. 2³ = 8 labelings; all 8 achievable by some line. Pick any 3 points; for each ±/± labeling, draw a line that separates positives from negatives.

Four Points Refuse to Shatter

Place 4 points at corners of a square. Try to label diagonal pair as positive & anti-diagonal pair as negative (XOR labeling). No straight line separates them. So Π_H(4) ≤ 14 < 16 = 2⁴.

VC Dimension as Maximum Shatter Size

VC(linear ℝ²) = 3. We can shatter 3 points; we cannot shatter 4. VC counts maximum dichotomy capacity of our hypothesis class.

Geometric Intuition

Higher VC = our class draws more elaborate decision boundaries. Linear (VC = d+1 in d dimensions) draws hyperplanes. Polynomials draw curves. Neural networks draw highly folded manifolds. More foldability = more dichotomies = higher VC = higher sample requirement.

Counting Dichotomies

Consider linear classifiers in ℝ² (lines). We have 5 points placed in general position (no 3 collinear, none redundant).

(a) Compute 2⁵ = number of all possible ±/± labelings. (b) Sauer-Shelah lemma bounds Π_H(n) ≤ Σ_{i=0}^{d} C(n, i) where d = VC(H). Apply this with d = 3, n = 5: compute C(5,0) + C(5,1) + C(5,2) + C(5,3). (c) State by what factor lines fall short of full shattering.

Probability Mass on Hypothesis Manifold

Picturing PAC-Bayes

Picture our hypothesis space as a high-dimensional manifold. Each point on this manifold corresponds to one weight configuration of a neural network. Prior P assigns a probability distribution across our manifold (often Gaussian centered at initialization). Posterior Q concentrates probability mass where training data drove our weights.

PAC Bayes Posterior over Hypothesis Space

KL Divergence as Geometric Distance

KL(Q‖P) measures how far Q drifted from P. Geometric reading: how much our posterior cloud moved from prior cloud, weighted by how unlikely each posterior region was under our prior.

Small KL = Q overlaps P heavily. Posterior barely moved. Generalization gap stays small.

Large KL = Q concentrated in regions P assigned little mass to. Posterior moved a lot. Generalization gap grows.

Why This Geometry Matters

Imagine SGD as a search trajectory across our hypothesis manifold. Trajectory ends at a basin of low training loss. PAC-Bayes asks: how wide is this basin?

Wide basin = many neighboring weight configurations also achieve low training loss. Posterior Q can spread over a wide region & still have low risk. KL(Q‖P) stays bounded. Generalization gap small.

Narrow basin = only a thin set of weights achieves low loss. Posterior must concentrate sharply. KL grows. Generalization gap widens.

This connects directly to flat-vs-sharp minima discourse (Hochreiter & Schmidhuber 1997, Keskar et al 2017). Flat minima generalize better because they support wider posteriors with smaller KL.

Reading a Basin Width

Two trained models reach identical training loss but live in different basins:

- Model A: flat basin, posterior spreads over region with KL(Q_A‖P) = 50 nats.

- Model B: sharp basin, posterior concentrates with KL(Q_B‖P) = 500 nats.

Both trained on n = 10,000 examples with empirical risk 0.05, δ = 0.05.

Compute the PAC-Bayes generalization gap √[(KL + ln(2√n/δ)) / 2n] for each model. Then state which generalizes better & give the geometric reason.

A Curve That Falls Where Theory Predicted Rise

Classical U-Curve

Plot model capacity on horizontal axis. Plot test risk on vertical axis. Classical bias-variance theory predicts:

- Low capacity: high bias, high test risk (underfit)

- Mid capacity: low bias + low variance, low test risk (sweet spot)

- High capacity: low bias, high variance, high test risk (overfit)

Result: U-shaped curve. Pick capacity at our bottom.

Double Descent Curve

What Belkin et al (2019) Observed

Past interpolation threshold (capacity where model exactly fits training data with zero error), test risk DROPS again. Curve reads: descent → peak at interpolation → second descent. Two descents, one curve.

Geometric Reading of Second Descent

At interpolation threshold, model has just enough capacity to fit training data — only one (or few) interpolating solutions exist & they tend to be jagged. Generalization suffers because solution chosen is forced.

Past interpolation threshold, MANY interpolating solutions exist. SGD has freedom to pick a smooth one (minimum-norm, low-curvature). Geometric picture: solution manifold becomes wider & flatter. SGD's implicit regularization picks benign solutions from this flat manifold. Test risk falls.

Why Classical Theory Misses This

VC dimension counts solution-set capacity but ignores which solution gets chosen. Classical bound assumes worst-case empirical risk minimizer. Reality: SGD reliably picks our flattest, smoothest interpolating solution. Once we count SOLVER-CHOSEN solutions instead of all solutions, second descent makes sense.

Geometric Take-Home

Capacity matters less than basin geometry. Wide flat basins (post-interpolation) generalize better than narrow sharp ones (at interpolation). Modern theory tries to bound generalization by basin width, not by parameter count.

Locating the Two Descents

On a double descent curve, three regions matter: (1) under-parameterized regime, (2) interpolation peak, (3) over-parameterized regime.

Describe geometrically what happens to (a) the solution manifold's width, (b) basin curvature at chosen minimum, & (c) implicit regularization's role at each of the three regions. State which region modern training (e.g. GPT-class models) operates in & why classical PAC fails to predict its success.

Power-Law Surface in Parameter-Token Space

A 3D Surface

Plot parameters N on one horizontal axis. Plot tokens D on a second horizontal axis. Plot loss L on vertical. Empirical loss carves a power-law surface across this (N, D) plane:

L(N, D) ≈ (Nc/N)^αN + (Dc/D)^αD + L∞

Compute Optimal Training Surface

Surface slopes downward as either N or D grows. Slopes follow log-linear power laws (straight lines in log-log plot). Asymptote L∞ stays positive — irreducible loss our model cannot shrink past.

Compute-Optimal Ridge

Fix total compute budget C ∝ N × D (parameters × tokens, roughly). Slice our surface along this constraint. Slice trace cuts a 2D curve through 3D surface. Bottom of this curve = compute-optimal point.

Chinchilla (Hoffmann et al 2022) computed this bottom analytically: D_opt ≈ 20 × N. Curve along compute budget = a ridge. Walking along ridge: equal compute, decreasing loss. Walking off ridge (more parameters than 20× tokens, or fewer): wasted compute.

Geometric Reading of GPT-3 vs Chinchilla

GPT-3: 175B params, 300B tokens. Chinchilla-optimal would want 175B × 20 = 3500B tokens. GPT-3 sits far off the compute-optimal ridge in our parameter-heavy direction. Chinchilla itself: 70B params trained on 1400B tokens. 1400 / 70 = 20 — exactly on ridge. Chinchilla beat GPT-3 with less than half its parameter count by sitting on geometric optimum.

Data Wall as Vertical Plane

Public web ~10¹³ usable tokens. This plots as a vertical wall at D = 10¹³ on our parameter-token plane. Beyond this wall, compute-optimal training requires N ≤ D / 20 = 5 × 10¹¹ params. Walls beyond N = 5 × 10¹¹ either run under-trained (off-ridge) or require synthetic / multimodal / RL data to push wall outward.

Walking the Compute-Optimal Ridge

We sit at GPT-3 coordinates: N = 175B params, D = 300B tokens. Compute proxy C = N × D = 5.25 × 10²² param-tokens.

(a) Compute Chinchilla-optimal (N*, D*) for our same compute budget by solving D* = 20 × N* with N* × D* = 5.25 × 10²². (b) State by what factor N* differs from GPT-3's 175B. (c) Describe geometrically what 'walking from GPT-3's point to (N*, D*) along the constant-compute curve' looks like on the parameter-token surface.

Beta Posterior Tightening into a Needle

A Probability Density on [0, 1]

Beta(α, β) is a probability density over the unit interval [0, 1]. Variable: ε = true error rate. Shape: α controls mass on high-ε side; β controls mass on low-ε side.

Beta Posterior Tightening

Beta(1, 1): uniform — no information, flat density across [0, 1].

Beta(α, β) with α + β large: concentrated peak at α / (α + β).

Width of Beta peak shrinks as 1/√(α+β). Adding 100 observations to our prior tightens the peak by factor √100 = 10. Adding 10000 observations tightens by √10000 = 100.

Geometric Reading of an Audit Run

Start: Beta(1, 1) = flat rectangle on [0, 1]. Maximum uncertainty about ε.

After 200 queries with 8 falsifications: Beta(9, 193). Mean = 9/202 ≈ 0.045. Density now a sharp hump centered near 0.045 with characteristic width σ ≈ 0.014.

After 2000 queries with 80 falsifications: Beta(81, 1921). Mean still ≈ 0.045, but width σ ≈ 0.0046. Hump three times sharper.

After 200,000 queries with 8000 falsifications: Beta(8001, 192,001). Mean ≈ 0.040, width σ ≈ 0.0004. Hump becomes a needle.

Geometric Convergence to a Point Mass

As n → ∞, Beta posterior collapses to a Dirac delta at true ε. Geometry: rectangle → wide hump → narrow hump → needle → point. Each query tightens our distribution by 1/√n.

Why This Beats Theoretical PAC Bounds

Theoretical PAC bounds give a STATIC ε estimate based on hypothesis class size. Beta posterior gives a DYNAMIC ε estimate that tightens with every observation, calibrated against your real-world distribution. Theoretical bound = a guarantee under worst-case assumptions. Empirical audit = a measurement of actual reality.

How Many Queries to Halve the Credible Interval?

We currently sit at Beta(9, 193) after 200 queries: mean ε ≈ 0.045, σ ≈ 0.014. We want to halve the credible-interval width to σ ≈ 0.007.

(a) Recall σ scales as 1/√(α+β). To halve σ, by what factor must (α + β) grow? (b) Currently α + β = 202. Compute target α + β. (c) Compute additional queries needed (assuming constant ~4% falsification rate). (d) Geometrically describe what halving σ does to the shape of the Beta density.