
Logarithmic Scale of Factorials

Stirling's approximation converts a product into a sum, which is the fundamental move that makes large-n mathematics tractable:

ln(n!) ≈ n·ln(n) − n + 0.5·ln(2πn)

This formula arises from approximating the sum Σ ln(k) for k=1..n by the integral of ln(x), then applying the trapezoid rule to bound the error.
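The sum-vs-integral step can be checked directly; a minimal sketch in Python (names of my own choosing), comparing Σ ln(k) with ∫₁ⁿ ln(x) dx = n·ln(n) − n + 1:

```python
import math

# The sum Σ ln(k), k = 1..n, equals ln(n!); the integral of ln(x) from 1 to n
# is n·ln(n) − n + 1, which already captures Stirling's two leading terms.
n = 50
log_sum = sum(math.log(k) for k in range(1, n + 1))   # exact ln(50!)
integral = n * math.log(n) - n + 1                     # leading terms only
gap = log_sum - integral                               # ≈ 0.5·ln(2πn) − 1
print(f"sum      = {log_sum:.4f}")
print(f"integral = {integral:.4f}")
print(f"gap      = {gap:.4f}")
```

The leftover gap is what the 0.5·ln(2πn) correction accounts for.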

Why It Matters Geometrically

The n-dimensional sphere volume formula involves Γ(n/2 + 1), which for integer n equals (n/2)! or products of half-integers. Stirling lets us estimate these for large n without computing each value directly.

In base 10, Stirling's leading terms read log₁₀(n!) ≈ n·log₁₀(n) − n·log₁₀(e) ≈ n·log₁₀(n) − 0.434·n, useful for order-of-magnitude estimates such as counting the digits of n!.

For n = 10: ln(10!) ≈ 10·2.303 − 10 + 0.5·ln(62.83) ≈ 23.03 − 10 + 2.07 = 15.10 (true: 15.104).

For n = 100: ln(100!) ≈ 100·4.605 − 100 + 0.5·ln(628.3) ≈ 460.5 − 100 + 3.22 = 363.7 (true: 363.74).
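Both worked examples can be verified against the exact value via `math.lgamma` (ln Γ(n+1) = ln n!); a quick sketch:

```python
import math

def stirling_ln_factorial(n: int) -> float:
    """Stirling's approximation: ln(n!) ≈ n·ln(n) − n + 0.5·ln(2πn)."""
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)

for n in (10, 100):
    exact = math.lgamma(n + 1)            # exact ln(n!)
    approx = stirling_ln_factorial(n)
    print(f"n={n:3d}  stirling={approx:.4f}  exact={exact:.4f}")
```

The error shrinks like 1/(12n), so the approximation is already four-digit accurate by n = 100.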

Stirling at n=20

A direct computation: ln(20) ≈ 2.996. ln(2π·20) = ln(125.66) ≈ 4.833.

Compute ln(20!) using Stirling's log formula. Then estimate 20! by taking e^(your answer). Compare to the true value 20! = 2,432,902,008,176,640,000 ≈ 2.433 × 10^18. Show all three terms.

The Volume Formula

The volume of an n-dimensional sphere of radius r:

V_n(r) = C_n · r^n where C_n = π^(n/2) / Γ(n/2 + 1)

The C_n values for small n follow a pattern using Γ(1/2) = √π and the reduction formula:

- n=1: C_1 = π^(1/2)/Γ(3/2) = √π/(√π/2) = 2

- n=2: C_2 = π^1/Γ(2) = π/1 = π

- n=3: C_3 = π^(3/2)/Γ(5/2) = π^(3/2)/(3√π/4) = 4π/3

- n=4: C_4 = π²/Γ(3) = π²/2

- n=5: C_5 = π^(5/2)/Γ(7/2) = π^(5/2)/(15√π/8) = 8π²/15

Notice: C_n peaks near n=5 (≈ 5.264) then decreases. For large n, C_n → 0.
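The closed forms and the peak at n = 5 can be confirmed numerically with `math.gamma`; a minimal sketch:

```python
import math

def unit_ball_volume(n: int) -> float:
    """C_n = π^(n/2) / Γ(n/2 + 1): volume of the unit n-ball."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

# Closed forms above: C_1 = 2, C_2 = π, C_3 = 4π/3, C_4 = π²/2, C_5 = 8π²/15
for n in range(1, 8):
    print(f"C_{n} = {unit_ball_volume(n):.4f}")
```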

[Chart: unit sphere volume C_n vs. dimension n]

Maximum at n=5

C_5 = 8π²/15. With π² ≈ 9.870:

C_5 = 8·9.870/15 = 78.96/15 ≈ 5.264

To verify this is a maximum: C_6 = π³/6 ≈ 31.006/6 ≈ 5.168. So C_6 < C_5 — the peak occurred at n=5.

Verify that C_4 = π²/2 ≈ 4.935. Then compute C_5/C_4 and C_6/C_5. Do these ratios confirm a peak between n=4 and n=6? Show your work.

Fraction of Volume in Corners

The corner paradox quantified: what fraction of an n-dimensional unit hypercube [−1,1]^n lies outside the inscribed sphere of radius 1?

Corner fraction = 1 − C_n / 2^n

Corner Paradox

| n  | C_n  | 2^n  | Sphere fraction | Corner fraction |
|----|------|------|-----------------|-----------------|
| 2  | 3.14 | 4    | 78.5%           | 21.5%           |
| 3  | 4.19 | 8    | 52.4%           | 47.6%           |
| 4  | 4.93 | 16   | 30.8%           | 69.2%           |
| 5  | 5.26 | 32   | 16.4%           | 83.6%           |
| 6  | 5.17 | 64   | 8.1%            | 91.9%           |
| 10 | 2.55 | 1024 | 0.25%           | 99.75%          |
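The table rows follow directly from the corner-fraction formula; a sketch that reproduces them:

```python
import math

def corner_fraction(n: int) -> float:
    """Fraction of the hypercube [−1,1]^n outside the inscribed unit sphere."""
    c_n = math.pi ** (n / 2) / math.gamma(n / 2 + 1)   # unit-ball volume C_n
    return 1 - c_n / 2 ** n                            # 1 − C_n / 2^n

for n in (2, 3, 4, 5, 6, 10):
    print(f"n={n:2d}  corner fraction = {corner_fraction(n):.4%}")
```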

For n=8, C_8 = π⁴/24 ≈ 4.059. Compute the corner fraction. Then interpret: if you draw 1000 uniform random samples from the 8-dimensional unit hypercube, how many do you expect to land inside the inscribed sphere?

Implications for Optimization

The corner paradox has direct consequences for optimization in high-dimensional spaces:

Random search fails. A random point in n-dimensional parameter space almost certainly lands in a corner — far from the origin, with extreme parameter values. If good solutions cluster near moderate parameter values, random search will almost never find them.

Gradient descent succeeds. By following the local gradient, you navigate the geometry systematically rather than sampling it blindly. The curse of dimensionality hits random methods; structured methods adapt.

Distance concentrates. In high dimensions, all pairwise distances between random points concentrate around the same value: for points uniform in [−1,1]^n they are all approximately √(2n/3). Nearest-neighbor methods break down because 'nearest' and 'farthest' become indistinguishable.
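Distance concentration is easy to observe empirically; a Monte Carlo sketch for points uniform in [−1,1]^n, where the per-coordinate expectation E[(X−Y)²] = 2/3 gives the √(2n/3) concentration value:

```python
import math
import random

def mean_pairwise_distance(n: int, trials: int = 500) -> float:
    """Average distance between pairs of uniform random points in [−1,1]^n."""
    total = 0.0
    for _ in range(trials):
        x = [random.uniform(-1, 1) for _ in range(n)]
        y = [random.uniform(-1, 1) for _ in range(n)]
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return total / trials

for n in (10, 100, 1000):
    print(f"n={n:5d}  mean dist = {mean_pairwise_distance(n):7.3f}"
          f"  sqrt(2n/3) = {math.sqrt(2 * n / 3):7.3f}")
```

As n grows, the sample mean hugs √(2n/3) ever more tightly, and the relative spread around it shrinks.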

Hamming's prescription: understand the geometry before trusting your intuition. In high-dimensional spaces, the geometry is counterintuitive, and the math is the only reliable guide.

A neural network has 10,000 weight parameters. Each weight initialized uniformly in [−1, 1]. The corner paradox tells us that essentially none of these initialization points lie inside the unit 10,000-dimensional sphere. Yet neural networks train successfully from random initialization. What does this tell us about the geometry of the loss landscape, and what breaks the analogy between 'good initialization' and 'unit sphere'?