un — Grow a Language Model: Coverage Bonus

CUDA Reports Doc Indices

A CUDA Trainer Knows Which Document It Sampled

Each training step pulls a sequence from a .btok binary, which packs many documents end to end. The CUDA trainer records the doc index alongside the loss, for example: step 47213, source=gutenberg, doc=128407, loss=2.81. A proxy collects these reports & maintains a set of unique doc indices seen per source.
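A minimal sketch of the proxy side of this bookkeeping. The report format is assumed to match the example line above exactly; the names `REPORT_RE`, `parse_report` & `record` are hypothetical and are not taken from `training_proxy.py`.

```python
import re
from collections import defaultdict

# Assumed report format, modeled on the single example line in the lesson.
REPORT_RE = re.compile(
    r"step (?P<step>\d+), source=(?P<source>[\w-]+), "
    r"doc=(?P<doc>\d+), loss=(?P<loss>[\d.]+)"
)

def parse_report(line):
    """Return (source, doc_index) for one report line, or None if it does not match."""
    m = REPORT_RE.search(line)
    if m is None:
        return None
    return m.group("source"), int(m.group("doc"))

def record(seen_docs, line):
    """Add the reported doc index to its source's set; repeats are absorbed by the set."""
    parsed = parse_report(line)
    if parsed is not None:
        source, doc = parsed
        seen_docs[source].add(doc)

# One set of unique doc indices per source.
seen_docs = defaultdict(set)
for line in [
    "step 47213, source=gutenberg, doc=128407, loss=2.81",
    "step 47214, source=gutenberg, doc=128407, loss=2.79",  # same doc sampled again
    "step 47215, source=hermes3-general, doc=902, loss=3.10",
]:
    record(seen_docs, line)
```

A set is the natural container here because sampling the same document twice must not raise the unique count.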

From Counts to Coverage

Coverage of a source = unique_docs_seen / n_docs. A few examples:

Source	n_docs	unique seen	coverage
gutenberg	512,000	154,000	30.1%
hermes3-general	67,395	47,176	70.0%
dictionary	88,000	88,000	100.0%
synthetic-chat	1,400	1,400	100.0%

Tiny sources saturate fast. Large sources drift below 50% for weeks. Coverage bonus rewards a bandit for visiting documents it has not yet sampled within a source.
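The coverage column in the table can be recomputed directly from the two counts. This is a sketch only; the function name `coverage` & the zero-document guard are assumptions added for illustration.

```python
def coverage(unique_seen, n_docs):
    """Fraction of a source's documents sampled at least once."""
    if n_docs <= 0:
        return 0.0  # assumed convention for an empty source
    return unique_seen / n_docs

# (n_docs, unique seen) pairs copied from the table above.
sources = {
    "gutenberg": (512_000, 154_000),
    "hermes3-general": (67_395, 47_176),
    "dictionary": (88_000, 88_000),
    "synthetic-chat": (1_400, 1_400),
}

for name, (n_docs, unique_seen) in sources.items():
    print(f"{name}: {coverage(unique_seen, n_docs):.1%}")
```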

Coverage bonus per source

Bonus Formula

Coverage bonus scales linearly from 1.3x at 0% coverage down to 1.0x at 50% coverage, then flat at 1.0x above 50%:

if coverage < 0.5:
    bonus = 1.0 + 0.3 * (1.0 - coverage / 0.5)
else:
    bonus = 1.0

A source at 0% coverage earns 1.3x; a source at 25% earns 1.15x; a source at 50% drops to 1.0x. Above 50%, no bonus applies.
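The same rule wrapped in a function, as a sketch. The name `coverage_bonus` is an assumption; the arithmetic is the formula above, unchanged.

```python
def coverage_bonus(coverage):
    """Linear ramp from 1.3x at 0% coverage to 1.0x at 50%, flat at 1.0x afterwards."""
    if coverage < 0.5:
        return 1.0 + 0.3 * (1.0 - coverage / 0.5)
    return 1.0

# The three reference points quoted in the text.
for c in (0.0, 0.25, 0.5):
    print(f"coverage={c:.0%} bonus={coverage_bonus(c):.2f}x")
```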

Compute the Bonus

Consider a run with gutenberg coverage at 30% & hermes3-general coverage at 70%. Compute the coverage bonus multiplier for each source. Show your arithmetic.

Two Distinct Freshness Signals

Same Goal, Different Granularity

ANDREA has two mechanisms that prevent over-training on a single source. They sound similar; they measure different things.

Epoch penalty. Tracks aggregate over-pulling, with epochs = lifetime_pulls / n_docs. When lifetime_pulls / n_docs > 1.0, a source has theoretically wrapped past every document at least once. Penalty = 1 / (1 + epochs). A 1.4K-document synthetic-chat source at 5,600 lifetime pulls (epochs = 5,600 / 1,400 = 4) earns penalty 1/5 = 0.2x. Epoch counts persist across restarts; they never decay.

Coverage bonus. Tracks per-document freshness within a source. CUDA reports doc indices; the proxy maintains a set per source. Sources below 50% coverage of unique docs earn up to 1.3x. Coverage rewards exploring a source's tail; epoch penalty punishes exhausting it.
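A sketch of the epoch penalty for comparison. One detail is an assumption: epochs are counted as whole passes (integer division), so a source under one full pass gets 1.0x. The lesson does not say whether fractional epochs are used; the name `epoch_penalty` is also hypothetical.

```python
def epoch_penalty(lifetime_pulls, n_docs):
    """Penalty = 1 / (1 + epochs), with epochs counted as completed passes (assumed)."""
    epochs = lifetime_pulls // n_docs  # assumption: whole passes only
    return 1.0 / (1.0 + epochs)

# The synthetic-chat example from the text: 5,600 pulls over 1,400 documents.
print(epoch_penalty(5_600, 1_400))
```

Under this reading, a large source that has not completed a single pass is left alone, which matches the claim that the epoch penalty ignores sources far from epoch 1.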

Why Both Matter

Signal	Tracks	Direction	Cap	Persists across restarts
Epoch penalty	aggregate over-pulling	reduces	1/(1+e)	yes
Coverage bonus	per-doc freshness	boosts	1.3x	yes

A 500K-document gutenberg source can stay below 50% coverage for the entire 200K-step training run while never approaching epoch=1. Epoch penalty ignores it; coverage bonus actively pulls a bandit toward gutenberg's unexplored 70% tail.

Conversely, a 1.4K-document synthetic-chat source saturates coverage (100%) within a few thousand pulls; coverage bonus stays at 1.0x while epoch penalty grows.

Distinguish the Two

Imagine two sources mid-training: source A has 1,400 documents & 8,400 lifetime pulls. Source B has 500,000 documents & 80,000 lifetime pulls; the proxy has logged 75,000 unique doc indices for B so far. Which signal (epoch penalty or coverage bonus) governs each source's bandit weight, & why?

What Coverage Bonus Buys ANDREA

The Failure Mode It Prevents

Without doc-level tracking, a bandit selecting on per-step reward picks .btok sequences greedily. A 500K-document gutenberg corpus contains a few thousand sequences with low cross-entropy (consistent prose, common vocabulary). A reward-only bandit returns to those sequences repeatedly because they keep producing strong reward signals.

Result: a 500K-document corpus gets sampled across roughly 2K-5K distinct sequences over 200K training steps. The model memorizes those sequences without ever seeing the rest. Capacity is wasted; coverage stays stuck at or below 1%.

What Coverage Bonus Buys

The bonus is 1.3x at 0% coverage, scaling linearly down to 1.0x at 50%. That nudge propagates through UCB1 selection: arms with low coverage stay competitive even when their per-pull reward dips. The bandit explores the tail by design rather than by accident.
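To illustrate how a multiplier can tip UCB1 selection, here is a sketch using the standard UCB1 index (mean reward plus sqrt(2 ln N / n)). Scaling the mean reward by the bonus is an assumption about where the multiplier enters; the arm statistics below are made-up illustration values, not measurements from ANDREA.

```python
import math

def ucb1_score(mean_reward, pulls, total_pulls, bonus=1.0):
    """Standard UCB1 index with the mean reward scaled by a coverage bonus (assumed placement)."""
    if pulls == 0:
        return math.inf  # untried arms are always selected first
    exploration = math.sqrt(2.0 * math.log(total_pulls) / pulls)
    return mean_reward * bonus + exploration

def select_arm(arms):
    """Pick the arm name with the highest bonus-adjusted UCB1 score."""
    total_pulls = sum(a["pulls"] for a in arms.values())
    return max(
        arms,
        key=lambda name: ucb1_score(
            arms[name]["mean_reward"], arms[name]["pulls"], total_pulls, arms[name]["bonus"]
        ),
    )

# Hypothetical statistics: gutenberg has a lower raw reward but sits at 25% coverage (1.15x).
arms = {
    "dictionary": {"mean_reward": 0.50, "pulls": 1000, "bonus": 1.0},
    "gutenberg": {"mean_reward": 0.45, "pulls": 1000, "bonus": 1.15},
}
print(select_arm(arms))
```

With equal pull counts the exploration terms cancel, so the choice comes down to 0.45 x 1.15 against 0.50; without the bonus the same comparison favors dictionary.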

Across a 200K-step run on a 500K-doc gutenberg, coverage bonus typically raises observed coverage from ~3% (no bonus) to ~25-30% (with bonus). Same compute, eight to ten times more documents touched.

Where the Tracking Lives

Component	Responsibility
`microgpt_cuda.cu`	Reports doc index per training step
`training_proxy.py`	Maintains `seen_docs` set per source
`training_proxy.py`	Computes coverage, applies bonus to bandit reward
`training_proxy.py`	Persists `seen_docs` to `.state.json` across restarts
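A sketch of how per-source sets could be persisted. JSON has no set type, so each set is written as a sorted list and rebuilt on load. The `seen_docs` key layout, the helper names `save_state` & `load_state`, and the write-then-rename step are all assumptions; the lesson only states that `seen_docs` is persisted to `.state.json`.

```python
import json
import os
import tempfile

def save_state(path, seen_docs):
    """Write per-source doc-index sets as sorted lists (assumed schema)."""
    payload = {"seen_docs": {src: sorted(docs) for src, docs in seen_docs.items()}}
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(payload, f)
    os.replace(tmp_path, path)  # swap in the new file only after a complete write

def load_state(path):
    """Rebuild the per-source sets; a missing file means a fresh run."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return {src: set(docs) for src, docs in payload.get("seen_docs", {}).items()}

# Round trip in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    state_path = os.path.join(tmp_dir, ".state.json")
    save_state(state_path, {"gutenberg": {128407, 5}, "dictionary": {1}})
    restored = load_state(state_path)
```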

Connect to a Concrete Engineering Choice

Suppose you removed coverage bonus from ANDREA-120M training. Predict one concrete consequence for the gutenberg arm specifically (which has 500K+ documents) over a 200K-step run. Reference either coverage percentage, document diversity, or downstream sample quality.