CUDA Reports Doc Indices
A CUDA Trainer Knows Which Document It Sampled
Each training step pulls a sequence from a .btok binary, which packs many documents end to end. The CUDA trainer records a doc index alongside the loss: step 47213, source=gutenberg, doc=128407, loss=2.81. A proxy collects these reports and maintains a set of unique doc indices seen per source.
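A minimal sketch of the proxy side, assuming reports arrive as text lines in the format shown above. The helper names `parse_report` and `record_report` and the regular expression are illustrative assumptions, not taken from training_proxy.py.

```python
import re
from collections import defaultdict

# Assumed report format, copied from the example line above.
REPORT_RE = re.compile(
    r"step (?P<step>\d+), source=(?P<source>[\w-]+), "
    r"doc=(?P<doc>\d+), loss=(?P<loss>[\d.]+)"
)

# One set of unique doc indices per source.
seen_docs = defaultdict(set)


def parse_report(line):
    """Parse one report line into (step, source, doc, loss)."""
    m = REPORT_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognized report: {line!r}")
    return int(m["step"]), m["source"], int(m["doc"]), float(m["loss"])


def record_report(line):
    """Add the reported doc index to its source's set; return True if it is new."""
    _, source, doc, _ = parse_report(line)
    is_new = doc not in seen_docs[source]
    seen_docs[source].add(doc)
    return is_new
```

A set makes repeated reports of the same document idempotent, so `len(seen_docs[source])` is the unique count directly.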
From Counts to Coverage
Coverage of a source = unique_docs_seen / n_docs. A few examples:
| Source | n_docs | unique seen | coverage |
|---|---|---|---|
| gutenberg | 512,000 | 154,000 | 30.1% |
| hermes3-general | 67,395 | 47,176 | 70.0% |
| dictionary | 88,000 | 88,000 | 100.0% |
| synthetic-chat | 1,400 | 1,400 | 100.0% |
Tiny sources saturate fast. Large sources stay below 50% for weeks. The coverage bonus rewards the bandit for visiting documents it has not yet sampled within a source.
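The coverage column can be reproduced from the two counts. A short sketch, with the function name `coverage` chosen for illustration:

```python
def coverage(unique_seen, n_docs):
    """Fraction of a source's documents sampled at least once."""
    if n_docs <= 0:
        raise ValueError("n_docs must be positive")
    return unique_seen / n_docs


# (n_docs, unique seen) pairs from the table above.
sources = {
    "gutenberg": (512_000, 154_000),
    "hermes3-general": (67_395, 47_176),
    "dictionary": (88_000, 88_000),
    "synthetic-chat": (1_400, 1_400),
}

for name, (n_docs, unique_seen) in sources.items():
    print(f"{name}: {coverage(unique_seen, n_docs):.1%}")
```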
Bonus Formula
Coverage bonus scales linearly from 1.3x at 0% coverage down to 1.0x at 50% coverage, then flat at 1.0x above 50%:
    if coverage < 0.5:
        bonus = 1.0 + 0.3 * (1.0 - coverage / 0.5)
    else:
        bonus = 1.0
A source at 0% coverage earns 1.3x; a source at 25% earns 1.15x; a source at 50% drops to 1.0x. Above 50%, no bonus applies.
Compute the Bonus
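The formula above, wrapped as a function. This is a sketch: the parameter names `max_bonus` and `threshold` are assumptions introduced here, with defaults taken from the 1.3x and 50% figures in the text.

```python
def coverage_bonus(coverage, max_bonus=0.3, threshold=0.5):
    """Linear bonus: 1 + max_bonus at 0% coverage, 1.0 at or above threshold."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    if coverage < threshold:
        return 1.0 + max_bonus * (1.0 - coverage / threshold)
    return 1.0


for c in (0.0, 0.25, 0.5, 1.0):
    print(f"coverage {c:.0%}: bonus {coverage_bonus(c):.2f}x")
```

Applied to the table above, gutenberg at 30.1% coverage lands at roughly 1.12x, while the other three sources sit at 1.0x.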
Two Distinct Freshness Signals
Same Goal, Different Granularity
ANDREA has two mechanisms that prevent over-training on a single source. They sound similar; they measure different things.
Epoch penalty. Tracks aggregate over-pulling. When lifetime_pulls / n_docs > 1.0, a source has theoretically wrapped past every document at least once. Penalty = 1 / (1 + epochs). A 1.4K-document synthetic-chat source at 5,600 lifetime pulls (epochs = 4) earns penalty 1/5 = 0.2x. Epoch counts persist across restarts; they never decay.
Coverage bonus. Tracks per-document freshness within a source. CUDA reports doc indices; the proxy maintains a set per source. Sources below 50% coverage of unique docs earn up to 1.3x. Coverage rewards exploring a source's tail; epoch penalty punishes exhausting it.
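A sketch of the epoch penalty as described above. One assumption is labeled in the code: epochs are counted as whole passes (integer division), so a source below one full epoch is not penalized. The text does not say whether ANDREA uses whole or fractional epochs.

```python
def epoch_penalty(lifetime_pulls, n_docs):
    """Penalty = 1 / (1 + epochs), where epochs counts completed passes."""
    if n_docs <= 0:
        raise ValueError("n_docs must be positive")
    # Assumption: whole epochs only, so the penalty is 1.0 until the
    # first full wrap. A fractional count would penalize earlier.
    epochs = lifetime_pulls // n_docs
    return 1.0 / (1.0 + epochs)


# synthetic-chat example from the text: 5,600 pulls over 1,400 docs.
print(epoch_penalty(5_600, 1_400))
```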
Why Both Matter
| Signal | Tracks | Direction | Multiplier range | Persists across restarts |
|---|---|---|---|---|
| Epoch penalty | aggregate over-pulling | reduces | 1/(1 + epochs), at most 1.0x | yes |
| Coverage bonus | per-doc freshness | boosts | 1.0x to 1.3x | yes |
A 500K-document gutenberg source can stay below 50% coverage for the entire 200K-step training run while never approaching epoch=1. The epoch penalty ignores it; the coverage bonus actively pulls the bandit toward gutenberg's unexplored 70% tail.
Conversely, a 1.4K synthetic-chat source saturates coverage (100%) within a few thousand pulls; coverage bonus stays at 1.0x while epoch penalty grows.
Distinguish the Two
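The two signals can be computed side by side for the two sources discussed above. This is an illustrative sketch: `SourceStats` and `freshness_signals` are hypothetical names, the gutenberg pull count of 200,000 is an assumed value for a full run, and epochs are assumed to be whole passes.

```python
from dataclasses import dataclass


@dataclass
class SourceStats:
    n_docs: int
    unique_seen: int
    lifetime_pulls: int


def freshness_signals(stats):
    """Return (coverage_bonus, epoch_penalty) for one source."""
    coverage = stats.unique_seen / stats.n_docs
    bonus = 1.0 + 0.3 * (1.0 - coverage / 0.5) if coverage < 0.5 else 1.0
    epochs = stats.lifetime_pulls // stats.n_docs  # assumption: whole epochs only
    penalty = 1.0 / (1.0 + epochs)
    return bonus, penalty


# Large source: low coverage, under one epoch (pull count is an assumed value).
gutenberg = SourceStats(n_docs=512_000, unique_seen=154_000, lifetime_pulls=200_000)
# Tiny source: saturated coverage, four epochs deep.
synthetic_chat = SourceStats(n_docs=1_400, unique_seen=1_400, lifetime_pulls=5_600)

for name, stats in (("gutenberg", gutenberg), ("synthetic-chat", synthetic_chat)):
    bonus, penalty = freshness_signals(stats)
    print(f"{name}: bonus {bonus:.2f}x, penalty {penalty:.2f}x")
```

Only one signal is active for each source: gutenberg gets a bonus with no penalty, synthetic-chat gets a penalty with no bonus.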
What Coverage Bonus Buys ANDREA
The Failure Mode It Prevents
Without doc-level tracking, a bandit selecting on per-step reward picks .btok sequences greedily. A 500K-document gutenberg corpus contains a few thousand sequences with low cross-entropy (consistent prose, common vocabulary). A reward-only bandit returns to those sequences repeatedly because they keep producing strong reward signals.
Result: a 500K-document corpus gets sampled across maybe 2K-5K distinct sequences over 200K training steps. The model memorizes those sequences without ever seeing the rest. Capacity is wasted; coverage stays stuck at or below 1%.
What Coverage Bonus Buys
The bonus is 1.3x at 0% coverage, scaled down to 1.0x at 50%. That nudge propagates through UCB1 selection: arms with low coverage stay competitive even when their per-pull reward dips. The bandit explores the tail by design rather than by accident.
Across a 200K-step run on a 500K-document gutenberg source, the coverage bonus typically raises observed coverage from ~3% (no bonus) to ~25-30% (with bonus). Same compute, eight to ten times more documents touched.
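A sketch of how the bonus could enter UCB1 selection. It uses the standard UCB1 score (mean reward plus sqrt(2 ln N / n)) and assumes the bonus multiplies the mean reward; the text says only that the bonus is applied to the bandit reward, so the exact placement is an assumption. The function names, arm names, and numbers are hypothetical.

```python
import math


def ucb1_score(mean_reward, pulls, total_pulls, bonus=1.0):
    """Standard UCB1 score with the mean reward scaled by a coverage bonus."""
    if pulls == 0:
        return math.inf  # unpulled arms are always tried first
    exploration = math.sqrt(2.0 * math.log(total_pulls) / pulls)
    return mean_reward * bonus + exploration


def select_arm(arms, total_pulls):
    """Pick the arm name with the highest bonus-adjusted UCB1 score."""
    return max(
        arms,
        key=lambda name: ucb1_score(
            arms[name]["mean_reward"],
            arms[name]["pulls"],
            total_pulls,
            arms[name]["bonus"],
        ),
    )


# Hypothetical arms with equal pull counts, so only the scaled mean differs.
arms = {
    "well-covered": {"mean_reward": 0.50, "pulls": 1000, "bonus": 1.0},
    "low-coverage": {"mean_reward": 0.45, "pulls": 1000, "bonus": 1.3},
}
print(select_arm(arms, total_pulls=2000))
```

With equal pull counts the exploration terms cancel, so the low-coverage arm wins on 0.45 * 1.3 = 0.585 against 0.50, despite its lower raw reward.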
Where the Tracking Lives
| Component | Responsibility |
|---|---|
| microgpt_cuda.cu | Reports doc index per training step |
| training_proxy.py | Maintains seen_docs set per source |
| training_proxy.py | Computes coverage, applies bonus to bandit reward |
| training_proxy.py | Persists seen_docs to .state.json across restarts |
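A sketch of persisting seen_docs across restarts. JSON has no set type, so sets are stored as sorted lists. The `save_state` and `load_state` names, the `"seen_docs"` key, and the write-then-rename step are assumptions made here, not a description of the actual .state.json layout.

```python
import json
import os


def save_state(seen_docs, path):
    """Write seen_docs (source -> set of doc indices) to a JSON file."""
    payload = {"seen_docs": {src: sorted(docs) for src, docs in seen_docs.items()}}
    # Write to a temporary file first, then rename, so a crash mid-write
    # does not leave a truncated state file behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(payload, f)
    os.replace(tmp_path, path)


def load_state(path):
    """Read seen_docs back as sets; return an empty dict if no file exists."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return {src: set(docs) for src, docs in payload.get("seen_docs", {}).items()}
```

For a source with hundreds of thousands of documents, a list of indices grows large; a bitmap would be more compact, at the cost of a less readable state file.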