CUDA Reports Doc Indices

A CUDA Trainer Knows Which Document It Sampled

Each training step pulls a sequence from a .btok binary, which packs many documents end to end. The CUDA trainer reports the index of the sampled document alongside the loss, for example: step 47213, source=gutenberg, doc=128407, loss=2.81. A proxy collects these reports and maintains, per source, a set of unique doc indices seen so far.
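The collection step can be sketched in a few lines of Python. This is a minimal illustration, not the actual proxy code: the regular expression is modeled only on the single example line above, and the names `parse_report`, `record`, and `seen_docs` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed report format, inferred from the one example line in the text.
REPORT_RE = re.compile(
    r"step (?P<step>\d+), source=(?P<source>[\w.-]+), "
    r"doc=(?P<doc>\d+), loss=(?P<loss>\d+(?:\.\d+)?)"
)

def parse_report(line):
    """Return (step, source, doc, loss), or None if the line does not match."""
    m = REPORT_RE.search(line)
    if m is None:
        return None
    return int(m["step"]), m["source"], int(m["doc"]), float(m["loss"])

# One set of unique doc indices per source (hypothetical in-memory layout).
seen_docs = defaultdict(set)

def record(line):
    """Add the reported doc index to its source's set; ignore unparseable lines."""
    parsed = parse_report(line)
    if parsed is not None:
        _, source, doc, _ = parsed
        seen_docs[source].add(doc)

record("step 47213, source=gutenberg, doc=128407, loss=2.81")
record("step 47214, source=gutenberg, doc=128407, loss=2.79")  # same doc again
record("step 47215, source=gutenberg, doc=9, loss=3.02")
print(len(seen_docs["gutenberg"]))  # 2 unique docs from 3 reports
```

A set is a natural fit here because a repeated doc index must not raise the unique count.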


From Counts to Coverage

Coverage of a source = unique_docs_seen / n_docs. A few examples:


| Source | n_docs | Unique seen | Coverage |
|---|---|---|---|
| gutenberg | 512,000 | 154,000 | 30.1% |
| hermes3-general | 67,395 | 47,176 | 70.0% |
| dictionary | 88,000 | 88,000 | 100.0% |
| synthetic-chat | 1,400 | 1,400 | 100.0% |

Tiny sources saturate fast. Large sources can linger below 50% for weeks. Coverage bonus rewards a bandit for visiting documents it has not yet sampled within a source.
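As a quick check, the coverage column can be recomputed from the two count columns. The function name `coverage` is chosen for illustration; the numbers are the ones in the table above.

```python
def coverage(unique_docs_seen, n_docs):
    """Fraction of a source's documents sampled at least once."""
    if n_docs <= 0:
        raise ValueError("n_docs must be positive")
    return unique_docs_seen / n_docs

# (n_docs, unique docs seen), copied from the table above.
sources = {
    "gutenberg": (512_000, 154_000),
    "hermes3-general": (67_395, 47_176),
    "dictionary": (88_000, 88_000),
    "synthetic-chat": (1_400, 1_400),
}

for name, (n_docs, seen) in sources.items():
    print(f"{name:16s} {coverage(seen, n_docs):6.1%}")
```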


[Figure: coverage bonus per source]


Bonus Formula

Coverage bonus scales linearly from 1.3x at 0% coverage down to 1.0x at 50% coverage, then flat at 1.0x above 50%:


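# coverage = unique_docs_seen / n_docs, in [0.0, 1.0]
if coverage < 0.5:
    bonus = 1.0 + 0.3 * (1.0 - coverage / 0.5)  # linear ramp from 1.3x down to 1.0x
else:
    bonus = 1.0  # no bonus at or above 50% coverage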

A source at 0% coverage earns 1.3x; a source at 25% earns 1.15x; a source at 50% drops to 1.0x. Above 50%, no bonus applies.
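Wrapped in a function, the formula reproduces the reference points just quoted. The function name `coverage_bonus` and the input range check are additions for this sketch.

```python
def coverage_bonus(coverage):
    """Coverage bonus multiplier: 1.3x at 0% coverage, 1.0x at 50% and above."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie in [0.0, 1.0]")
    if coverage < 0.5:
        # Linear ramp between the two anchor points.
        return 1.0 + 0.3 * (1.0 - coverage / 0.5)
    return 1.0

for c in (0.0, 0.25, 0.5, 0.9):
    print(f"coverage={c:.0%} -> bonus={coverage_bonus(c):.2f}x")
```

The loop prints 1.30x, 1.15x, 1.00x, and 1.00x, matching the values stated above.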

Compute the Bonus

Consider a run with gutenberg coverage at 30% and hermes3-general coverage at 70%. Compute the coverage bonus multiplier for each source. Show your arithmetic.

Two Distinct Freshness Signals

Same Goal, Different Granularity

ANDREA has two mechanisms that prevent over-training on a single source. They sound similar; they measure different things.


Epoch penalty. Tracks aggregate over-pulling. When lifetime_pulls / n_docs > 1.0, a source has received enough pulls to have passed over every document at least once. Penalty = 1 / (1 + epochs). A 1.4K-document synthetic-chat source at 5,600 lifetime pulls (epochs = 5,600 / 1,400 = 4) earns penalty 1/5 = 0.2x. Epoch counts persist across restarts; they never decay.


Coverage bonus. Tracks per-document freshness within a source. CUDA reports doc indices; the proxy maintains a set per source. Sources below 50% coverage of unique docs earn up to 1.3x. Coverage rewards exploring a source's tail; epoch penalty punishes exhausting it.
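The epoch penalty described above can be sketched as follows. The text does not say whether partial epochs count, so this sketch assumes epochs means completed passes (integer division); that assumption keeps the penalty at 1.0x until a source has been pulled at least n_docs times. The function name `epoch_penalty` is hypothetical.

```python
def epoch_penalty(lifetime_pulls, n_docs):
    """Penalty = 1 / (1 + epochs), with epochs taken as completed passes.

    Assumption: epochs = lifetime_pulls // n_docs. The source text gives only
    a whole-number example, so fractional handling is a guess.
    """
    if n_docs <= 0:
        raise ValueError("n_docs must be positive")
    epochs = lifetime_pulls // n_docs
    return 1.0 / (1.0 + epochs)

print(epoch_penalty(5_600, 1_400))      # synthetic-chat example: 4 epochs
print(epoch_penalty(154_000, 512_000))  # under one pass, so no penalty
```

The two calls print 0.2 and 1.0. Note that the penalty depends only on the pull count, never on which documents were actually seen; that is what separates it from the coverage bonus.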


Why Both Matter


| Signal | Tracks | Direction | Cap | Persists across restarts |
|---|---|---|---|---|
| Epoch penalty | aggregate over-pulling | reduces | 1/(1+e) | yes |
| Coverage bonus | per-doc freshness | boosts | 1.3x | yes |

A 500K-document gutenberg source can stay below 50% coverage for the entire 200K-step training run while never approaching epoch=1. Epoch penalty ignores it; coverage bonus actively pulls a bandit toward gutenberg's unexplored 70% tail.


Conversely, a 1.4K synthetic-chat source saturates coverage (100%) within a few thousand pulls; coverage bonus stays at 1.0x while epoch penalty grows.
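The arithmetic behind both cases is short enough to check directly. The sketch assumes each pull reports exactly one doc index, so unique docs seen can never exceed the number of pulls; the helper names are illustrative.

```python
def pull_ratio(pulls, n_docs):
    """Aggregate passes over a source: lifetime pulls divided by document count."""
    return pulls / n_docs

def max_coverage(pulls, n_docs):
    """Upper bound on coverage when each pull reports exactly one doc index."""
    return min(pulls, n_docs) / n_docs

# Extreme case for gutenberg: every one of the 200K steps samples it.
print(pull_ratio(200_000, 500_000), max_coverage(200_000, 500_000))

# synthetic-chat at the 5,600 lifetime pulls used earlier in the text.
print(pull_ratio(5_600, 1_400), max_coverage(5_600, 1_400))
```

Even in the extreme case gutenberg reaches only 0.4 passes and at most 40% coverage, so it stays inside the bonus range and outside the penalty range for the whole run. synthetic-chat sits at 4 passes with coverage able to reach 100%, so only the epoch penalty can act on it.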

Distinguish the Two

Imagine two sources mid-training. Source A has 1,400 documents and 8,400 lifetime pulls. Source B has 500,000 documents and 80,000 lifetime pulls; the proxy has logged 75,000 unique doc indices for B so far. Which signal (epoch penalty or coverage bonus) governs each source's bandit weight, and why?

What Coverage Bonus Buys ANDREA

The Failure Mode It Prevents

Without doc-level tracking, a bandit selecting on per-step reward picks .btok sequences greedily. A 500K-document gutenberg corpus contains a few thousand sequences with low cross-entropy (consistent prose, common vocabulary). A reward-only bandit returns to those sequences repeatedly because they keep producing strong reward signals.


Result: a 500K-document corpus gets sampled across maybe 2K-5K distinct sequences over 200K training steps. The model memorizes those sequences without ever seeing the rest. Capacity wasted; coverage stuck at or below about 1% (2K-5K out of 500K).


What Coverage Bonus Buys

1.3x at 0% coverage, scaled down to 1.0x at 50%. That nudge propagates through UCB1 selection: arms with low coverage stay competitive even when their per-pull reward dips. The bandit explores the tail by design rather than by accident.
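One way the nudge could enter UCB1 is sketched below. It uses the standard UCB1 score (mean reward plus sqrt(2 ln N / n)) and assumes the bonus multiplies the raw reward before it is averaged into the arm. The exact wiring inside ANDREA is not given in the text, so the `Arm` class, the reward values, and the point where the bonus is applied are all assumptions.

```python
import math

class Arm:
    """Running statistics for one source (hypothetical structure)."""
    def __init__(self, name):
        self.name = name
        self.pulls = 0
        self.mean_reward = 0.0

    def update(self, reward, bonus=1.0):
        # Assumption: the coverage bonus scales the raw reward before averaging.
        self.pulls += 1
        self.mean_reward += (reward * bonus - self.mean_reward) / self.pulls

def ucb1_score(arm, total_pulls):
    """Standard UCB1: mean reward plus an exploration term."""
    if arm.pulls == 0:
        return math.inf  # untried arms are selected first
    return arm.mean_reward + math.sqrt(2.0 * math.log(total_pulls) / arm.pulls)

def select(arms):
    total = sum(a.pulls for a in arms)
    return max(arms, key=lambda a: ucb1_score(a, total))

def simulate(raw_rewards, bonuses, pulls_each=100):
    """Pull every arm pulls_each times at a fixed raw reward, then select once."""
    arms = [Arm(name) for name in raw_rewards]
    for arm in arms:
        for _ in range(pulls_each):
            arm.update(raw_rewards[arm.name], bonuses[arm.name])
    return select(arms).name

# Hypothetical raw rewards: gutenberg's per-pull reward has dipped slightly.
raw = {"gutenberg": 0.45, "synthetic-chat": 0.50}
print(simulate(raw, {"gutenberg": 1.0, "synthetic-chat": 1.0}))   # no bonus
print(simulate(raw, {"gutenberg": 1.15, "synthetic-chat": 1.0}))  # 25% coverage
```

With equal pull counts the exploration terms cancel, so the first call selects synthetic-chat and the second selects gutenberg: a 1.15x bonus lifts 0.45 to about 0.52, just enough to overtake 0.50.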


Across a 200K-step run on the 500K-doc gutenberg source, coverage bonus typically raises observed coverage from ~3% (no bonus) to ~25-30% (with bonus). Same compute, eight to ten times more documents touched.


Where the Tracking Lives


| Component | Responsibility |
|---|---|
| microgpt_cuda.cu | Reports doc index per training step |
| training_proxy.py | Maintains seen_docs set per source |
| training_proxy.py | Computes coverage, applies bonus to bandit reward |
| training_proxy.py | Persists seen_docs to .state.json across restarts |
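Persisting the per-source sets might look like the sketch below. The file schema (a top-level "seen_docs" key), the file name run.state.json, and the write-then-rename step are assumptions for illustration; the text says only that seen_docs is written to a .state.json file.

```python
import json
import os
import tempfile

def save_state(path, seen_docs):
    """Write per-source doc sets to JSON (hypothetical schema)."""
    # JSON has no set type, so each set is stored as a sorted list.
    payload = {"seen_docs": {src: sorted(docs) for src, docs in seen_docs.items()}}
    # Write to a temporary file, then rename, so a crash mid-write
    # leaves the previous state file intact.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(payload, f)
    os.replace(tmp_path, path)

def load_state(path):
    """Read the state file back into sets; a missing file means a fresh run."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return {src: set(docs) for src, docs in payload.get("seen_docs", {}).items()}

with tempfile.TemporaryDirectory() as workdir:
    state_path = os.path.join(workdir, "run.state.json")
    save_state(state_path, {"gutenberg": {128407, 9}})
    restored = load_state(state_path)

print(sorted(restored["gutenberg"]))
```

Restoring the sets on startup is what lets coverage resume from where it stopped instead of resetting to 0% and re-earning the full 1.3x bonus.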

Connect to a Concrete Engineering Choice

Suppose you removed coverage bonus from ANDREA-120M training. Predict one concrete consequence for the gutenberg arm specifically (which has 500K+ documents) over a 200K-step run. Reference coverage percentage, document diversity, or downstream sample quality.