Grow a Language Model: Phase-Based Dice Control

The Lock-In Problem

A Bandit That Keeps Winning

Vanilla UCB1 recomputes scores every step. Picks one arm. Pulls it. Updates n_k & mean_reward(k). Repeats. In a long training run with many sources, a single arm can collect a streak of high rewards, drive its mean up, & become near-impossible to beat. Other arms stagnate at low n_k with stale means. Lock-in.
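The per-step loop above can be sketched in a few lines. This is a minimal illustration of textbook UCB1, not ANDREA's actual code: the function names and the exploration constant c = 2 are assumptions made for this example.

```python
import math

def ucb1_scores(counts, means, c=2.0):
    """Textbook UCB1 score per arm: mean_reward(k) + sqrt(c * ln(t) / n_k).

    counts[k] is n_k, means[k] is mean_reward(k). The constant c is an
    assumption for this sketch, not a value taken from ANDREA.
    """
    t = sum(counts)
    scores = []
    for n_k, mean in zip(counts, means):
        if n_k == 0:
            scores.append(math.inf)  # unpulled arms are always tried first
        else:
            scores.append(mean + math.sqrt(c * math.log(t) / n_k))
    return scores

def update(counts, means, k, reward):
    """Update n_k and the running mean of arm k after one pull."""
    counts[k] += 1
    means[k] += (reward - means[k]) / counts[k]
```

The exploration bonus shrinks as n_k grows, so an arm with a high mean and many pulls can still outscore arms whose means were measured only a few times, early in training.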

Lock-in hurts ANDREA in two ways:

1. Diversity collapse. A model that trains 90% of steps on one source learns that source's stylistic tics. Generation samples drift toward repetitive patterns matching the dominant source.

2. Stale exploration. Arms with stale means cannot recover. An arm whose mean dropped early stays stuck at that mean even if the model has now grown enough capacity to extract reward from it.

A Phase Buys Time

Solution: hold a fixed set of focus arms for a phase (multiple steps) before re-evaluating. A phase of 14 steps means 14 forward passes hit the same focus arms. Mean rewards stabilize. Stochastic noise averages out. Then the bandit re-rolls.

Variable Phase Length

ANDREA picks phase length randomly from {7, 14, 21, 28, 42} steps at each phase boundary. Five values, uniform random. Short phases (7) react fast to bad picks; long phases (42) let stable focus sets exploit fully. The ceiling caps damage: at most 42 steps spent on a bad focus configuration before forced re-roll.
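A minimal sketch of the phase-length draw, assuming a plain uniform choice at every boundary. The names PHASE_LENGTHS and phase_boundaries are hypothetical & chosen for this example only.

```python
import random

# Phase lengths named in the lesson; drawn uniformly at each boundary.
PHASE_LENGTHS = (7, 14, 21, 28, 42)

def phase_boundaries(total_steps, rng):
    """Return the starting step of every phase in a run of total_steps."""
    starts = []
    step = 0
    while step < total_steps:
        starts.append(step)
        step += rng.choice(PHASE_LENGTHS)  # uniform over the five lengths
    return starts

rng = random.Random(0)  # seeded only to make the sketch reproducible
starts = phase_boundaries(1000, rng)
gaps = [b - a for a, b in zip(starts, starts[1:])]
print(max(gaps) <= 42)  # no phase ever exceeds the ceiling
```

Whatever the seed, no gap between consecutive boundaries can exceed 42, which is the ceiling the text refers to.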

Dice Phase Timeline

Phase Length Statistics

ANDREA picks phase length uniformly at random from {7, 14, 21, 28, 42}. Compute (a) the expected (average) phase length, (b) the maximum phase length, (c) over 1,000 phases, the expected total steps. Show your arithmetic.

1d3 (2-eye) & 1d4 (3-eye)

Dice Notation

Tabletop notation: NdM means roll N dice with M sides each. 1d3 rolls one 3-sided die, returning a value in {1, 2, 3}. 1d4 rolls one 4-sided die, returning a value in {1, 2, 3, 4}. ANDREA extends each die with an extra outcome 0 by convention: a roll of 0 means a fully random phase (no UCB focus arms).

2-Eye vs 3-Eye Configurations

ANDREA's training config picks one of two dice modes:

2-eye config (1d3). Possible focus arm counts: {0, 1, 2, 3}. Result 0 is reserved for a fully random phase.

3-eye config (1d4). Possible focus arm counts: {0, 1, 2, 3, 4}. Larger pools allow more concentrated phases.
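The two dice modes can be sketched as one roll over 0..M, assuming every outcome (including the extra 0) is equally likely. DICE_SIDES and roll_focus_count are hypothetical names for this example, not identifiers from ANDREA's config.

```python
import random

# Sides per dice mode, as described in the lesson.
DICE_SIDES = {"2-eye": 3, "3-eye": 4}

def roll_focus_count(sides, rng):
    """Roll for the number of focus arms.

    Returns an integer in 0..sides, with 0 meaning a fully random phase.
    Uniform probability over all outcomes is an assumption of this sketch.
    """
    return rng.randint(0, sides)  # randint is inclusive on both ends

rng = random.Random(0)
count = roll_focus_count(DICE_SIDES["2-eye"], rng)
print(0 <= count <= 3)
```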

Random First, UCB Second

Whatever the dice rolls, ANDREA fills focus slots in two passes:

1. Random arms first. Pick a fraction of focus slots uniformly at random from all available arms. This forces combinatorial variety every phase, regardless of UCB rankings.

2. UCB fills remaining slots. Compute UCB1 scores for arms not already chosen. Take top-ranked remaining arms until focus slot count fills.

Random-first matters. If UCB picked first, a streak-leader would always claim a slot. With random-first, even the best UCB arm can sit out a phase. Diversity stays guaranteed.
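The two passes can be sketched as follows. The lesson does not state what fraction of slots goes to random picks, so this example takes the number of random slots as an explicit parameter. The function name, the n_random parameter & the UCB1 constant are all assumptions.

```python
import math
import random

def pick_focus_arms(counts, means, n_slots, n_random, rng):
    """Fill n_slots focus slots: random arms first, then top UCB1 arms.

    counts[k] is n_k and means[k] is mean_reward(k). n_random is a
    hypothetical parameter; the lesson only says "a fraction" of slots.
    """
    arms = list(range(len(counts)))
    n_slots = min(n_slots, len(arms))
    n_random = min(n_random, n_slots)

    # Pass 1: uniform random picks from all available arms.
    focus = rng.sample(arms, n_random)

    # Pass 2: rank the arms not already chosen by textbook UCB1 score.
    t = max(1, sum(counts))

    def ucb1(k):
        if counts[k] == 0:
            return math.inf
        return means[k] + math.sqrt(2.0 * math.log(t) / counts[k])

    remaining = [k for k in arms if k not in focus]
    remaining.sort(key=ucb1, reverse=True)
    focus.extend(remaining[: n_slots - n_random])
    return focus
```

When n_random equals n_slots, pass 2 contributes nothing & the focus set is purely random, which is the only case in this sketch where the top UCB arm can sit out a phase.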

Pure Random Phases

When the dice roll 0, the entire focus set comes from random picks. UCB contributes nothing. With all outcomes equally likely, about 25% of phases (1d3, outcomes {0, 1, 2, 3}) or 20% of phases (1d4, outcomes {0, 1, 2, 3, 4}) land here. Pure random phases force the bandit to refresh its sample of low-pulled arms, keeping mean_reward estimates honest across the whole arm pool.

Dice Outcome Probabilities

Under 1d3 dice (2-eye config) with possible outcomes {0, 1, 2, 3} all equally likely, compute (a) probability of a fully random phase (dice=0), (b) probability of at least one UCB arm (dice >= 1), (c) over 100 phases, the expected count of fully random phases. Then under 1d4 (3-eye config), give (d) the probability of a fully random phase. Show your reasoning.

Capping the Damage

A Bad Phase Costs Up To 42 Steps

Suppose the UCB ranking picks a focus arm whose true mean is much lower than its observed mean. The phase locks that arm in. Reward stays low for the whole phase. How long until the bandit can correct?

Maximum phase length: 42 steps. After 42 steps, the phase ends, dice re-roll, focus arms re-shuffle. The bad pick cannot last longer than 42 forward passes.

Why 42 (& Not 100, & Not 1000)

Long phases let mean_reward estimates stabilize. Statistical theory: the variance of a mean of n samples shrinks as 1/n, so its standard error shrinks as 1/sqrt(n). Going from 7 samples to 42 samples gives 6x as many samples & a sqrt(6) approx 2.45x tighter standard error. After 42 samples, the standard error of mean_reward is about 1/sqrt(42), roughly 15% of the per-step reward standard deviation, so how close the estimate sits to its true value depends on reward variance.

Past 42 samples, the gain shrinks: 100 samples vs 42 samples = about 2.4x as many, sqrt(100/42) approx 1.54x tighter standard error. The marginal benefit drops while the cost of a bad lock-in grows. 42 steps balances the two.
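The square-root arithmetic can be checked directly. This sketch uses only the 1/sqrt(n) scaling of the standard error; se_ratio is a hypothetical helper name. The second ratio is computed from the unrounded 100/42.

```python
import math

def se_ratio(n_small, n_large):
    """Factor by which the standard error of a mean tightens when the
    sample count grows from n_small to n_large (1/sqrt(n) scaling)."""
    return math.sqrt(n_large / n_small)

print(round(se_ratio(7, 42), 2))       # 2.45
print(round(se_ratio(42, 100), 2))     # 1.54
print(round(1.0 / math.sqrt(42), 3))   # 0.154, in units of reward std dev
```

More than doubling the phase length past 42 steps buys a smaller tightening (about 1.54x) than the step from 7 to 42 did (about 2.45x).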

Diversity vs Convergence

Short phases (7 steps): reward estimates stay noisy, but bad picks cost little. Long phases (42 steps): estimates tight, but bad picks cost more. ANDREA mixes phase lengths uniformly so both regimes appear in every training run.

Btok Rebuild Cost

Each phase boundary triggers a btok file rebuild for the focus arms. Btok rebuild runs in a background thread; CUDA hot-reloads on mtime change. The rebuild takes seconds; phases must run long enough that rebuild overhead stays small. 42 steps at ANDREA-120M training speed comfortably exceeds rebuild time.

Reasoning About the Ceiling

ANDREA finished a 1,000-step training run. The bandit picked a bad focus arm at step 800. Without the 42-step ceiling, that bad arm could persist arbitrarily long. With the 42-step ceiling, what is the worst-case wasted-step count from step 800? Then explain in two sentences: (a) why a longer ceiling (e.g. 200 steps) would be worse, & (b) why a shorter ceiling (e.g. 7 steps always) would also be worse.

Coming Up Next

What You Have

Phase-based dice control wraps UCB1 in three protective rules: variable phase length (7-42 steps), random arms first, & dice-driven random phases (20-25% pure random when all dice outcomes are equally likely). The 42-step ceiling caps damage; the random phases prevent lock-in; the variable lengths mix reaction speed with estimate stability.

What Remains

Where does the reward signal that feeds UCB actually come from? Activity 78 (reward attribution) shows how CUDA reports per-source loss every step, how a per-source EMA tracks reward, & why ANDREA scales raw rewards by 1000x before feeding UCB1.

Floors & epoch penalties (activity 79) layer further protective rules on top of the bandit's output, ensuring tiny sources do not get starved & large sources do not get repeated to memorization.

Reference

ANDREA whitepaper, section 3.2.