un — Grow a Language Model: Reward Attribution & EMA

Exponential Moving Average

A Smoothed Recent Average

An exponential moving average (EMA) tracks a value by weighting recent samples more than old ones, with weights decaying exponentially. Formula:

EMA(t) = (1 - alpha) * EMA(t-1) + alpha * value(t)

Where alpha (the smoothing factor) sits in (0, 1). ANDREA uses alpha = 0.1 for per-source loss tracking.

Term by Term

- value(t): latest observation. For ANDREA, this is the loss reported by CUDA after a forward pass on a document from source k.

- EMA(t-1): previous EMA value for source k. Stored in proxy state.

- alpha = 0.1: each new loss contributes 10%; the rolling history contributes 90%.

- (1 - alpha) = 0.9: weight on history.
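
The update rule above fits in a few lines. A minimal sketch in Python; the function name `ema_update` is illustrative and not taken from ANDREA's code, and the numbers are made up.

```python
def ema_update(prev_ema: float, value: float, alpha: float = 0.1) -> float:
    """One EMA step: (1 - alpha) * previous EMA + alpha * new observation."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return (1.0 - alpha) * prev_ema + alpha * value


# Illustrative numbers: history at 5.0, new observation 4.0.
# 0.9 * 5.0 + 0.1 * 4.0 = 4.5 + 0.4 = 4.9
print(ema_update(5.0, 4.0))
```

With alpha = 0.1 a full point of change in the observation moves the EMA by only a tenth of a point, which is what makes the tracker slow.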

Why EMA Instead of Plain Mean

A plain running mean weights every step equally. Step 1 has the same weight as step 100,000. That works if the data are stationary. ANDREA's losses are NOT stationary: model capacity grows during training, so a source's loss at step 5,000 differs from its loss at step 50,000.

EMA solves this. Old loss values fade exponentially. The EMA reflects recent reality, not initial-conditions averaging.
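
The difference is easy to see on a loss sequence that shifts partway through. The sketch below uses synthetic, made-up losses (not ANDREA data) and assumes the EMA is seeded with the first value, which the lesson does not specify.

```python
def running_mean(values):
    """Plain mean: every sample gets equal weight."""
    return sum(values) / len(values)


def ema(values, alpha=0.1):
    """EMA over a sequence, seeded with the first value (an assumption)."""
    current = values[0]
    for v in values[1:]:
        current = (1.0 - alpha) * current + alpha * v
    return current


# Synthetic losses: 100 steps at 6.0, then 100 steps at 4.0.
losses = [6.0] * 100 + [4.0] * 100

print(running_mean(losses))  # 5.0: halfway between the old and new regime
print(ema(losses))           # very close to 4.0: the old values have faded
```

The plain mean still carries the first 100 steps at full weight, while the EMA has discounted them by a factor of 0.9 per step.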

Per-Source

ANDREA maintains one EMA per arm (per source). Sixteen arms = sixteen EMAs. Each step updates only the EMA of the source that was pulled. The other 15 EMAs stay frozen until their next pull.
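
One way to picture the per-source bookkeeping is a dictionary keyed by source name. This is a sketch only: the class name `PerSourceEMA`, the source names, and the choice to seed each EMA with its first observed loss are assumptions, not details from ANDREA.

```python
class PerSourceEMA:
    """One EMA per source; only the pulled source's entry changes."""

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha
        self.ema = {}  # source name -> current EMA

    def update(self, source: str, loss: float) -> float:
        if source not in self.ema:
            # Assumption: seed the EMA with the first observed loss.
            self.ema[source] = loss
        else:
            prev = self.ema[source]
            self.ema[source] = (1.0 - self.alpha) * prev + self.alpha * loss
        return self.ema[source]


tracker = PerSourceEMA()
tracker.update("source-a", 5.0)
tracker.update("source-b", 6.0)
tracker.update("source-a", 4.0)  # only source-a moves: 0.9 * 5.0 + 0.1 * 4.0
print(tracker.ema)               # source-b is still at 6.0
```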

Compute an EMA Step

Source k has EMA_k(t-1) = 4.521. CUDA reports a new loss for source k: loss_k(t) = 4.520. With alpha = 0.1, compute EMA_k(t). Show: (a) the (1-alpha) term, (b) (1-alpha) * EMA_k(t-1), (c) alpha * loss_k(t), (d) the sum. Round to 4 decimals where useful.

The Reward Formula

Reward = Improvement, Scaled

ANDREA defines per-step reward for arm k as:

reward_k = max(0, EMA_k(t-1) - loss_k(t)) * 1000

Three pieces:

1. EMA_k(t-1) - loss_k(t): improvement. If the new loss came in below the running average, the difference is positive: source k did better than expected.

2. max(0, ...): clip negative improvements to zero. If the new loss came in worse than the EMA, no reward (but no penalty either).

3. * 1000: scale up to make the signal comparable to the UCB exploration bonus.
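
The three pieces map directly onto a short function. A minimal sketch; the names `reward` and `REWARD_SCALE` are illustrative, and the input values are made up.

```python
REWARD_SCALE = 1000.0  # the * 1000 factor from the formula


def reward(prev_ema: float, loss: float, scale: float = REWARD_SCALE) -> float:
    """max(0, EMA_k(t-1) - loss_k(t)) * scale."""
    improvement = prev_ema - loss          # piece 1: improvement
    clipped = max(0.0, improvement)        # piece 2: no negative rewards
    return clipped * scale                 # piece 3: scale up


# Illustrative values, not taken from a real run.
print(reward(3.000, 2.998))  # improvement of 0.002 -> roughly 2.0
print(reward(3.000, 3.010))  # loss got worse -> 0.0
```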

[Figure: Reward Attribution Flow]

Why max(0, ...)

Negative rewards would push mean_reward(k) down, biasing UCB against arms whose losses fluctuated upward. But fluctuation is normal: a single hard document raises loss without meaning the source is bad. Clipping to zero treats fluctuation as 'no information' rather than 'penalty'.

Sources can earn zero reward repeatedly without sinking. Their UCB rank stays driven by exploration bonus (high when n_k is small) plus past wins.

What CUDA Reports

Each forward+backward pass, the CUDA kernel emits one record:

{source: 'hermes3-general', doc_index: 4231, loss: 4.520}

The proxy receives the record, looks up the EMA for that source, computes the reward against that pre-update EMA, then updates the EMA and feeds the reward into the bandit's mean_reward(k) accumulator.
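
The record-handling flow can be sketched as follows. Everything named here (`ArmState`, `handle_record`) is hypothetical. The sketch also assumes that mean_reward is a plain average over pulls and that the first record for a source seeds its EMA and earns no reward; the lesson states neither. The first record's `doc_index` and both loss values are made up.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ArmState:
    ema: Optional[float] = None  # EMA_k, None until the first record arrives
    pulls: int = 0               # n_k
    reward_sum: float = 0.0      # running total of rewards

    @property
    def mean_reward(self) -> float:
        return self.reward_sum / self.pulls if self.pulls else 0.0


def handle_record(arms: Dict[str, ArmState], record: dict,
                  alpha: float = 0.1, scale: float = 1000.0) -> float:
    arm = arms.setdefault(record["source"], ArmState())
    loss = record["loss"]
    if arm.ema is None:
        # Assumption: the first record seeds the EMA and earns no reward.
        arm.ema = loss
        r = 0.0
    else:
        # Reward uses the EMA from before this record; the EMA updates after.
        r = max(0.0, arm.ema - loss) * scale
        arm.ema = (1.0 - alpha) * arm.ema + alpha * loss
    arm.pulls += 1
    arm.reward_sum += r
    return r


arms: Dict[str, ArmState] = {}
handle_record(arms, {"source": "hermes3-general", "doc_index": 4230, "loss": 4.6})
r = handle_record(arms, {"source": "hermes3-general", "doc_index": 4231, "loss": 4.5})
print(r, arms["hermes3-general"])
```

The ordering matters: computing the reward after the EMA update would compare the loss against an average that already contains it, shrinking every improvement by a factor of (1 - alpha).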

Compute a Reward

Source k has EMA_k(t-1) = 4.521. CUDA reports loss_k(t) = 4.520. Compute reward_k step by step: (a) the difference EMA_k(t-1) - loss_k(t), (b) the clipped value max(0, difference), (c) the scaled reward (multiply by 1000). Then for a second case where loss_k(t) = 4.530 (worse than EMA), compute (d) the unclipped difference, & (e) the final reward after max(0, ...) clipping & scaling.

Matching Reward to Exploration Bonus

The Magnitude Problem

Per-step loss improvements run small. Loss falls from 4.521 to 4.520: difference 0.001. From 4.520 to 4.518: difference 0.002. Across an entire training run, raw differences live in roughly [0, 0.01].

Now look at UCB's exploration bonus at C=0.5, with N=1000 & n_k=20:

0.5 * sqrt(ln(1000) / 20) = 0.5 * sqrt(6.91 / 20) = 0.5 * 0.588 = 0.294

The bonus runs at 0.294. The raw reward runs at 0.001. The bonus is roughly 300x larger than the reward (0.294 / 0.001 = 294). UCB's argmax sorts almost entirely by the bonus; mean_reward provides essentially zero signal.

Result without scaling: ANDREA's bandit picks the arm with the smallest n_k every step. Mean_reward gets ignored. The bandit becomes a pure exploration policy.
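
The bonus arithmetic above can be checked with a few lines of Python. The function name `exploration_bonus` is illustrative; the formula is the one written out in this section.

```python
import math


def exploration_bonus(total_pulls: int, arm_pulls: int, c: float = 0.5) -> float:
    """C * sqrt(ln(N) / n_k), the bonus term as written in the lesson."""
    return c * math.sqrt(math.log(total_pulls) / arm_pulls)


bonus = exploration_bonus(1000, 20)
raw_reward = 0.001

print(round(bonus, 3))            # 0.294
print(round(bonus / raw_reward))  # 294
```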

The Fix: 1000x

Multiply raw reward by 1000. Now the reward sits at 1.0 (vs raw 0.001). Compare to the same exploration bonus 0.294:

scaled reward 1.0 / bonus 0.294 = 3.4, so the reward leads by about 3.4x

Now mean_reward dominates the UCB ranking. Exploration adds nuance to the tail (rare arms get a 0.3 boost), but the body of the ranking comes from observed reward.
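
A two-arm toy case shows the effect of scaling on the argmax. The sketch assumes the usual UCB form, score = mean_reward + exploration bonus, which the lesson implies but does not write out. The arm names and pull counts are hypothetical.

```python
import math


def ucb_score(mean_reward, arm_pulls, total_pulls, c=0.5):
    # Assumed UCB form: mean reward plus exploration bonus.
    return mean_reward + c * math.sqrt(math.log(total_pulls) / arm_pulls)


def pick_arm(arms, total_pulls):
    """arms maps name -> (mean_reward, n_k); returns the argmax of the UCB score."""
    return max(arms, key=lambda k: ucb_score(arms[k][0], arms[k][1], total_pulls))


# Hypothetical arms: "steady" keeps improving, "rare" has earned nothing yet.
unscaled = {"steady": (0.001, 200), "rare": (0.0, 20)}
scaled = {"steady": (1.0, 200), "rare": (0.0, 20)}

print(pick_arm(unscaled, 1000))  # rare: the bonus decides
print(pick_arm(scaled, 1000))    # steady: the reward decides
```

Without scaling, the arm with fewer pulls wins on bonus alone (about 0.294 against 0.094); with scaling, the observed reward of 1.0 outweighs that gap.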

Why 1000 (& Not 10, & Not 100,000)

Order of magnitude matching. Raw rewards run ~10^-3. The exploration bonus runs between 10^-1 and 10^0 (0.294 in the example above, larger for arms with few pulls). The gap is up to 10^3. Multiply raw reward by 10^3 to land in the same range as the bonus.

Scaling by 100x leaves reward at 0.1 (still less than 0.294 bonus -> exploration still dominates). Scaling by 100,000x lifts reward to 100 (now exploration cannot influence anything; UCB collapses to greedy mean_reward). 1000x sits in the working zone where both terms contribute.
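
The three candidate multipliers can be compared side by side. A small sketch using the same example numbers (raw reward 0.001, C=0.5, N=1000, n_k=20); the helper name `reward_to_bonus_ratio` is illustrative.

```python
import math

BONUS = 0.5 * math.sqrt(math.log(1000) / 20)  # about 0.294
RAW_REWARD = 0.001


def reward_to_bonus_ratio(scale: float) -> float:
    """How many times larger the scaled reward is than the exploration bonus."""
    return (RAW_REWARD * scale) / BONUS


for scale in (100, 1_000, 100_000):
    ratio = reward_to_bonus_ratio(scale)
    print(f"scale={scale:>7}: reward={RAW_REWARD * scale:g}, reward/bonus={ratio:.2f}")
```

A ratio well below 1 means exploration decides the ranking, a ratio in the hundreds means the bonus is noise, and a ratio of a few units leaves both terms visible in the score.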

Calibration, Not Theory

The 1000x factor is engineering calibration, not a theoretical constant. It depends on three things: training loss scale (cross-entropy on a vocabulary of 8K tokens runs near 4.5), per-step loss decay rate (slow), & UCB constant C=0.5. Change any of those, & 1000 might no longer be the right multiplier.

Reasoning About the Scaling Factor

Suppose ANDREA scaled raw rewards by 100x instead of 1000x. With raw reward 0.001 & UCB exploration bonus 0.294 (C=0.5, N=1000, n_k=20): (a) compute the scaled reward at 100x, (b) compute the scaled reward at 1000x, (c) for each scaling, identify which term (mean_reward or exploration bonus) dominates the UCB score for an arm whose mean_reward equals the scaled reward. (d) In one sentence, explain why 1000x balances the two.

Coming Up Next

What You Have

Reward attribution converts CUDA's loss reports into UCB-ready signal in three operations: per-source EMA tracks loss history (alpha=0.1, slow), reward = max(0, EMA - loss) clips negatives, & 1000x scaling matches reward magnitude to UCB exploration bonus magnitude.

What Remains

Floors & epoch penalties (activity 79) operate on top of UCB output. Source floors guarantee minimum sampling for priority sources regardless of UCB rank. Epoch penalties downweight sources that have been pulled more times than they have documents, preventing memorization of small datasets while large datasets stay fresh.

Reference

ANDREA whitepaper, section 3.3.