Exponential Moving Average
A Smoothed Recent Average
An exponential moving average (EMA) tracks a value by weighting recent samples more than old ones, with weights decaying exponentially. Formula:
EMA(t) = (1 - alpha) * EMA(t-1) + alpha * value(t)
Where alpha (the smoothing factor) sits in (0, 1). ANDREA uses alpha = 0.1 for per-source loss tracking.
Term by Term
- value(t): the latest observation. For ANDREA, this is the loss CUDA reports after a forward pass on a document from source k.
- EMA(t-1): previous EMA value for source k. Stored in proxy state.
- alpha = 0.1: each new loss contributes 10%; the rolling history contributes 90%.
- (1 - alpha) = 0.9: weight on history.
Why EMA Instead of Plain Mean
A plain running mean weights every step equally. Step 1 has the same weight as step 100,000. That works if the data are stationary. ANDREA's losses are NOT stationary: the model improves as training proceeds, so a source's loss at step 5,000 differs from its loss at step 50,000.
EMA solves this. Old loss values fade exponentially. The EMA reflects recent reality, not initial-conditions averaging.
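To make the contrast concrete, here is a minimal sketch on a synthetic, steadily falling loss curve. The data, the choice to seed the EMA with the first observation, and the variable names are illustrative assumptions, not taken from ANDREA.

```python
# Synthetic, non-stationary losses: a steady decline from 5.0 toward 4.0.
# These numbers are made up for illustration only.
losses = [5.0 - 0.001 * step for step in range(1000)]

ALPHA = 0.1  # smoothing factor quoted in the text

# Plain running mean: every step carries equal weight.
plain_mean = sum(losses) / len(losses)

# EMA, seeded with the first observation (an assumption; the source
# does not say how the EMA is initialised).
ema = losses[0]
for value in losses[1:]:
    ema = (1 - ALPHA) * ema + ALPHA * value

latest = losses[-1]
print(f"latest loss : {latest:.3f}")
print(f"plain mean  : {plain_mean:.3f}  (gap {plain_mean - latest:.3f})")
print(f"EMA         : {ema:.3f}  (gap {ema - latest:.3f})")
```

On this curve the plain mean sits about half a loss unit above the latest value, while the EMA trails it by roughly 0.01.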
Per-Source
ANDREA maintains one EMA per arm (per source). Sixteen arms = sixteen EMAs. Each step updates only the EMA of the source that was pulled. The other 15 EMAs stay frozen until their next pull.
Compute an EMA Step
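A single EMA step, and the per-source bookkeeping around it, might look like the sketch below. The function names, the dictionary used as proxy state, the source names, and the seeding of a new source with its first loss are all assumptions made for illustration.

```python
ALPHA = 0.1  # smoothing factor from the text


def ema_step(prev_ema: float, value: float, alpha: float = ALPHA) -> float:
    """Return EMA(t) given EMA(t-1) and the latest observation value(t)."""
    return (1 - alpha) * prev_ema + alpha * value


def update_source(emas: dict[str, float], source: str, loss: float) -> float:
    """Update only the pulled source's EMA; all other entries stay untouched.

    Seeding a new source with its first loss is an assumption of this sketch.
    """
    if source not in emas:
        emas[source] = loss
    else:
        emas[source] = ema_step(emas[source], loss)
    return emas[source]


# Hypothetical proxy state with two sources.
emas = {"source-a": 4.60, "source-b": 5.10}
update_source(emas, "source-a", 4.50)
print(emas)
```

Here source-a moves from 4.60 to about 4.59 (0.9 * 4.60 + 0.1 * 4.50), and source-b keeps its old value because it was not pulled.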
The Reward Formula
Reward = Improvement, Scaled
ANDREA defines per-step reward for arm k as:
reward_k = max(0, EMA_k(t-1) - loss_k(t)) * 1000
Three pieces:
1. EMA_k(t-1) - loss_k(t): improvement. If the new loss came in below the running average, the difference is positive: source k did better than expected.
2. max(0, ...): clip negative improvements to zero. If the new loss came in worse than the EMA, no reward (but no penalty either).
3. * 1000: scale up to make the signal comparable to the UCB exploration bonus.
Why max(0, ...)
Negative rewards would push mean_reward(k) down, biasing UCB against arms whose losses fluctuated upward. But fluctuation is normal: a single hard document raises loss without meaning the source is bad. Clipping to zero treats fluctuation as 'no information' rather than 'penalty'.
Sources can earn zero reward repeatedly without sinking. Their UCB rank stays driven by exploration bonus (high when n_k is small) plus past wins.
What CUDA Reports
Each forward+backward pass, the CUDA kernel emits one record:
{source: 'hermes3-general', doc_index: 4231, loss: 4.520}
The proxy receives the record, looks up the EMA for that source, computes the reward against that stored (pre-update) EMA, then updates the EMA and feeds the reward into the bandit's mean_reward(k) accumulator.
Compute a Reward
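The reward formula and the record-handling order described above can be sketched as follows. The function names, the dictionary standing in for proxy state, the stored EMA of 4.521, and the first-loss seeding are assumptions for illustration; only the formula, alpha, the 1000x factor, and the record shape come from the text.

```python
ALPHA = 0.1            # EMA smoothing factor from the text
REWARD_SCALE = 1000.0  # scaling factor from the text


def compute_reward(prev_ema: float, loss: float, scale: float = REWARD_SCALE) -> float:
    """reward_k = max(0, EMA_k(t-1) - loss_k(t)) * scale."""
    return max(0.0, prev_ema - loss) * scale


def handle_record(emas: dict[str, float], record: dict) -> float:
    """Process one loss record: reward from the old EMA, then update the EMA."""
    source, loss = record["source"], record["loss"]
    prev_ema = emas.get(source, loss)  # assumption: seed with the first loss
    reward = compute_reward(prev_ema, loss)
    emas[source] = (1 - ALPHA) * prev_ema + ALPHA * loss
    return reward


emas = {"hermes3-general": 4.521}  # hypothetical stored EMA
record = {"source": "hermes3-general", "doc_index": 4231, "loss": 4.520}
reward = handle_record(emas, record)
print(f"reward = {reward:.3f}")
```

The reward must be computed before the EMA is overwritten; otherwise the new loss would already be blended into the baseline it is compared against.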
Matching Reward to Exploration Bonus
The Magnitude Problem
Per-step loss improvements run small. Loss falls from 4.521 to 4.520: difference 0.001. From 4.520 to 4.518: difference 0.002. Across an entire training run, raw differences live in roughly [0, 0.01].
Now look at UCB's exploration bonus at C=0.5, with N=1000 & n_k=20:
0.5 * sqrt(ln(1000) / 20) = 0.5 * sqrt(6.91 / 20) = 0.5 * 0.588 = 0.294
The bonus runs at 0.294. The raw reward runs at 0.001. The bonus is roughly 300x the raw reward. UCB's argmax sorts almost entirely by the bonus; mean_reward provides essentially zero signal.
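The arithmetic above can be checked with a short sketch. The function name and argument names are illustrative; the formula is the C * sqrt(ln(N) / n_k) form used in the worked example.

```python
import math


def exploration_bonus(c: float, total_pulls: int, arm_pulls: int) -> float:
    """UCB exploration term in the form used above: C * sqrt(ln(N) / n_k)."""
    return c * math.sqrt(math.log(total_pulls) / arm_pulls)


bonus = exploration_bonus(c=0.5, total_pulls=1000, arm_pulls=20)
raw_reward = 0.001  # typical raw improvement quoted in the text

print(f"bonus        = {bonus:.3f}")
print(f"raw reward   = {raw_reward:.3f}")
print(f"bonus/reward = {bonus / raw_reward:.0f}x")
```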
Result without scaling: ANDREA's bandit picks the arm with the smallest n_k every step. Mean_reward gets ignored. The bandit becomes a pure exploration policy.
The Fix: 1000x
Multiply raw reward by 1000. Now the reward sits at 1.0 (vs raw 0.001). Compare to the same exploration bonus 0.294:
scaled reward 1.0 vs bonus 0.294 = reward leads by 3.4x
Now mean_reward dominates the UCB ranking. Exploration adds nuance to the tail (rare arms get a 0.3 boost), but the body of the ranking comes from observed reward.
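A toy comparison shows the effect on arm selection. The three arms, their mean raw rewards, and their pull counts are invented for illustration, and the score is assumed to be mean reward plus the exploration bonus in the form used above.

```python
import math

C = 0.5

# Hypothetical arms: (mean raw reward, pull count n_k).
arms = {"a": (0.003, 400), "b": (0.001, 400), "c": (0.0005, 200)}
total_pulls = sum(n for _, n in arms.values())


def ucb_pick(scale: float) -> str:
    """Return the arm with the highest mean_reward * scale + exploration bonus."""
    def score(name: str) -> float:
        mean_raw, n_k = arms[name]
        bonus = C * math.sqrt(math.log(total_pulls) / n_k)
        return mean_raw * scale + bonus
    return max(arms, key=score)


print("unscaled pick:", ucb_pick(scale=1))
print("1000x pick   :", ucb_pick(scale=1000))
```

Without scaling, the least-pulled arm "c" wins on bonus alone even though its reward is the lowest. With 1000x scaling, arm "a" wins on observed reward.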
Why 1000 (& Not 10, & Not 100,000)
Order of magnitude matching. Raw rewards run ~10^-3. Exploration bonus runs ~10^0. The gap is 10^3. Multiply raw reward by 10^3 to land in the same range as the bonus.
Scaling by 100x leaves reward at 0.1 (still less than 0.294 bonus -> exploration still dominates). Scaling by 100,000x lifts reward to 100 (now exploration cannot influence anything; UCB collapses to greedy mean_reward). 1000x sits in the working zone where both terms contribute.
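The three candidate multipliers can be compared side by side with a small sketch. It reuses the worked example's values (C=0.5, N=1000, n_k=20, raw reward 0.001); the variable names are illustrative.

```python
import math

C, N, N_K = 0.5, 1000, 20  # values used in the worked example
RAW_REWARD = 0.001         # typical raw improvement from the text

bonus = C * math.sqrt(math.log(N) / N_K)

# Ratio of scaled reward to exploration bonus for each candidate multiplier.
ratios = {}
for scale in (100, 1_000, 100_000):
    scaled = RAW_REWARD * scale
    ratios[scale] = scaled / bonus
    print(f"scale {scale:>7,}: reward {scaled:>7.1f}  reward/bonus {ratios[scale]:>6.1f}")
```

A ratio below 1 means the bonus still dominates, a ratio in the low single digits means both terms matter, and a ratio in the hundreds means the bonus has effectively no influence.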
Calibration, Not Theory
The 1000x factor is engineering calibration, not a theoretical constant. It depends on three things: training loss scale (cross-entropy on a vocabulary of 8K tokens runs near 4.5), per-step loss decay rate (slow), & UCB constant C=0.5. Change any of those, & 1000 might no longer be the right multiplier.
Reasoning About the Scaling Factor
Coming Up Next
What You Have
Reward attribution converts CUDA's loss reports into UCB-ready signal in three operations: per-source EMA tracks loss history (alpha=0.1, slow), reward = max(0, EMA - loss) clips negatives, & 1000x scaling matches reward magnitude to UCB exploration bonus magnitude.
What Remains
Floors & epoch penalties (activity 79) operate on top of UCB output. Source floors guarantee minimum sampling for priority sources regardless of UCB rank. Epoch penalties downweight sources that have been pulled more times than they have documents, preventing memorization of small datasets while large datasets stay fresh.
Reference
ANDREA whitepaper, section 3.3.