Arms, Pulls, Rewards
A Row of Slot Machines
Imagine K slot machines lined up. Each machine pays a different average reward, but you do not know any of the averages. Every step you pick one machine, pull its lever, & observe a payout. Your goal: maximize total reward across many pulls.
That setup defines a multi-armed bandit problem. K = number of arms. A pull picks one arm. The reward comes from that arm's hidden distribution. The mean_reward(k) of arm k tracks the running average across pulls of k.
Exploration vs Exploitation
Two pressures pull against each other:
Exploitation pulls the arm with the highest observed mean_reward. Greedy. Risk: an arm that looked great after 2 pulls might regress to a lower true mean.
Exploration pulls a less-tested arm to reduce uncertainty. Risk: time spent exploring costs reward you could have collected from a known good arm.
A good policy mixes both. Pure exploitation locks in early winners forever. Pure exploration never collects a payoff.
ANDREA's Arms = Data Sources
ANDREA frames training as a bandit. Each data source (hermes3-general, gutenberg, dictionary, synthetic-chat, smoltalk, etc.) acts as an arm. Each training step pulls one arm: load a document from that source, run a forward pass, observe loss reduction. Mean reward per source tracks how much that source improves loss on average.
Why this fits: model needs change during training. Early steps benefit from broad exposure (many sources). Later steps benefit from targeted polish (a few high-reward sources). A bandit adapts; a fixed sampling ratio cannot.
Naming the Pieces
The UCB1 Score
Auer, Cesa-Bianchi & Fischer, 2002
Peter Auer & colleagues published UCB1 in 2002 as a finite-time bandit policy with a regret bound of O(log N). UCB1 picks an arm by combining its current average reward with an exploration bonus that shrinks as the arm gets pulled.
The Formula
UCB(k) = mean_reward(k) + C * sqrt(ln(N) / n_k)
Term by term:
- mean_reward(k): average reward observed for arm k across its n_k pulls. Exploitation term.
- N: total pulls across all arms so far. Grows every step.
- n_k: pulls of arm k specifically. Grows only when k gets picked.
- C: exploration constant. Standard textbook value: 1.4 (sqrt(2)). ANDREA uses C=0.5.
- sqrt(ln(N) / n_k): confidence radius. A rarely-pulled arm has small n_k & gets a large bonus; a well-pulled arm has large n_k & gets a small bonus.
Why the Formula Works
ln(N) grows slowly (logarithmic), so the numerator climbs gently with total experience. n_k in the denominator shrinks the term as evidence accumulates for a specific arm. Square root softens the response, which keeps a single under-explored arm from dominating forever.
Picking by argmax_k UCB(k) every step balances both pressures automatically. No epsilon parameter, no schedule, no temperature. The math handles it.
Compute a UCB Score
Why C=0.5 Instead of 1.4
The Textbook Value
Auer, Cesa-Bianchi & Fischer derive a regret bound assuming rewards lie in [0, 1]. The bound holds for C = sqrt(2) approx 1.414. Most textbooks teach C=1.4 as a safe default.
ANDREA's Reward Magnitudes
ANDREA reports per-step loss improvements as reward. Raw improvements run around 0.001 (loss drops from 4.521 to 4.520). After the 1000x scaling factor (covered in activity 78, reward attribution), scaled rewards land near 1.0 in magnitude. Mean rewards across an arm's history settle into a 0.5 to 3.0 band.
Now consider the exploration bonus at C=1.4 with N=10000, n_k=100:
1.4 sqrt(ln(10000) / 100) = 1.4 sqrt(9.21 / 100) = 1.4 * 0.303 = 0.425
0.425 sits inside the reward band. Tolerable. Now n_k=10 (a rare arm):
1.4 sqrt(9.21 / 10) = 1.4 0.960 = 1.344
1.344 sits ABOVE most observed mean_rewards. Exploration dominates: the rare arm wins regardless of its actual mean. With many sources & long training runs, this floods the schedule with low-mean arms.
The Fix: C=0.5
0.5 sqrt(9.21 / 10) = 0.5 0.960 = 0.480
0.480 sits below most reward means. Exploration still happens (rare arms still get a bonus), but mean_reward leads. ANDREA exploits more, explores less, & avoids exploration dominance.
Tuning as Engineering, Not Theory
The textbook bound assumes rewards in [0, 1]. ANDREA's rewards live in roughly [0, 5]. Scaling the constant matches the constant to the reward magnitude. Same algorithm, calibrated environment.
Diagnose Exploration Dominance
Coming Up Next
What You Have
UCB1 picks the arm with the highest upper confidence bound. The bound combines mean_reward (exploit) with sqrt(ln(N) / n_k) scaled by C (explore). ANDREA tunes C=0.5 because per-step rewards sit in a 0.5 to 3.0 band, & the textbook 1.4 floods the schedule with low-mean arms.
What Remains
Vanilla UCB1 picks one arm at a time. ANDREA picks several focus arms per phase, mixes random arms first, & holds each phase for 7 to 42 steps. That structure (activity 77: dice phases) keeps UCB from overcommitting to early winners while still letting it concentrate effort.
Reward attribution (activity 78) shows where mean_reward(k) actually comes from. Floors & epoch penalties (activity 79) layer protective rules on top of UCB output.
Reference
- Auer, Cesa-Bianchi, Fischer (2002). "Finite-time Analysis of the Multiarmed Bandit Problem." Machine Learning, 47, 235-256.
- ANDREA whitepaper, sections 3.1 & 3.4.