English· Español· Deutsch· Nederlands· Français· 日本語· ქართული· 繁體中文· 简体中文· Português· Русский· العربية· हिन्दी· Italiano· 한국어· Polski· Svenska· Türkçe· Українська· Tiếng Việt· Bahasa Indonesia

un

guest
1 / ?
back to lessons

Arms, Pulls, Rewards

A Row of Slot Machines

Imagine K slot machines lined up. Each machine pays a different average reward, but you do not know any of the averages. Every step you pick one machine, pull its lever, & observe a payout. Your goal: maximize total reward across many pulls.


That setup defines a multi-armed bandit problem. K = number of arms. A pull picks one arm. The reward comes from that arm's hidden distribution. The mean_reward(k) of arm k tracks the running average across pulls of k.


Exploration vs Exploitation

Two pressures pull against each other:


Exploitation pulls the arm with the highest observed mean_reward. Greedy. Risk: an arm that looked great after 2 pulls might regress to a lower true mean.


Exploration pulls a less-tested arm to reduce uncertainty. Risk: time spent exploring costs reward you could have collected from a known good arm.


A good policy mixes both. Pure exploitation locks in early winners forever. Pure exploration never collects a payoff.


ANDREA's Arms = Data Sources

ANDREA frames training as a bandit. Each data source (hermes3-general, gutenberg, dictionary, synthetic-chat, smoltalk, etc.) acts as an arm. Each training step pulls one arm: load a document from that source, run a forward pass, observe loss reduction. Mean reward per source tracks how much that source improves loss on average.


Why this fits: model needs change during training. Early steps benefit from broad exposure (many sources). Later steps benefit from targeted polish (a few high-reward sources). A bandit adapts; a fixed sampling ratio cannot.

Naming the Pieces

ANDREA has 16 data sources. After 1,000 training steps, hermes3-general has been pulled 250 times with average loss reduction 1.8; gutenberg has been pulled 600 times with average loss reduction 1.2. Name (a) K, (b) the pull count for hermes3-general (call it n_k), (c) which source a pure-exploitation policy picks next, & (d) one risk of that pure-exploitation choice.

The UCB1 Score

Auer, Cesa-Bianchi & Fischer, 2002

Peter Auer & colleagues published UCB1 in 2002 as a finite-time bandit policy with a regret bound of O(log N). UCB1 picks an arm by combining its current average reward with an exploration bonus that shrinks as the arm gets pulled.


UCB1 Score Diagram


The Formula


UCB(k) = mean_reward(k) + C * sqrt(ln(N) / n_k)


Term by term:


- mean_reward(k): average reward observed for arm k across its n_k pulls. Exploitation term.

- N: total pulls across all arms so far. Grows every step.

- n_k: pulls of arm k specifically. Grows only when k gets picked.

- C: exploration constant. Standard textbook value: 1.4 (sqrt(2)). ANDREA uses C=0.5.

- sqrt(ln(N) / n_k): confidence radius. A rarely-pulled arm has small n_k & gets a large bonus; a well-pulled arm has large n_k & gets a small bonus.


Why the Formula Works

ln(N) grows slowly (logarithmic), so the numerator climbs gently with total experience. n_k in the denominator shrinks the term as evidence accumulates for a specific arm. Square root softens the response, which keeps a single under-explored arm from dominating forever.


Picking by argmax_k UCB(k) every step balances both pressures automatically. No epsilon parameter, no schedule, no temperature. The math handles it.

Compute a UCB Score

Given C=0.5, N=100 total pulls, an arm with n_k=5 pulls & mean_reward(k)=2.3, compute UCB(k) step by step. Show: (a) ln(100), (b) ln(N)/n_k, (c) sqrt of part (b), (d) C times part (c), (e) final UCB(k). Use ln(100) approx 4.605.

Why C=0.5 Instead of 1.4

The Textbook Value

Auer, Cesa-Bianchi & Fischer derive a regret bound assuming rewards lie in [0, 1]. The bound holds for C = sqrt(2) approx 1.414. Most textbooks teach C=1.4 as a safe default.


ANDREA's Reward Magnitudes

ANDREA reports per-step loss improvements as reward. Raw improvements run around 0.001 (loss drops from 4.521 to 4.520). After the 1000x scaling factor (covered in activity 78, reward attribution), scaled rewards land near 1.0 in magnitude. Mean rewards across an arm's history settle into a 0.5 to 3.0 band.


Now consider the exploration bonus at C=1.4 with N=10000, n_k=100:


1.4 sqrt(ln(10000) / 100) = 1.4 sqrt(9.21 / 100) = 1.4 * 0.303 = 0.425


0.425 sits inside the reward band. Tolerable. Now n_k=10 (a rare arm):


1.4 sqrt(9.21 / 10) = 1.4 0.960 = 1.344


1.344 sits ABOVE most observed mean_rewards. Exploration dominates: the rare arm wins regardless of its actual mean. With many sources & long training runs, this floods the schedule with low-mean arms.


The Fix: C=0.5

0.5 sqrt(9.21 / 10) = 0.5 0.960 = 0.480


0.480 sits below most reward means. Exploration still happens (rare arms still get a bonus), but mean_reward leads. ANDREA exploits more, explores less, & avoids exploration dominance.


Tuning as Engineering, Not Theory

The textbook bound assumes rewards in [0, 1]. ANDREA's rewards live in roughly [0, 5]. Scaling the constant matches the constant to the reward magnitude. Same algorithm, calibrated environment.

Diagnose Exploration Dominance

Your bandit picks the same rarely-pulled arm 8 phases in a row, even though its mean_reward (0.3) sits well below other arms (means around 1.5 to 2.0). Compute the exploration bonus at C=1.4 for this arm with N=5000 & n_k=4 (use ln(5000) approx 8.52). Then explain in one sentence why ANDREA's choice of C=0.5 would change the outcome. Show your arithmetic.

Coming Up Next

What You Have

UCB1 picks the arm with the highest upper confidence bound. The bound combines mean_reward (exploit) with sqrt(ln(N) / n_k) scaled by C (explore). ANDREA tunes C=0.5 because per-step rewards sit in a 0.5 to 3.0 band, & the textbook 1.4 floods the schedule with low-mean arms.


What Remains

Vanilla UCB1 picks one arm at a time. ANDREA picks several focus arms per phase, mixes random arms first, & holds each phase for 7 to 42 steps. That structure (activity 77: dice phases) keeps UCB from overcommitting to early winners while still letting it concentrate effort.


Reward attribution (activity 78) shows where mean_reward(k) actually comes from. Floors & epoch penalties (activity 79) layer protective rules on top of UCB output.


Reference

- Auer, Cesa-Bianchi, Fischer (2002). "Finite-time Analysis of the Multiarmed Bandit Problem." Machine Learning, 47, 235-256.

- ANDREA whitepaper, sections 3.1 & 3.4.