Why Floors Exist
A Bad Reward Streak Can Starve a Priority Source
ANDREA's bandit picks focus arms by UCB1 rank. UCB rank depends on mean_reward(k), which depends on observed loss improvements. A streak of high-loss documents from a priority source (say dictionary) can drag mean_reward(k) down. Now dictionary ranks low, gets few focus pulls, & its mean_reward(k) struggles to recover (few pulls = few fresh observations).
The same risk applies to any source that ANDREA's training designer wants in the mix regardless of short-term reward signal.
Floors as Minimum Weights
ANDREA's training config specifies a floor per source: a minimum sampling weight that the source receives no matter what the UCB output says. Floors run from 0.0 to 1.0. Examples:
hermes3-general floor = 0.8 (priority conversational source)
chat floor = 0.8
dictionary floor = 0.7 (factual recall scaffold)
gutenberg floor = 0.7 (prose coherence)
synthetic-chat floor = 0.0 (no floor; bandit decides freely)
How Floors Apply
After UCB1 ranks arms & dice control assembles focus sets, each source gets a tentative weight. Then floor enforcement runs:
final_weight_k = max(tentative_weight_k, floor_k)
If the bandit assigned weight 0.3 to hermes3-general but its floor is 0.8, the floor wins: final weight = 0.8. If the bandit assigned 0.9 instead, the bandit's weight stands. Floors override the bandit upward only; a floor never lowers a weight.
Different Configs, Different Floors
ANDREA ships several training configurations: chatbot, tool-caller, bash-commander. Each config sets different floors for its priority sources. Chatbot floors hermes3-general & chat high. Tool-caller floors repo-docstrings higher. Bash-commander floors repo-commits higher. Same algorithm, different priorities.
Apply a Floor
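A minimal sketch of floor enforcement, assuming tentative weights & floors are held in plain dictionaries keyed by source name. The function name apply_floors & the default of 0.0 for sources without a configured floor are illustrative assumptions, not ANDREA's actual code; the floor values are the examples listed above.

```python
from typing import Dict

# Floor values copied from the examples in this lesson.
FLOORS: Dict[str, float] = {
    "hermes3-general": 0.8,
    "chat": 0.8,
    "dictionary": 0.7,
    "gutenberg": 0.7,
    "synthetic-chat": 0.0,
}


def apply_floors(tentative: Dict[str, float],
                 floors: Dict[str, float]) -> Dict[str, float]:
    """Return final_weight_k = max(tentative_weight_k, floor_k) per source.

    Sources missing from `floors` get a floor of 0.0 (an assumption made
    for this sketch), so the bandit decides freely for them.
    """
    final: Dict[str, float] = {}
    for source, weight in tentative.items():
        floor = floors.get(source, 0.0)
        if not 0.0 <= floor <= 1.0:
            raise ValueError(f"floor for {source!r} must be in [0.0, 1.0]")
        final[source] = max(weight, floor)
    return final


# Hypothetical tentative weights coming out of the bandit.
tentative = {"hermes3-general": 0.3, "gutenberg": 0.9, "synthetic-chat": 0.2}
final = apply_floors(tentative, FLOORS)
print(final)
```

hermes3-general is lifted from 0.3 to its floor of 0.8, while gutenberg (0.9, above its 0.7 floor) & synthetic-chat (no floor) keep the bandit's weights.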
The Memorization Risk
Tiny Sources Get Memorized
ANDREA's data sources vary wildly in size. synthetic-chat has roughly 1,400 documents. gutenberg has 500,000+. If the bandit pulls evenly, synthetic-chat exhausts its document pool fast: after 1,400 pulls from synthetic-chat, assuming documents are not repeated within a pass, every document has been seen once. Pull 2,800 times & every document has been seen twice on average.
Repeated exposure to a small set of documents leads to memorization: the model stops learning generalizable patterns & starts reciting specific token sequences from the training data. Memorization is bad for two reasons: (1) it wastes capacity on rote recall instead of generalization, & (2) it can leak training data through generation.
Epochs As a Memorization Proxy
Define an epoch over source k as one full pass through all of k's documents:
epochs_k = floor(lifetime_pulls_k / n_docs_k)
If synthetic-chat (n_docs=1400) has been pulled 2,800 times, epochs = floor(2800/1400) = 2: the source has been seen twice through. If gutenberg (n_docs=500,000) has been pulled 100,000 times, epochs = floor(100000/500000) = 0: not yet a full pass.
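The epoch count above is plain integer division. A small sketch, with the function name epochs_for chosen here for illustration, reproduces both worked examples:

```python
def epochs_for(lifetime_pulls: int, n_docs: int) -> int:
    """epochs_k = floor(lifetime_pulls_k / n_docs_k), via integer division."""
    if n_docs <= 0:
        raise ValueError("n_docs must be positive")
    if lifetime_pulls < 0:
        raise ValueError("lifetime_pulls must be non-negative")
    return lifetime_pulls // n_docs


synthetic_chat_epochs = epochs_for(2_800, 1_400)    # two full passes
gutenberg_epochs = epochs_for(100_000, 500_000)     # not yet a full pass
print(synthetic_chat_epochs, gutenberg_epochs)
```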
The 1/(1+epochs) Penalty
When lifetime_pulls / n_docs > 1.0, ANDREA applies a multiplicative penalty (before the first full pass, epochs = 0 & the penalty is 1, so the weight is unchanged):
penalty = 1 / (1 + epochs)
final_weight = bandit_weight * penalty
Curve:
| epochs | penalty | weight kept |
|---|---|---|
| 0 | 1.000 | all |
| 1 | 0.500 | one half |
| 2 | 0.333 | one third |
| 3 | 0.250 | one quarter |
| 5 | 0.167 | one sixth |
| 10 | 0.091 | one eleventh |
The penalty factor shrinks with each completed pass, so the downweighting gets stronger. After many epochs, the source's weight approaches zero & the bandit naturally rests it.
Why Lifetime Pulls Persist Across Restarts
ANDREA's training runs span days. Crashes happen. Servers reboot. Configurations get tweaked & training resumes from a checkpoint. Lifetime pulls persist across all of these events: the proxy writes pull counts to disk continuously.
If pulls reset on each restart, a small source could effectively reset to epoch 0 every time training restarts. The penalty would never accumulate, & memorization would proceed regardless. Persisting lifetime pulls makes the penalty a real, monotone-growing constraint.
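One way to make pull counts survive a crash is to rewrite a small JSON file atomically after each pull. The sketch below is an assumption about how such persistence could look, not a description of ANDREA's proxy: the file name lifetime_pulls.json & the helper names are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict


def load_pulls(path: str) -> Dict[str, int]:
    """Read lifetime pull counts; a missing file means a fresh run."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return {str(k): int(v) for k, v in json.load(fh).items()}
    except FileNotFoundError:
        return {}


def save_pulls(path: str, pulls: Dict[str, int]) -> None:
    """Write counts to a temp file, then atomically swap it into place,
    so a crash mid-write never leaves a truncated state file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(pulls, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def record_pull(path: str, source: str) -> int:
    """Increment one source's lifetime count on disk; return the new count."""
    pulls = load_pulls(path)
    pulls[source] = pulls.get(source, 0) + 1
    save_pulls(path, pulls)
    return pulls[source]


with tempfile.TemporaryDirectory() as workdir:
    state_file = os.path.join(workdir, "lifetime_pulls.json")  # hypothetical name
    for _ in range(3):
        record_pull(state_file, "synthetic-chat")
    # Simulated restart: nothing is kept in memory, only the file survives.
    after_restart = load_pulls(state_file)
    resumed_count = record_pull(state_file, "synthetic-chat")
print(after_restart, resumed_count)
```

Because the count is reloaded from disk, the fourth pull continues from 3 instead of starting over at 1, which is what keeps the epoch count from resetting.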
Compute an Epoch Penalty
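A minimal sketch of the penalty, following the formulas in this lesson. The function names are illustrative, & the guard mirrors the stated condition lifetime_pulls / n_docs > 1.0, which means a source at exactly one full pass is left untouched in this sketch.

```python
def epoch_penalty(lifetime_pulls: int, n_docs: int) -> float:
    """penalty = 1 / (1 + epochs), applied only once pulls exceed n_docs."""
    if n_docs <= 0:
        raise ValueError("n_docs must be positive")
    if lifetime_pulls <= n_docs:
        # Ratio is not above 1.0, so no penalty applies yet.
        return 1.0
    completed_epochs = lifetime_pulls // n_docs
    return 1.0 / (1.0 + completed_epochs)


def penalized_weight(bandit_weight: float, lifetime_pulls: int,
                     n_docs: int) -> float:
    """final_weight = bandit_weight * penalty."""
    return bandit_weight * epoch_penalty(lifetime_pulls, n_docs)


# The two sources from the worked example above.
small_source_penalty = epoch_penalty(2_800, 1_400)       # epochs = 2
large_source_penalty = epoch_penalty(100_000, 500_000)   # epochs = 0
print(round(small_source_penalty, 3), large_source_penalty)

# A hypothetical bandit weight of 0.6 on the small source.
print(round(penalized_weight(0.6, 2_800, 1_400), 3))
```

After two passes the small source keeps one third of whatever weight the bandit gave it, while the large source is unaffected.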
Closing the Bandit Curriculum Stack
What You Have
Floors guarantee minimum sampling for priority sources: final_weight = max(bandit_weight, floor_k). Epoch penalties cap memorization on small sources: when lifetime_pulls/n_docs > 1, weight gets multiplied by 1/(1+epochs). Lifetime pulls persist across restarts so the penalty becomes a monotone constraint, not a resettable counter.
The Full Pipeline
Putting all four ANDREA bandit activities (76-79) together:
1. Activity 76 (UCB1). Each step computes UCB(k) = mean_reward(k) + 0.5 * sqrt(ln(N)/n_k). Argmax picks an arm.
2. Activity 77 (dice phases). Phase boundaries (every 7 to 42 steps) roll dice for focus arm count. Random arms first, UCB fills the rest. 25-33% of phases run fully random.
3. Activity 78 (reward attribution). CUDA reports loss; per-source EMA tracks history; reward = max(0, EMA - loss) * 1000. Scaled reward feeds mean_reward(k).
4. Activity 79 (floors & epochs, this lesson). After UCB output, floors lift priority sources; epoch penalties downweight memorized sources. Lifetime pulls persist.
Together: a bandit that adapts (UCB1), explores reliably (dice phases), gets honest reward signals (1000x scaling), respects training-design priorities (floors), & avoids memorization (epoch penalty).
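The two formulas quoted in steps 1 & 3 can be sketched directly. This is an illustration under assumptions rather than ANDREA's code: the function names are hypothetical, & returning infinity for an arm that has never been pulled is a common UCB1 convention that the lesson does not specify.

```python
import math


def ucb_score(mean_reward: float, total_pulls: int, arm_pulls: int,
              c: float = 0.5) -> float:
    """UCB(k) = mean_reward(k) + c * sqrt(ln(N) / n_k), with c = 0.5."""
    if arm_pulls == 0:
        # Assumption: an unpulled arm is tried first.
        return math.inf
    return mean_reward + c * math.sqrt(math.log(total_pulls) / arm_pulls)


def scaled_reward(ema_loss: float, loss: float, scale: float = 1000.0) -> float:
    """reward = max(0, EMA - loss) * 1000; a loss above the EMA earns 0."""
    return max(0.0, ema_loss - loss) * scale


# Same mean reward, different pull counts: the rarely pulled arm scores higher.
rare_arm = ucb_score(0.2, total_pulls=100, arm_pulls=5)
busy_arm = ucb_score(0.2, total_pulls=100, arm_pulls=50)
print(rare_arm > busy_arm)

improved = scaled_reward(ema_loss=2.5, loss=2.0)   # loss beat the EMA
regressed = scaled_reward(ema_loss=2.0, loss=2.5)  # loss worse than the EMA
print(improved, regressed)
```

Floors & epoch penalties then adjust the weights that come out of this ranking; the lesson does not state the order in which the two adjustments are applied, so the sketch stops before that step.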
Reference
ANDREA whitepaper, sections 3.5 & 3.6.