Grow a Language Model: Source Floors & Epoch Penalty

Why Floors Exist

A Bad Reward Streak Can Starve a Priority Source

ANDREA's bandit picks focus arms by UCB1 rank. UCB rank depends on mean_reward(k), which depends on observed loss improvements. A streak of high-loss documents from a priority source (say dictionary) can drag mean_reward(k) down. Now dictionary ranks low, gets few focus pulls, & its mean_reward(k) cannot recover (no pulls = no fresh observations).

The same risk applies to any source that ANDREA's training designer wants in the mix regardless of its short-term reward signal.

Floors as Minimum Weights

ANDREA's training config specifies a floor per source: a minimum sampling weight that the source receives no matter what UCB output says. Floors run from 0.0 to 1.0. Examples:

hermes3-general floor = 0.8 (priority conversational source)

chat floor = 0.8

dictionary floor = 0.7 (factual recall scaffold)

gutenberg floor = 0.7 (prose coherence)

synthetic-chat floor = 0.0 (no floor; bandit decides freely)

How Floors Apply

After UCB1 ranks arms & dice control assembles focus sets, each source gets a tentative weight. Then floor enforcement runs:

final_weight_k = max(tentative_weight_k, floor_k)

If the bandit assigned weight 0.3 to hermes3-general but its floor is 0.8, the floor wins: final weight = 0.8. If the bandit assigned 0.9 instead, the floor has no effect & the final weight stays 0.9. Floors override the bandit's weight upward only, never downward.
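The max rule above can be sketched in a few lines of Python. This is a minimal illustration, not ANDREA's actual code: the function name `apply_floors`, the dict layout, and the tentative weights are all hypothetical, and treating a source with no configured floor as floor 0.0 is an assumption of the sketch. The floor values are the examples listed earlier in this lesson.

```python
# Floors taken from the examples in this lesson.
FLOORS = {
    "hermes3-general": 0.8,
    "chat": 0.8,
    "dictionary": 0.7,
    "gutenberg": 0.7,
    "synthetic-chat": 0.0,
}


def apply_floors(tentative, floors):
    """Return final weights: max(tentative_weight_k, floor_k) per source.

    Assumption of this sketch: a source missing from `floors`
    is treated as having floor 0.0.
    """
    return {k: max(w, floors.get(k, 0.0)) for k, w in tentative.items()}


# Hypothetical tentative weights, for illustration only.
tentative = {"hermes3-general": 0.3, "chat": 0.9, "synthetic-chat": 0.25}
final = apply_floors(tentative, FLOORS)
print(final)  # {'hermes3-general': 0.8, 'chat': 0.9, 'synthetic-chat': 0.25}
```

Only hermes3-general is lifted here; chat already sits above its floor, & synthetic-chat has no floor, so both keep the bandit's weight.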

Floors & Epoch Penalty Layout

Different Configs, Different Floors

ANDREA ships several training configurations: chatbot, tool-caller, bash-commander. Each config sets different floors for its priority sources. Chatbot floors hermes3-general & chat high. Tool-caller floors repo-docstrings higher. Bash-commander floors repo-commits higher. Same algorithm, different priorities.

Apply a Floor

After UCB1 + dice control, the bandit assigns these tentative weights: hermes3-general 0.30, dictionary 0.55, gutenberg 0.85, synthetic-chat 0.40. Floors are: hermes3-general 0.80, dictionary 0.70, gutenberg 0.70, synthetic-chat 0.00. Compute the final weight for each source after floor enforcement. Then explain in one sentence which source had its weight lifted the most.

The Memorization Risk

Tiny Sources Get Memorized

ANDREA's data sources vary wildly in size. synthetic-chat has roughly 1,400 documents. gutenberg has 500,000+. If the bandit pulls evenly, synthetic-chat exhausts its document pool fast: after 1,400 pulls, assuming documents are drawn without repeats within a pass, every document has been seen once. Pull 2,800 times & every document has been seen twice on average.

Repeated exposure to a small set of documents leads to memorization: the model stops learning generalizable patterns & starts reciting specific token sequences from the training data. Memorization is bad for two reasons: (1) it wastes capacity on rote recall instead of generalization, & (2) it can leak training data through generation.

Epochs As a Memorization Proxy

Define an epoch over source k as one full pass through all of k's documents:

epochs_k = floor(lifetime_pulls_k / n_docs_k)

If synthetic-chat (n_docs=1400) has been pulled 2,800 times, epochs = floor(2800/1400) = 2: the source has been passed through twice. If gutenberg (n_docs=500,000) has been pulled 100,000 times, epochs = floor(100000/500000) = 0: not yet a full pass.
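The epoch count is a single integer division. The sketch below reproduces the two examples above; the function name `epochs` & the guard against a non-positive `n_docs` are assumptions added for illustration.

```python
def epochs(lifetime_pulls: int, n_docs: int) -> int:
    """Completed full passes over a source: floor(lifetime_pulls / n_docs)."""
    if n_docs <= 0:
        # Assumption of this sketch: an empty source is a config error.
        raise ValueError("n_docs must be positive")
    return lifetime_pulls // n_docs


print(epochs(2_800, 1_400))      # 2
print(epochs(100_000, 500_000))  # 0
```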

The 1/(1+epochs) Penalty

When lifetime_pulls / n_docs > 1.0, ANDREA applies a multiplicative penalty:

penalty = 1 / (1 + epochs)

final_weight = bandit_weight * penalty

Curve:

epochs	penalty	resulting weight (fraction of bandit weight)
0	1.000	unchanged
1	0.500	one half
2	0.333	one third
3	0.250	one quarter
5	0.167	one sixth
10	0.091	one eleventh

The penalty strengthens with each completed pass: the multiplier shrinks, so after many epochs the source's weight approaches zero & the bandit naturally rests it.
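The curve can be reproduced with a short Python sketch. The function names `epoch_penalty` & `penalized_weight` are hypothetical; only the formulas penalty = 1 / (1 + epochs) & final_weight = bandit_weight * penalty come from this lesson.

```python
def epoch_penalty(epochs: int) -> float:
    """Multiplicative penalty 1 / (1 + epochs)."""
    return 1.0 / (1.0 + epochs)


def penalized_weight(bandit_weight: float, epochs: int) -> float:
    """final_weight = bandit_weight * penalty."""
    return bandit_weight * epoch_penalty(epochs)


# Print the penalty for the epoch counts used in the curve.
for e in (0, 1, 2, 3, 5, 10):
    print(f"{e}\t{epoch_penalty(e):.3f}")
```

Because the multiplier is 1.0 at epochs = 0, a source that has not yet completed a pass keeps its bandit weight unchanged.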

Why Lifetime Pulls Persist Across Restarts

ANDREA's training runs span days. Crashes happen. Servers reboot. Configurations get tweaked & training resumes from a checkpoint. Lifetime pulls persist across all of these events: the proxy writes pull counts to disk continuously.

If pulls reset on each restart, a small source could effectively reset to epoch 0 every time training restarts. The penalty would never accumulate, & memorization would proceed regardless. Persisting lifetime pulls makes the penalty a real, monotone-growing constraint.
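One way to picture this persistence is a counter file that is rewritten after every pull & reloaded on startup. The sketch below is an assumption-heavy illustration: the JSON format, the file name `lifetime_pulls.json`, & the helper names are all hypothetical, since this lesson does not describe how ANDREA's proxy actually stores its counts.

```python
import json
import os
import tempfile


def load_pulls(path):
    """Load lifetime pull counts; a missing file means a fresh run."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_pulls(path, pulls):
    """Write to a temp file, then atomically replace the target,
    so a crash mid-write cannot leave a truncated counter file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(pulls, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def record_pull(path, source):
    """Increment one source's lifetime count & persist it immediately."""
    pulls = load_pulls(path)
    pulls[source] = pulls.get(source, 0) + 1
    save_pulls(path, pulls)
    return pulls[source]


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "lifetime_pulls.json")  # hypothetical file name
    for _ in range(3):
        record_pull(path, "synthetic-chat")
    # Simulated restart: nothing is kept in memory, counts come from disk.
    restored = load_pulls(path)
    print(restored)  # {'synthetic-chat': 3}
```

Rewriting the whole file on every pull is the simplest design for a sketch; a real system would more likely batch writes or append to a log.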

Compute an Epoch Penalty

Source `synthetic-chat` has n_docs = 1,400. After 4,200 lifetime pulls, compute (a) the epoch count, (b) the penalty 1/(1+epochs), (c) the final weight if the bandit weight is 1.0. Then for `gutenberg` with n_docs = 500,000 & 100,000 lifetime pulls, compute (d) lifetime_pulls/n_docs, & (e) whether the penalty applies (yes or no, with reason).

Closing the Bandit Curriculum Stack

What You Have

Floors guarantee minimum sampling for priority sources: final_weight = max(bandit_weight, floor_k). Epoch penalties cap memorization on small sources: when lifetime_pulls/n_docs > 1, weight gets multiplied by 1/(1+epochs). Lifetime pulls persist across restarts so the penalty becomes a monotone constraint, not a resettable counter.

The Full Pipeline

Putting all four ANDREA bandit activities (76-79) together:

1. Activity 76 (UCB1). Each step computes UCB(k) = mean_reward(k) + 0.5 * sqrt(ln(N)/n_k). Argmax picks an arm.

2. Activity 77 (dice phases). Phase boundaries (every 7 to 42 steps) roll dice for focus arm count. Random arms first, UCB fills the rest. 25-33% of phases run fully random.

3. Activity 78 (reward attribution). CUDA reports loss; per-source EMA tracks history; reward = max(0, EMA - loss) * 1000. Scaled reward feeds mean_reward(k).

4. Activity 79 (floors & epochs, this lesson). After UCB output, floors lift priority sources; epoch penalties downweight memorized sources. Lifetime pulls persist.

Together: a bandit that adapts (UCB1), explores reliably (dice phases), gets honest reward signals (1000x scaling), respects training-design priorities (floors), & avoids memorization (epoch penalty).
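The two formulas quoted in steps 1 & 3 can be written out as a minimal Python sketch. The function names, the example arm statistics, & the choice to give an unpulled arm an infinite score are assumptions for illustration; the lesson only states the formulas themselves.

```python
import math


def ucb_score(mean_reward, total_pulls, arm_pulls, c=0.5):
    """UCB(k) = mean_reward(k) + c * sqrt(ln(N) / n_k), with c = 0.5."""
    if arm_pulls == 0:
        # Assumption of this sketch: an unpulled arm ranks first.
        return math.inf
    return mean_reward + c * math.sqrt(math.log(total_pulls) / arm_pulls)


def scaled_reward(ema, loss, scale=1000.0):
    """reward = max(0, EMA - loss) * 1000; a loss above the EMA earns 0."""
    return max(0.0, ema - loss) * scale


# Hypothetical arm statistics: (mean_reward, pulls so far).
arms = {"dictionary": (0.2, 10), "gutenberg": (0.5, 40)}
total = sum(n for _, n in arms.values())

scores = {k: ucb_score(m, total, n) for k, (m, n) in arms.items()}
best = max(scores, key=scores.get)
print(best)  # gutenberg

print(scaled_reward(3.0, 2.5))   # 500.0
print(scaled_reward(2.5, 2.75))  # 0.0
```

In this example dictionary gets the larger exploration bonus because it has fewer pulls, but its low mean reward still leaves it ranked second, which is the situation floors are meant to guard against.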

Reference

ANDREA whitepaper, sections 3.5 & 3.6.