Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models

Pre-training large language models is expensive enough that even modest efficiency improvements can translate into meaningful cost and time savings. Nous Research is releasing Token Superposition Training (TST), a method that substantially reduces pre-training wall-clock time at fixed compute without touching the model architecture, optimizer, tokenizer, parallelism strategy, or training data.

At the 10B-A1B mixture-of-experts scale, TST reaches a lower final training loss than a matched-FLOPs baseline while consuming 4,768 B200-GPU-hours versus the baseline’s 12,311, roughly a 2.5x reduction in total pre-training time.

https://arxiv.org/pdf/2605.06546

The Problem TST Is Solving

Modern LLM pre-training is heavily data-driven. Recent training regimes routinely overtrain well beyond compute-optimal estimates, and raw text throughput, meaning how much data a model can process per FLOP, has become a key lever. Subword tokenizers like BPE already improve throughput by compressing sequences, and research suggests much of the BPE advantage over byte-level models comes simply from shorter sequences, which means the model sees more text per unit of compute.

TST asks whether that throughput lever can be pulled further during training, independently of the tokenizer and without permanently altering the model.

How TST Works: Two Phases

TST modifies the standard pre-training loop in two sequential phases:

Phase 1 (Superposition): For the first r fraction of total training steps (the paper finds r ∈ [0.2, 0.4] to be near optimal across tested scales), the model does not receive individual tokens. Instead, the input sequence of length L is segmented into non-overlapping bags of s contiguous tokens. In the embedding layer, each bag is collapsed into a single latent “s-token” by averaging the s token embeddings. The transformer then processes a sequence of length L/s.
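The folding-and-averaging step can be sketched in plain Python so it runs without a deep-learning framework. This is only an illustration under stated assumptions: the helper name `fold_and_average` and the toy embedding table are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def fold_and_average(
    token_ids: Sequence[int],
    embedding_table: Sequence[Sequence[float]],
    bag_size: int,
) -> List[List[float]]:
    """Collapse each bag of `bag_size` contiguous tokens into one averaged vector."""
    if bag_size < 1:
        raise ValueError("bag_size must be >= 1")
    if len(token_ids) % bag_size != 0:
        raise ValueError("sequence length must be divisible by bag_size")

    latents = []
    for start in range(0, len(token_ids), bag_size):
        bag = token_ids[start:start + bag_size]
        dim = len(embedding_table[bag[0]])
        # Element-wise mean of the token embeddings in this bag.
        mean_vec = [
            sum(embedding_table[tok][d] for tok in bag) / bag_size
            for d in range(dim)
        ]
        latents.append(mean_vec)
    return latents


# Toy vocabulary of 4 tokens with 2-dimensional embeddings (made-up values).
table = [[0.0, 0.0], [2.0, 4.0], [4.0, 0.0], [6.0, 2.0]]
latents = fold_and_average([1, 2, 3, 0, 2, 2], table, bag_size=2)
print(len(latents))  # 3 latent positions for a length-6 sequence
print(latents[0])    # [3.0, 2.0], the mean of the embeddings of tokens 1 and 2
```

A length-6 sequence with s = 2 becomes 3 latent positions, which is the L/s shortening described above.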

Crucially, each TST step is kept equal-FLOPs to a standard training step by increasing the data sequence length by s times during the superposition phase. Because each latent position corresponds to s source tokens, the model ingests s times as much text per unit of compute; this is what drives the throughput gain.

On the output side, each latent position predicts the next bag of s tokens rather than a single next token. The standard cross-entropy loss is replaced with a multi-hot cross-entropy (MCE) loss, which assigns equal probability mass 1/s to each token in the target bag. The MCE loss reduces to a simple mean of standard cross-entropy terms over the s targets, so it can be implemented using the existing fused CE kernels already present in any major pre-training library, without writing a new kernel or adding an auxiliary head.
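The claim that MCE reduces to a mean of ordinary cross-entropy terms can be checked numerically. The sketch below is a plain-Python illustration, not the paper's implementation. The function names are hypothetical, and letting a repeated token accumulate mass (2/s if it appears twice in the bag) is an assumption made so that the two formulations line up.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mce_as_mean_of_ce(logits: Sequence[float], target_bag: Sequence[int]) -> float:
    """Mean of standard CE terms, one per token in the target bag."""
    log_probs = log_softmax(logits)
    return sum(-log_probs[t] for t in target_bag) / len(target_bag)


def mce_multi_hot(logits: Sequence[float], target_bag: Sequence[int]) -> float:
    """Cross-entropy against a target distribution with mass 1/s per bag token."""
    log_probs = log_softmax(logits)
    s = len(target_bag)
    target_dist = [0.0] * len(logits)
    for t in target_bag:
        target_dist[t] += 1.0 / s  # assumption: repeated tokens accumulate mass
    return -sum(p * lp for p, lp in zip(target_dist, log_probs))


logits = [2.0, 0.5, -1.0, 0.0, 1.5]  # toy vocabulary of 5 tokens
bag = [0, 4, 4]                      # next bag with s = 3 (token 4 repeats)
a = mce_as_mean_of_ce(logits, bag)
b = mce_multi_hot(logits, bag)
print(math.isclose(a, b))  # True: the two formulations agree
```

Because the loss is a plain average of per-target CE terms, each term can be computed by whatever cross-entropy routine a training stack already provides.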

Phase 2 (Recovery): After the superposition phase, training resumes from the saved checkpoint with standard next-token prediction for the remaining 1 - r fraction of steps. The TST code is fully removed at this boundary to avoid any experimental contamination. A transient loss spike occurs at the transition, typically between 1 and 2 nats, which resolves within a few thousand steps. After that, the recovered model crosses below the equal-FLOPs baseline and stays there.
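The two-phase schedule amounts to a single step boundary. A minimal sketch follows, assuming 0-indexed steps and a boundary rounded to the nearest step. Both are illustrative choices, and `tst_phase` is a hypothetical helper rather than anything named in the paper.

```python
def tst_phase(step: int, total_steps: int, step_ratio: float) -> str:
    """Return which phase a 0-indexed training step falls in."""
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if not 0.0 <= step_ratio <= 1.0:
        raise ValueError("step_ratio must be in [0, 1]")
    # First step that runs standard next-token prediction (rounding is an assumption).
    boundary = round(step_ratio * total_steps)
    return "superposition" if step < boundary else "recovery"


total_steps, r = 20_000, 0.3
print(tst_phase(0, total_steps, r))       # superposition
print(tst_phase(5_999, total_steps, r))   # superposition
print(tst_phase(6_000, total_steps, r))   # recovery
```

With 20,000 steps and r = 0.3, the first 6,000 steps run in superposition mode and the remaining 14,000 run as ordinary next-token prediction.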

The model produced at the end of Phase 2 is architecturally identical to one produced by conventional pre-training, with the same next-token prediction inference behavior.

What the Experiments Show

TST was validated at four scales: 270M and 600M dense (SmolLM2 shapes adapted to the Llama3 modeling code, with the Llama3-8B tokenizer and untied input/output embeddings, which makes the 270M model equivalent in size to SmolLM2-135M and the 600M to SmolLM2-360M), 3B dense (SmolLM3 shape), and a 10B-A1B MoE in the Qwen3 family. Training used the DCLM dataset for the smaller runs and a 50/50 mix of DCLM and FineWeb-Edu for the MoE run. All runs used AdamW with the Warmup-Stable-Decay learning rate schedule and were run in TorchTitan under FSDP parallelism, on 64 NVIDIA B200 GPUs for the larger models and 8 B200 GPUs for the smaller ones.

At the 3B scale with bag size s = 6 and step ratio r = 0.3, TST at 20,000 steps reaches a final loss of 2.676, nearly matching a 36,000-step baseline at 2.677, while using 247 B200-GPU-hours versus 443. The 20k-step TST run scores 62.4 on HellaSwag and 66.3 on ARC-Easy, versus 62.3 and 65.9 for the 36k baseline.

At the 10B-A1B MoE scale with s = 16 and r ≈ 0.25, the TST run processes 2T data tokens and achieves a final loss of 2.236, below the baseline’s 2.252 after 1.05T tokens, while beating it on all four reported benchmarks: HellaSwag (71.2 vs. 70.1), ARC-Easy (74.2 vs. 73.8), ARC-Challenge (47.3 vs. 46.3), and MMLU (39.0 vs. 37.4).

The research team presents three comparison views against the baseline: equal-FLOPs, equal-loss, and equal-data. Under equal-FLOPs and equal-loss conditions, TST consistently wins. Under equal total token consumption, the baseline wins, because TST’s effective compute budget per data token is smaller. This is an important boundary condition that determines where TST applies.
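A back-of-the-envelope sketch helps show why the equal-data view differs from the equal-FLOPs view. It assumes every step costs the same compute and that a standard step reads a fixed number of tokens, so a Phase 1 step reads s times as many. `data_multiplier` is a hypothetical helper, not something defined in the paper.

```python
def data_multiplier(bag_size: int, step_ratio: float) -> float:
    """Tokens read by a TST run relative to a standard run with the same step count.

    A fraction `step_ratio` of steps reads `bag_size` times more text per step;
    the remaining steps read the usual amount.
    """
    if bag_size < 1:
        raise ValueError("bag_size must be >= 1")
    if not 0.0 <= step_ratio <= 1.0:
        raise ValueError("step_ratio must be in [0, 1]")
    return step_ratio * bag_size + (1.0 - step_ratio)


print(f"{data_multiplier(6, 0.3):.2f}")    # 2.50 for the 3B settings (s = 6, r = 0.3)
print(f"{data_multiplier(16, 0.25):.2f}")  # 4.75 for the 10B-A1B settings (s = 16, r = 0.25)
```

Under these assumptions, a TST run at s = 6 and r = 0.3 reads about 2.5 times the text of a standard run with the same step count. Matching the two runs on tokens instead of FLOPs therefore leaves TST with far fewer training steps, which is the boundary condition described above.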

Two Distinct Mechanisms

An ablation study isolates the input-side and output-side components. Both independently outperform the baseline; combining them produces further improvement without signs of interference. The authors interpret this as evidence that TST is two orthogonal mechanisms rather than a single trick.

The output-side mechanism, next-bag-of-tokens prediction, is conceptually related to multi-token prediction (MTP). Unlike MTP, which adds k independent prediction heads and extra parameters, TST keeps a single output head and replaces only the target. This makes it the least expensive member of a growing class of future-signal auxiliary objectives. Unlike MTP, it shows consistent gains across all tested scales, including small models where MTP has been shown to degrade performance.

The input-side mechanism has no direct analog in the existing pre-training literature. The research team offers two plausible explanations: it may implicitly regularize the embedding geometry (since many random s-grams of tokens must remain linearly separable once averaged), or it may act as a form of pre-pre-training, exposing the model to a coarser version of the real data before fine-resolution language modeling begins.

A targeted ablation directly tests what happens when representation continuity is broken. The research team runs a 3B TST experiment where the input embedding and output LM head are randomly re-initialized at the start of Phase 2. The result: final loss jumps to 2.938, worse than both the TST run (2.676) and the standard baseline (2.808). The Phase 1 TST steps contributed nothing to the final model. This confirms that shared representations across both phases are not incidental to TST’s success; they are what makes it work.

Marktechpost’s Visible Explainer

Token Superposition Training: Practical Guide
arXiv 2605.06546

01 / Overview

What Is Token Superposition Training?

Token Superposition Training (TST) is a two-phase pre-training method from Nous Research that increases token throughput per FLOP without changing the model architecture, optimizer, tokenizer, parallelism, or training data.

The core idea: Instead of feeding one token at a time, average s contiguous token embeddings into one “s-token,” train on that for the first r fraction of steps, then switch back to standard next-token prediction. The final model is architecturally identical to one trained normally.

Phase 1 (Superposition): model reads bags of s tokens, predicts the next bag
Phase 2 (Recovery): standard next-token prediction resumes from the checkpoint
Inference: completely unchanged; no new heads, no new parameters
Validated at 270M, 600M, 3B dense and 10B–A1B MoE

TST trades compute efficiency for higher data consumption. Best suited to compute-bound pre-training, not data-bound.

02 / Phase 1

Phase 1: The Superposition Phase

For the first r fraction of total training steps, the input sequence of length L is split into non-overlapping bags of s contiguous tokens. Their embeddings are averaged into a single latent s-token. The transformer processes a sequence of length L/s, but each position corresponds to s real tokens, so throughput is s× higher at the same FLOPs.

Equal-FLOPs trick: To keep each step equal-FLOPs to baseline, the data sequence length is increased by s×, not the batch size. Every TST step costs the same compute as a standard step.

On the output side, the loss target shifts from a single next token to the next bag of s tokens. The multi-hot cross-entropy (MCE) loss assigns equal probability mass 1/s to each token in the target bag:

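# L_MCE = mean of s standard CE terms
# Assumed shapes: pred holds logits per latent position; labels holds the
# next-bag targets with the bag dimension last.
loss = 0.0
for i in range(superposition_bag_size):
    target = labels[..., i].flatten(0, 1)
    loss += torch.nn.functional.cross_entropy(pred, target)
loss = loss / superposition_bag_size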

No new kernel needed: this reuses the existing fused CE kernel in your pre-training library.

03 / Phase 2

Phase 2: The Recovery Phase

After r × total_steps of superposition training, resume from the checkpoint with the TST code fully removed. Standard next-token prediction runs for the remaining (1 - r) × total_steps.

What happens at the switch: A loss spike of 1–2 nats occurs at the phase boundary. It resolves within a few thousand steps. After that, the model crosses below the equal-FLOPs baseline and stays there.

Remove TST code fully; don’t keep it as an auxiliary loss during Phase 2
Do not re-initialize the input embedding or LM head at the boundary
Shared representations across both phases are what make TST work

Re-initializing the embedding or LM head at the phase boundary completely breaks TST. In a 3B ablation, this raised final loss from 2.676 to 2.938, worse than the 2.808 baseline. The Phase 1 steps contributed nothing.

04 / Implementation

PyTorch Implementation

Three changes to the standard training loop: input folding, averaged embedding lookup, and MCE loss.

# 1. Input folding (inside train loop)
if superposition_bag_size is not None and superposition_bag_size > 1:
    bs, seq = inputs.shape
    inputs = inputs.reshape(
        bs, seq // superposition_bag_size, superposition_bag_size
    )

# 2. Averaged embedding lookup (inside model forward)
if len(tokens.shape) == 3:
    bs, sp_seq, superposition_bag_size = tokens.shape
    first = self.tok_embeddings(tokens[..., 0])
    h_dtype = first.dtype  # remember the training dtype before upcasting
    h = first.float()
    for i in range(1, superposition_bag_size):
        h = h + self.tok_embeddings(tokens[..., i]).float()
    h = (h / superposition_bag_size).to(h_dtype)
else:
    h = self.tok_embeddings(tokens)

Note: Sum in float32 for numerical precision, then cast back to the training dtype. The embedding layer is the only forward-pass change.

05 / Hyperparameters

Tuning Bag Size `s` and Step Ratio `r`

Two hyperparameters control TST. Both have well-defined practical ranges validated across model scales.

Step Ratio r
0.2–0.4
Fraction of total steps run in superposition mode. Robust across all tested scales. Below 0.2, throughput gain is too small. Above 0.5, Phase 2 cannot fully recover.

Bag Size s
3–16
U-shaped optimum that shifts with model size. Start in the flat basin; overshooting makes the bag target too lossy to recover from.

Model Size	Recommended s	Recommended r
270M	3–8	0.2–0.4
600M	6–10	0.2–0.4
3B	6 (tested)	0.3 (tested)
10B–A1B MoE	16 (tested)	∼0.25 (tested)

Large bag sizes (s ≥ 8): Switch from uniform MCE loss weighting to power-law weighting (1/i per position). Motivated by mutual information between token pairs decaying as a power law with distance (fitted exponent k ≈ −1.25 on DCLM).
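The difference between uniform and power-law weighting can be sketched in a few lines. The code below is illustrative only. Normalizing the weights so they sum to 1 is an assumption made here, and the `exponent` parameter is a hypothetical knob, with `exponent=1.0` giving the 1/i weighting mentioned above and `exponent=0.0` recovering the uniform 1/s case.

```python
import math
from typing import List, Sequence


def bag_position_weights(bag_size: int, exponent: float = 1.0) -> List[float]:
    """Normalized weights proportional to 1 / i**exponent for positions i = 1..bag_size."""
    if bag_size < 1:
        raise ValueError("bag_size must be >= 1")
    raw = [1.0 / (i ** exponent) for i in range(1, bag_size + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_bag_loss(per_position_ce: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of per-position cross-entropy terms for one target bag."""
    if len(per_position_ce) != len(weights):
        raise ValueError("need exactly one weight per bag position")
    return sum(w * ce for w, ce in zip(weights, per_position_ce))


uniform = bag_position_weights(4, exponent=0.0)  # uniform case: 1/s per position
power_law = bag_position_weights(4)              # proportional to 1, 1/2, 1/3, 1/4
print(uniform)                                   # [0.25, 0.25, 0.25, 0.25]
print([round(w, 2) for w in power_law])          # [0.48, 0.24, 0.16, 0.12]
```

Nearer positions in the bag receive more weight, so the loss leans on the targets that are most predictable from the current context.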

06 / Negative Results

What Doesn’t Work

The paper documents several variants that were tested and failed. Save yourself the compute.

Positional encodings before averaging: adding RoPE or sinusoidal encodings to tokens before the mean consistently hurt performance. Within-bag permutation invariance appears to be a feature, not a bug.
RoPE rescaling at phase transition: accelerated early Phase 2 recovery but typically raised final loss. Leave RoPE unchanged across the boundary.
s independent heads: replacing the single MCE head with s separate heads predicting s positions gave no consistent gain at higher parameter cost and implementation complexity.
Binary cross-entropy / hinge loss: both significantly underperformed the MCE formulation and even fell below the baseline.
Keeping TST head in Phase 2: not yet benchmarked but identified as future work; don’t assume it helps.

Bottom line: The simplest version works best, with mean embeddings in, mean CE loss out, a hard switch at the phase boundary, and no extra parameters.

07 / Results

Key Results & When to Use TST

At equal wall-clock (same compute, lower loss):

Scale	B200-hrs	TST Loss	Baseline Loss
3B dense	247	2.676	2.808
10B–A1B MoE	4,768	2.236	2.252 (@ 12,311 hrs)

At equal final loss (wall-clock saved):

Scale	TST (B200-hrs)	Baseline (B200-hrs)	Speedup
3B dense	247	443	∼1.8×
10B–A1B MoE	4,768	12,311	∼2.5×
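The speedup column follows directly from the GPU-hour figures. A small sketch of the arithmetic, using only the numbers in the table above (`speedup` is a hypothetical helper name):

```python
def speedup(baseline_hours: float, tst_hours: float) -> float:
    """Wall-clock speedup as the ratio of baseline GPU-hours to TST GPU-hours."""
    if tst_hours <= 0:
        raise ValueError("tst_hours must be positive")
    return baseline_hours / tst_hours


# (baseline B200-GPU-hours, TST B200-GPU-hours) as reported in the table.
runs = {
    "3B dense": (443, 247),
    "10B-A1B MoE": (12_311, 4_768),
}
for name, (baseline_hours, tst_hours) in runs.items():
    print(f"{name}: {speedup(baseline_hours, tst_hours):.2f}x")
# 3B dense: 1.79x
# 10B-A1B MoE: 2.58x
```

The raw ratios are about 1.79x and 2.58x, consistent with the approximate ∼1.8× and ∼2.5× figures quoted in the table.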

Use TST when
✓ You are compute-bound
✓ You have ample data
✓ You want lower loss at the same FLOPs
✓ You need the same inference model

Avoid TST when
✕ Data is the bottleneck (TST uses s× more tokens in Phase 1)
✕ You compare at equal token consumption
✕ Under equal-data conditions, baseline wins

Paper: arXiv 2605.06546 • nousresearch.com/token-superposition

Key Takeaways

Nous Research’s Token Superposition Training (TST) cuts LLM pre-training time by up to 2.5x at matched FLOPs, with no architecture, tokenizer, or optimizer changes required.
Phase 1 averages contiguous token embeddings into bags and predicts the next bag via multi-hot cross-entropy; Phase 2 reverts to standard next-token prediction from the same checkpoint.
Validated at 270M, 600M, 3B dense, and 10B-A1B MoE: TST beats the baseline on loss and downstream evals (HellaSwag, ARC, MMLU) across all scales.
Optimal hyperparameters: bag size s ∈ [3, 8] for smaller models, step ratio r ∈ [0.2, 0.4]; shared embeddings across both phases are essential, since re-initializing them makes TST worse than the baseline.
Trade-off: TST consumes more raw data tokens per compute budget, so it is best suited to compute-bound training; the output-only variant is the option for data-bound settings.
