Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4–1.7× Pretraining Speedup at Long Context

By Naveed Ahmad · 17/05/2026 · 21 Mins Read


Training large language models on long sequences has a well-known problem: attention is expensive. The scaled dot-product attention (SDPA) at the core of every transformer scales quadratically, Θ(N²), in both compute and memory with sequence length N. FlashAttention addressed this through IO-aware tiling that avoids materializing the full N×N attention matrix in high-bandwidth memory, reducing the memory footprint considerably, but the underlying Θ(N²) compute scaling remains. Researchers at Nous Research have introduced a new method called Lighthouse Attention that addresses this bottleneck specifically at pretraining time, achieving a 1.40× to 1.69× end-to-end wall-clock speedup against a cuDNN-backed SDPA baseline, with matching or lower final training loss.

The core problem with existing sparse attention methods

To understand why Lighthouse works the way it does, it helps to know what existing sparse attention methods do. Most prior work, such as NSA, HISA, DSA, and MoBA, makes the same two design decisions. First, they pool only the key and value side while leaving queries at full resolution (asymmetric compression). Second, their selection logic lives inside a custom attention kernel, which means teams can't reuse the optimized dense-attention kernels that modern GPU tensor cores are built around.

There is also a concern specific to training that inference-only sparse methods don't face. An inference-time sparse method is evaluated only against its dense backbone and is at most as good as that backbone. A training-time sparse method faces a harder test: once training is done, will the resulting weights still produce a dependable dense-attention model at inference? Lighthouse treats that question as its central correctness criterion.

Lighthouse takes a different approach on both design decisions. It pools queries, keys, and values symmetrically across a multi-level pyramid, and it places selection entirely outside the attention kernel. After selection, the system gathers the selected entries into a contiguous, dense sub-sequence and runs stock FlashAttention on it: the same kernel used by the dense baseline.

    https://arxiv.org/pdf/2605.06554

    How the four-stage pipeline works

A Lighthouse attention layer wraps around, but does not modify, scaled dot-product attention. The pipeline has four stages.

In the first stage, average pooling constructs an L-level pyramid from Q, K, and V. With pooling factor p, level ℓ of the pyramid has N/p^ℓ tokens, each summarizing p^ℓ base positions. Crucially, the same pooling applies to all three projections, producing coherent (Q^(ℓ), K^(ℓ), V^(ℓ)) triples at every level. Total pyramid construction costs Θ(N) time and memory.
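
To make this stage concrete, here is a minimal PyTorch sketch of symmetric pyramid pooling, assuming [batch, heads, seq, dim] tensors; build_pyramid is an illustrative name, not the paper's actual API.

import torch
import torch.nn.functional as F

def build_pyramid(q, k, v, levels=3, p=4):
    """Average-pool Q, K, V symmetrically into an L-level pyramid.
    Level 0 is the base sequence; level l has N / p**l tokens,
    each summarizing p**l base positions. Total cost: Theta(N)."""
    pyramid = [(q, k, v)]
    for _ in range(1, levels):
        q, k, v = [
            F.avg_pool1d(t.flatten(0, 1).transpose(1, 2), kernel_size=p)
            .transpose(1, 2)
            .unflatten(0, t.shape[:2])
            for t in (q, k, v)
        ]
        pyramid.append((q, k, v))
    return pyramid

B, H, N, d = 1, 8, 4096, 128
q, k, v = (torch.randn(B, H, N, d) for _ in range(3))
pyr = build_pyramid(q, k, v)
print([lvl[0].shape[2] for lvl in pyr])  # [4096, 1024, 256]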

In the second stage, a parameter-free scorer assigns every pyramid entry two scalar scores using per-head ℓ₂ norms: one as a query score (∥Q^(ℓ)_i∥₂) and one as a key score (∥K^(ℓ)_i∥₂). Coarser levels inherit scores from finer ones via max-pooling, so a coarse span picks up the importance of its strongest token. A fused chunked-bitonic top-K kernel then selects k entries jointly across all pyramid levels. One design detail worth noting: the coarsest pyramid level is always retained in full; it is cheap and guarantees at least one contributor at every base position, and the remaining selection budget is spent on finer levels. Additionally, the chunked-bitonic design produces a stratified top-K rather than a strict global top-K: the score stream is partitioned into fixed-size chunks, each maintaining an in-register top-m buffer, so if the k globally highest-scoring entries clustered in a single chunk, some would get replaced by lower-scoring entries from other chunks. The result is more balanced attention coverage across the sequence, avoiding selection collapse onto a narrow span.
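
The following is a plain-PyTorch emulation of the stratified selection, assuming per-head scores over a flattened stream of pyramid entries; the real implementation is a fused chunked-bitonic kernel, and stratified_topk, the chunk size, and the shapes here are illustrative.

import torch

def stratified_topk(scores, k, chunk_size=1024):
    """Stratified top-k: take the top-(k / num_chunks) entries within each
    fixed-size chunk instead of a strict global top-k, so every region of
    the score stream contributes entries."""
    n = scores.shape[-1]
    num_chunks = n // chunk_size
    m = k // num_chunks                           # per-chunk budget
    chunked = scores[..., : num_chunks * chunk_size].reshape(
        *scores.shape[:-1], num_chunks, chunk_size)
    local_idx = chunked.topk(m, dim=-1).indices   # [..., num_chunks, m]
    offsets = torch.arange(num_chunks, device=scores.device) * chunk_size
    return (local_idx + offsets[:, None]).flatten(-2)  # global entry ids

# Scores are parameter-free per-head l2 norms of the pyramid entries.
H, n, d, k = 8, 16384, 128, 2048
k_pyramid = torch.randn(H, n, d)        # flattened K entries across levels
key_scores = k_pyramid.norm(dim=-1)     # [H, n]
idx = stratified_topk(key_scores, k)
print(idx.shape)                        # torch.Size([8, 2048])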

The top-K step is discrete and non-differentiable: no straight-through estimator, no Gumbel softmax. Selection indices carry no gradient. Gradients flow only through the gathered Q, K, V entries into W_Q, W_K, W_V, so the projections learn to produce values that are useful when selected rather than scores that are good at selecting.
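
A toy autograd check of this routing, assuming a stand-in tensor for the projected keys, shows that only the selected rows receive gradient:

import torch

x = torch.randn(16, 8, requires_grad=True)  # stand-in for projected K rows
scores = x.norm(dim=-1)                     # parameter-free scorer
idx = scores.topk(4).indices                # discrete: indices carry no grad
x[idx].sum().backward()                     # gradients flow through the gather
print(x.grad.abs().sum(dim=-1))             # nonzero only at the 4 selected rows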

In the third stage, the selected entries are gathered into a contiguous sub-sequence of length S = N/p^(L−1) + (L−1)·p·k and passed to standard FlashAttention. At N = 1,000,000 with L = 4, p = 4, k = 4,096, S ≈ 65,000, far smaller than N. A critical property of the gather process is that it guarantees no "holes" or empty regions in the assembled sub-sequence. This matters especially because Lighthouse also compresses queries: a gap in the sequence would mean those missing tokens have no gradient path during the backward pass and could cause training instabilities. Asymmetric methods that leave queries at full resolution don't face this problem, but Lighthouse's symmetric design requires that the gathered sub-sequence stays fully dense.
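
A minimal sketch of the gather-then-attend step, assuming a single shared index set per head (a simplification; the paper scores queries and keys separately) and using PyTorch's stock scaled_dot_product_attention in place of FlashAttention:

import torch
import torch.nn.functional as F

B, H, n, d, k = 1, 8, 16384, 128, 2048
q_pyr, k_pyr, v_pyr = (torch.randn(B, H, n, d) for _ in range(3))
idx = torch.randint(0, n, (B, H, k))     # stand-in for Stage 2's selection

gidx = idx.unsqueeze(-1).expand(-1, -1, -1, d)
q_s = q_pyr.gather(2, gidx)              # contiguous, dense [B, H, k, d]
k_s = k_pyr.gather(2, gidx)
v_s = v_pyr.gather(2, gidx)

# Stock attention kernel on the small gathered sub-sequence: O(S^2 d).
# Causal masking is simplified away here; the real pipeline handles
# causality via the scatter-back shift.
out = F.scaled_dot_product_attention(q_s, k_s, v_s)
print(out.shape)                         # torch.Size([1, 8, 2048, 128])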

In the fourth stage, each output entry is scattered back to the p^ℓ base positions it represents via a deterministic integer-atomic scatter kernel, with a shift of p^ℓ − 1 to preserve causality. The per-position fan-in is bounded by L regardless of k.
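
A minimal sketch of the scatter step, assuming index_add_ as a stand-in for the deterministic integer-atomic kernel; the exact shift semantics here are an interpretation of the paper's description:

import torch

def scatter_back(entry_out, entry_pos, level, p, N):
    """Scatter each entry's output onto the p**level base positions it
    summarizes, shifted by p**level - 1 so a base position only receives
    output derived from tokens at or before it (causality)."""
    span = p ** level
    out = torch.zeros(N, entry_out.shape[-1])
    base = entry_pos[:, None] * span + torch.arange(span)  # covered positions
    target = (base + span - 1).clamp(max=N - 1)            # causal shift
    out.index_add_(0, target.reshape(-1),
                   entry_out.repeat_interleave(span, dim=0))
    return out

out = scatter_back(torch.randn(3, 8), torch.tensor([0, 2, 5]),
                   level=1, p=4, N=32)
print(out.shape)  # torch.Size([32, 8]); fan-in per position is at most one per level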

    https://arxiv.org/pdf/2605.06554

Why symmetric pooling changes the compute

Pooling queries alongside keys and values changes the computational character of the attention call from O(N·S·d) to O(S²·d) at training time. Because S ≪ N at long contexts, this is what produces the latency advantage. Benchmarked on a single NVIDIA B200 at 512K context (bfloat16, B=1, H=8, head dimension 128, L=3, p=4, sparsity ≈ 1:64), Lighthouse is 21× faster on the forward pass and 17.3× faster on the combined forward+backward pass relative to cuDNN-backed SDPA.

From an asymptotic standpoint, setting L = log_p(N/k) gives a gathered sub-sequence size of S = Θ(k log N), which makes the dense FlashAttention call cost Θ(k² log² N · d): polylogarithmic in N at fixed k. Combined with the linear-cost stages (pyramid construction, scoring, scatter-back), total per-layer compute is Θ(T·d) at bounded k, the same asymptotic class as linear attention and SSMs, while preserving softmax attention's recall properties on the selected sub-sequence.
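
Worked arithmetic for these asymptotics, using the gathered-length formula S = N/p^(L−1) + (L−1)·p·k from the paper (constant factors and the linear-cost stages are ignored):

N, L, p, k, d = 1_000_000, 4, 4, 4096, 128

S = N // p ** (L - 1) + (L - 1) * p * k
print(S)                                  # 64777, i.e. ~65K vs N = 1,000,000

dense_flops = N * N * d                   # Theta(N^2 d): dense attention
light_flops = S * S * d                   # Theta(S^2 d): gathered sub-sequence
print(round(dense_flops / light_flops))   # ~238x fewer attention FLOPs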

Inference is a different constraint. Autoregressive decoding presents one query at a time, which violates the assumption that all queries co-occur in a single forward pass. Lighthouse is a training-only method, and the symmetric pooling design cannot be used directly at inference.

The two-stage training recipe and recoverability

The experimental setup used a 530M-parameter Llama-3-style decoder (d_model=1024, 30 layers, 8 heads, head dimension 128, FFN width 1536, byte-level tokenizer), trained on C4 at 98,304-token context with AdamW at learning rate 2×10⁻³, β₁=0.9, β₂=0.95, weight decay 0.1, linear warmup over 2k steps, gradient-norm clip 1, bfloat16, and FSDP. One implementation detail that matters for practitioners: of the 30 layers, layers {0, 1, 28, 29} retain dense SDPA throughout; only the other 26 layers use Lighthouse. The inner attention call inside these 26 Lighthouse layers uses the same cuDNN-backed SDPA kernel as the dense baseline.

The training approach is two-stage. Stage 1 trains with Lighthouse selection enabled for the majority of the step budget. Stage 2 resumes the Stage 1 checkpoint under dense SDPA (same optimizer state, same dataloader) for a short tail. If Stage 1 had hollowed out the model's dense-attention capability, Stage 2 recovery would fail.
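
A minimal sketch of the recipe, assuming a model whose attention layers expose a toggle; set_attention and the HF-style .loss output are illustrative, not the torchtitan patch's actual API:

import torch

def train(model, optimizer, dataloader, total_steps=16_000, split=12_000):
    data = iter(dataloader)
    for step in range(total_steps):
        # Stage 1: Lighthouse selection on. Stage 2: same optimizer state
        # and dataloader, but dense SDPA for the short tail.
        model.set_attention(use_lighthouse=(step < split))  # illustrative API
        loss = model(next(data)).loss   # assumes an HF-style output object
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()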

It doesn't fail. Testing at a total budget of 16,000 steps (~50.3B tokens), three split points (10k+6k, 11k+5k, 12k+4k) were evaluated against a dense-from-scratch SDPA baseline. At each resume point the training loss spikes transiently by 1.12–1.57 nats as the model is first run through attention it was not trained against, then recovers within roughly 1,000–1,500 SDPA steps and crosses below the dense baseline. By step 16,000, all three resumed Lighthouse runs reach final losses of 0.6980–0.7102, against the dense baseline's 0.7237, while spending 22.5h to 27.0h wall-clock compared to 37.9h for dense-SDPA-from-scratch at the same token budget.

    Ablations and throughput

The full ablation grid covers scorer type, pooling factor p, number of pyramid levels L, and top-K budget k. Key findings: the projection-norm scorer is within ~0.01 of the dilated softmax-attention scorer in either direction (no uniform winner) but is roughly 9% cheaper in B200-hours, since it skips the attention pass over the pyramid entirely. Shallower pyramids (L=3) consistently outperform deeper ones (L=4, L=5) at matched budgets. Smaller k values produce lower post-resume loss within the tested range; the lowest-loss configuration across the grid is L=3, p=2, k=1536 with the dilated scorer, reaching a final loss of 0.6825. The researchers attribute this counter-intuitive result to hierarchical selection acting as a regularizer at this token-budget scale.

Stage-1 throughput across the ablation grid ranges from 84,000 to 126,000 tokens/s/GPU, against roughly 46,000 for dense SDPA. The projection-norm scorer at L=3, p=4, k=1536 tops the range at 126,000 tokens/s/GPU by skipping the dilated-attention pass entirely.

Long-context retrieval

To complement the loss-based recoverability results, the research team ran a simplified Needle-in-a-Haystack (NIAH) evaluation: a single passkey digit hidden in random alphanumeric filler at depths of 0–100% across context lengths of 4K to 96K tokens, with retrieval scored as a one-token argmax over the ten digit tokens (random chance: 10%). Four Lighthouse configurations (varying k ∈ {1536, 2048} and scorer ∈ {dilated, norm} at L=3, p=4) were tested against the dense-SDPA-from-scratch baseline. Three of the four Lighthouse runs match or beat the dense baseline's mean retrieval rate of 0.72: k=2048 dilated reaches 0.76, k=1536 dilated reaches 0.73, and k=2048 norm matches the baseline at 0.72. Only k=1536 norm dips, to 0.65. A pattern emerges across the grid: larger k is the dominant axis for retrieval performance, and the norm scorer hurts retrieval more than it hurts training loss at the same k. The practical implication is that the optimal configuration depends on whether the downstream task is loss-driven or retrieval-driven.

    Context parallelism scaling

For sequences beyond ~100K tokens, Lighthouse runs under context parallelism (CP). Pyramid pooling, scoring, and top-K run shard-locally on each rank with no inter-rank communication, since the coarsest pool window (e.g., 64 tokens) is orders of magnitude smaller than the shard size. The gathered sub-sequence is dense, so it participates in standard ring attention without sparse-aware collectives, something sparse-index-based methods can't do without engineering specific to the sparse format. Context parallelism introduces roughly 10% per-rank throughput overhead from ring rotation, but the Lighthouse vs. SDPA speedup ratio is preserved. The method scales to 1M-token training across 32 Blackwell GPUs (4 nodes, CP degree 8) with no changes to the inner attention kernel.
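
A toy illustration of why the pre-attention stages need no inter-rank communication under CP, assuming select as a stand-in for shard-local pooling, scoring, and top-K:

import torch

def select(shard, k):
    """Stand-in for shard-local Stages 1-2: pooling, scoring, top-k.
    It reads only the local shard, so no inter-rank communication."""
    scores = shard.norm(dim=-1)
    return shard[scores.topk(k).indices]

N, W, d, k = 1 << 16, 8, 128, 1024     # toy sizes; the paper uses N = 1M, W = 8
seq = torch.randn(N, d)
shards = seq.chunk(W)                  # each CP rank holds a contiguous slice
gathered = [select(s, k // W) for s in shards]   # fully rank-local selection
# Each rank's gathered tensor is dense and contiguous, so it can rotate
# through standard ring attention exactly like a dense KV shard.
print(gathered[0].shape)               # torch.Size([128, 128])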

Marktechpost's Visual Explainer

01  / The Problem

Why Long-Context Training Is Expensive

Every transformer uses scaled dot-product attention (SDPA), which computes a score between every token and every other token in the sequence. As sequence length N grows, this cost scales as Θ(N²) in both compute and memory: it doubles for every ~1.4× increase in context.

FlashAttention reduced this by using IO-aware tiling that avoids ever materializing the full N×N attention matrix in high-bandwidth memory, cutting the memory footprint considerably. But the underlying Θ(N²) compute scaling is unchanged; the wall is still there.

Θ(N²)
SDPA compute & memory scaling

1M
token context frontier models target

32
B200 GPUs needed for 1M-token training

The consequence: teams either train at shorter contexts than they want, or spend huge compute budgets on attention alone. Lighthouse Attention is a method that wraps around standard SDPA during pretraining to reduce this cost, then gets removed so the final model is a normal dense-attention model at inference.

    02  / Prior Work

What Existing Sparse Attention Gets Wrong

Several methods already try to reduce the attention cost by attending to only a subset of tokens. But most share two design decisions that create problems for pretraining.

⚠ Problem 1: Asymmetry

Methods like NSA, HISA, and InfLLM-v2 pool only keys and values but leave queries at full resolution. The hierarchy becomes a compressed memory rather than a true multi-scale representation. It also means the dense attention call stays O(N·S·d) instead of shrinking further.

⚠ Problem 2: Kernel Entanglement

Methods like NSA, DSA, HISA, and MoBA embed selection logic inside a custom attention kernel. This means they can't reuse the optimized FlashAttention kernels that GPU tensor cores are built around. Every sparse method ships its own forward and backward kernels.

The hardest problem: an inference-only sparse method is automatically at most as good as its dense backbone. A training-time sparse method must answer a harder question: once training is done, will the resulting weights still work as a dependable dense-attention model at inference? Most methods don't test this.

Lighthouse Attention treats this recoverability question as its central correctness criterion.

03  / The Method

Lighthouse Attention: Core Idea

Lighthouse is a selection-based hierarchical attention that wraps around, but does not modify, the attention kernel. It adds a pre-processing step that selects a small subset of tokens, runs stock FlashAttention on just that subset, and scatters the output back. At the end of training, you disable Lighthouse and keep the dense model.

Two key design differences from prior work:
✓  Queries, keys, and values are all pooled symmetrically (not just keys/values)
✓  Selection sits outside the attention kernel; FlashAttention runs on a normal dense sub-sequence

21×
faster forward pass vs SDPA at 512K context

17.3×
faster forward+backward at 512K context

1.69×
end-to-end pretraining wall-clock speedup

The method introduces no new learnable parameters and no auxiliary losses. The scoring function is parameter-free, and the top-K selection step is deliberately non-differentiable: no straight-through estimator or Gumbel softmax.

04  / Architecture

The Four-Stage Pipeline

A Lighthouse attention layer replaces the standard SDPA call with four stages. Stages 1 and 4 are custom kernels; stages 2 and 3 are standard PyTorch operations fused by torch.compile.

    1

Pyramid Pool

Average-pool Q, K, and V symmetrically into an L-level pyramid with pooling factor p. Level ℓ has N/p^ℓ tokens, each summarizing p^ℓ base positions. Total cost: Θ(N). Crucially, the coarsest level is always retained in full to guarantee at least one contributor per base position.

    2

Score + Top-K Selection

Each pyramid entry gets two scalar scores using its per-head ℓ₂ norm: one as a query score, one as a key score. A fused chunked-bitonic top-K kernel selects k entries jointly across all pyramid levels. This step is non-differentiable; indices carry no gradient.

    3

Dense Gather + FlashAttention

Selected (Q, K, V) triples are gathered into a contiguous sub-sequence of length S = N/p^(L−1) + (L−1)·p·k, then passed to stock FlashAttention. No custom sparse kernel. The gathered sequence has no holes, which is critical because queries are also compressed.

    4

Scatter-Back

Each output entry is scattered back to the p^ℓ base positions it represents via an integer-atomic scatter kernel. The output is fully dense. Per-position fan-in is bounded by L regardless of k.

05  / Key Design Choice

Why Symmetric Q/K/V Pooling Matters

Most prior hierarchical methods pool only K and V while leaving Q at full resolution. Lighthouse pools all three. This is not cosmetic; it changes the math of the attention call.

Method | Query side | Attention cost
NSA, HISA, InfLLM-v2 | Full resolution (N) | O(N · S · d)
Lighthouse | Pooled (S) | O(S² · d)

Because S ≪ N at long contexts, O(S²·d) is dramatically cheaper than O(N·S·d). At N = 1,000,000 with L=4, p=4, k=4096, S ≈ 65,000.

The no-holes guarantee: compressing queries means every query position must have a gradient path. Lighthouse guarantees no gaps in the gathered sub-sequence, which prevents training instabilities that would arise from tokens with missing gradients. Asymmetric methods that leave Q at full resolution don't face this problem.

At bounded k, setting L = log_p(N/k) gives total per-layer compute of Θ(T·d), the same asymptotic class as linear attention and SSMs, but with softmax attention's recall properties on the selected sub-sequence.

06  / Gradient Flow

Non-Differentiable Selection, Differentiable Training

The top-K step is discrete. Lighthouse deliberately does not approximate it with a straight-through estimator or Gumbel softmax. This is a conscious design choice.

What does NOT get gradients

The selection indices and the scoring function. The ℓ₂-norm scorer is never trained; it has no parameters and receives no gradient signal.

What DOES get gradients

Gradients flow through scatter-back → FlashAttention → gather into the gathered Q̃, K̃, Ṽ and on into W_Q, W_K, W_V.

The result: the projection matrices learn to produce values that are useful when selected, not scores that are good at selecting. This avoids the optimization problems (scorer collapse, scorer–attention misalignment, auxiliary-loss tuning) that learnable selectors in NSA and DSA are prone to.

Complexity comparison across attention families (per-layer compute at bounded k):

Dense softmax: Θ(T² · d)
Log-linear attention: Θ(T log T · d)
Lighthouse (bounded k): Θ(T · d)
Linear attention / SSMs: Θ(T · d)

07  / Training Recipe

Two-Stage Training and Recoverability

The central claim of Lighthouse is that sparse training doesn't break the model's ability to use dense attention at inference. The two-stage recipe is how that claim is validated.

    1

Stage 1: Lighthouse pretraining

Train for the majority of the step budget with Lighthouse selection active. This is the fast stage: ~2× higher throughput than dense SDPA.

    2

Stage 2: Dense SDPA resumption

Resume the Stage 1 checkpoint under standard dense SDPA with the same optimizer state and dataloader. The loss spikes transiently by 1.12–1.57 nats, then recovers within ~1,000–1,500 SDPA steps and crosses below the dense baseline.

Tested at 16,000 total steps (~50.3B tokens) on a 530M Llama-3-style model (d_model=1024, 30 layers, H=8, head dim 128, FFN 1536, byte-level tokenizer, C4 dataset, 98,304-token context) across three split points:

Split | B200-Hours | Tok/s (k) | Final Loss
Dense SDPA baseline | 303.2 | 45.6 | 0.7237
LH 12k + SDPA 4k | 214.7 | 74.7 | 0.7102
LH 11k + SDPA 5k | 219.6 | 75.4 | 0.7001
LH 10k + SDPA 6k | 228.0 | 75.0 | 0.6980

    All three Lighthouse runs beat the dense baseline at matched token budgets.

08  / Implementation Detail

    Not All Layers Use Lighthouse

An important detail for practitioners: in the 30-layer experimental model, layers {0, 1, 28, 29} retain dense SDPA throughout. Only the remaining 26 layers use Lighthouse. The inner attention call inside these Lighthouse layers uses the same cuDNN-backed SDPA kernel as the dense baseline.

This means Lighthouse is a partial replacement, not a full model-wide substitution. Keeping dense attention in the first and last layers is a practical stabilization choice; these boundary layers often carry disproportionate importance for model behavior.

Optimizer setup: AdamW, lr 2×10⁻³, β₁=0.9, β₂=0.95, weight decay 0.1, linear warmup over 2k steps, gradient-norm clip 1, bfloat16, FSDP.

Chunked-bitonic top-K: the kernel produces a stratified top-K, not a strict global top-K. The score stream is partitioned into fixed-size chunks; each chunk maintains an in-register buffer. If the globally highest-scoring entries cluster in a single chunk, some are replaced by lower-scoring entries from other chunks, ensuring every region of the sequence contributes tokens and preventing attention from collapsing onto a narrow span.

S = N / p**(L - 1) + (L - 1) * p * k

# Example: N=1M, L=4, p=4, k=4096
# S = 1,000,000/64 + 3*4*4096
# S = 15,625 + 49,152 ≈ 65,000  (vs 1,000,000 for full attention)

    09  / Ablations

What the Hyperparameter Sweep Shows

The full ablation grid varied scorer type, pooling factor p, pyramid levels L, and top-K budget k. All configurations used the 10k+6k split at 98K context.

Config | Scorer | B200-Hours | Tok/s (k) | Final Loss
SDPA baseline | — | 303.2 | 45.6 | 0.7237
L=3, p=2, k=1536 | Dilated | 203.9 | 93.9 | 0.6825
L=3, p=4, k=1536 | Dilated | 197.2 | 99.5 | 0.6881
L=3, p=4, k=1536 | Norm | 179.6 | 126.0 | 0.6946
L=3, p=2, k=4096 | Dilated | 215.7 | 83.5 | 0.6951

Key findings from the sweep:

Smaller k → better loss (counter-intuitive)
Shallower L=3 beats L=4, L=5
Norm scorer: 9% cheaper, comparable quality
Every config beats the dense baseline

The counter-intuitive finding on k: loss decreases monotonically as k shrinks from 4,096 to 1,536. The authors attribute this to hierarchical selection acting as a regularizer at the 50.3B-token budget. Whether this reverses at larger budgets is left to future work.

10  / Retrieval Evaluation

Needle-in-a-Haystack Results

Beyond training loss, the paper evaluates long-context retrieval using a simplified Needle-in-a-Haystack (NIAH) test: a single passkey digit hidden in random alphanumeric filler at depths of 0–100% across context lengths of 4K–96K tokens. Retrieval is scored as a one-token argmax over the ten digit tokens. Random chance is 10%.

Configuration | Mean Retrieval Rate | vs Baseline
Dense SDPA baseline | 0.72 | —
k=2048, Dilated scorer | 0.76 | +0.04
k=1536, Dilated scorer | 0.73 | +0.01
k=2048, Norm scorer | 0.72 | Matches
k=1536, Norm scorer | 0.65 | −0.07

Three of the four Lighthouse configurations match or beat the dense-from-scratch baseline on retrieval. The norm scorer hurts retrieval more than it hurts training loss at the same k. The practical implication: if your downstream task is retrieval-heavy, use a larger k and the dilated scorer. If optimizing for loss and throughput, the norm scorer with k=1536 is the better trade-off.

    11  / Scaling

    Context Parallelism at 1M Tokens

For sequences beyond ~100K tokens, the 530M model OOMs on a single B200 regardless of attention method (activations + gradients + optimizer state). Lighthouse extends cleanly to multi-GPU context parallelism (CP).

    1

Shard-local pre-attention

Each rank holds a contiguous slice of the sequence. Pyramid pooling, scoring, and top-K all run shard-locally. The coarsest pool window (e.g., 64 tokens) is far smaller than the shard size (N/W ≈ 128K at N=1M, W=8), so no inter-rank communication is needed at this stage.

    2

Standard ring attention

The gathered sub-sequence is dense, so it participates in standard ring attention with no sparse-aware collectives. KV shards rotate through the ring as in a fully dense long-context run. Sparse-index-based methods can't do this: ring rotation requires a contiguous tensor, which their sparse outputs are not.

~10%
ring-rotation overhead in CP vs single-device

1M
token training context achieved

4×8
nodes × GPUs, CP degree 8

The Lighthouse vs. SDPA speedup ratio is fully preserved under matched CP geometry, carrying the advantage cleanly into the 1M-token regime.

12  / Limitations & Resources

Limitations and Open Directions

Key limitation: symmetric Q/K/V pooling presumes all queries co-occur in a single forward pass. Autoregressive decoding presents one query at a time, which violates that assumption. Lighthouse is a training-only method and relies on the dense-SDPA resumption to produce an inference-ready model. The gathered sub-sequence cost is Θ(S²·d): sub-quadratic in N at fixed k, but not strictly linear. Regimes where k must scale with N remain uncharacterized.

Open directions from the paper:

Asymmetric sparse resumption (DSA / NSA / MoBA target)
Per-layer / per-head adaptive k
Vision, audio, and video pyramid extensions
Serving integration (continuous batching, KV-cache)

    Paper

arXiv:2605.06554
“Long Context Pre-Training with Lighthouse Attention”
Peng, Ghosh, Quesnelle (Nous Research)

    Code

github.com/ighoshsubho/lighthouse-attention
Patch on upstream torchtitan + 2 new files

Scorer variants (norm, dilated, gla) are selectable from config. The CP path requires the norm scorer.



    Key Takeaways

• Nous Research's Lighthouse Attention pools Q, K, and V symmetrically across a multi-level pyramid (unlike NSA and HISA, which only pool K and V), cutting the attention call from O(N·S·d) to O(S²·d) and making the expensive step stock FlashAttention on a small dense sub-sequence.
    • It is a training-only method: a brief dense-SDPA resumption at the end converts the checkpoint into a normal full-attention model that matches or beats dense-from-scratch at the same token budget (final loss 0.6980–0.7102 vs. 0.7237 baseline, 16k steps, ~50.3B tokens).
    • At 512K context on a single B200, Lighthouse is 21× faster on the forward pass and 17.3× faster on forward+backward vs. cuDNN SDPA, translating to a 1.40×–1.69× end-to-end pretraining wall-clock speedup.
    • The top-K selection step is deliberately non-differentiable (no straight-through estimator, no Gumbel softmax), so projection matrices learn to produce values that are useful when selected, not to game a learnable scorer.
    • It scales to 1M-token training across 32 Blackwell GPUs (4 nodes, CP degree 8) under context parallelism with no changes to the inner attention kernel, because the gathered sub-sequence is dense and participates in standard ring attention.
