NVIDIA Introduces SANA-WM: A 2.6B-Parameter Open-Source World Model That Generates Minute-Scale 720p Video on a Single GPU

By Naveed Ahmad | 16/05/2026


World models (systems that synthesize realistic video sequences from an initial image and a set of actions) are becoming central to embodied AI, simulation, and robotics research. The core challenge is scaling these systems to generate minute-long, high-resolution video without requiring prohibitively large clusters for both training and inference. Existing open-source baselines either require multi-GPU inference or sacrifice resolution to stay within compute budgets.

NVIDIA's SANA-WM directly targets these bottlenecks. Built on the SANA-Video codebase and available through the NVlabs/Sana GitHub repository, it is a 2.6B-parameter Diffusion Transformer (DiT) trained natively for one-minute generation at 720p with metric-scale 6-DoF camera control. It supports three single-GPU inference variants: a bidirectional generator for high-quality offline synthesis, a chunk-causal autoregressive generator for sequential rollout, and a few-step distilled autoregressive generator for faster deployment. The distilled variant denoises a 60-second 720p clip in 34 seconds on a single RTX 5090 with NVFP4 quantization.

Source: https://arxiv.org/pdf/2605.15178

The Architecture: Four Core Design Choices

1. Hybrid Linear Attention with Gated DeltaNet (GDN)

Standard softmax attention has memory and compute complexity that grows quadratically with sequence length, a major problem when generating 961 latent frames for a 60-second video at 720p. SANA-Video, the predecessor, used cumulative ReLU-based linear attention, which maintains a constant-size recurrent state. However, it has no decay mechanism: all past frames accumulate with equal weight, causing drift over minute-scale sequences.

SANA-WM replaces most attention blocks with frame-wise Gated DeltaNet (GDN). Unlike the token-wise GDN used in language models, SANA-WM's frame-wise variant processes one entire latent frame per recurrent step. The GDN update rule incorporates a decay gate γ (which down-weights stale past frames) and a delta-rule correction (which updates only the residual between the target value and the current state prediction), keeping the recurrent state at a constant D×D size regardless of video length.

To stabilize training, the research team introduces an algebraic key-scaling technique: keys are scaled by 1/√(D·S), where D is the head dimension and S is the number of spatial tokens per frame. This keeps the spectral norm of the transition matrix bounded and eliminates the NaN divergence events observed with standard L2 key normalization (1/√D) or no scaling at all, which caused NaN events at steps 16 and 1, respectively.
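
As an illustration, here is a minimal NumPy sketch of one frame-wise GDN step with this key scaling. The released kernels are fused Triton implementations, and the gates here are plain scalars rather than the model's learned per-head parameters, so treat this as a conceptual sketch only.

    import numpy as np

    def framewise_gdn_step(state, keys, values, gamma, beta):
        """One recurrent step over one latent frame (conceptual sketch).
        state:  (D, D) memory, constant size for any video length
        keys:   (S, D) keys for the S spatial tokens of this frame
        values: (S, D) matching values
        gamma:  decay gate in (0, 1) that down-weights stale past frames
        beta:   write strength for the delta-rule correction
        """
        S, D = keys.shape
        keys = keys / np.sqrt(D * S)      # algebraic key scaling 1/sqrt(D*S)
        state = gamma * state             # decay gate: gradually forget old frames
        for k, v in zip(keys, values):
            pred = state.T @ k            # value the current state predicts for key k
            state = state + beta * np.outer(k, v - pred)  # delta rule: write only the residual
        return state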

The final backbone interleaves 15 frame-wise GDN blocks with 5 softmax attention blocks (at layers 3, 7, 11, 15, and 19) across 20 transformer blocks in total. The softmax blocks provide exact long-range recall where GDN's recurrence alone is insufficient.

2. Dual-Branch Camera Control

Camera-controlled world modeling requires the model to faithfully follow a continuous 6-DoF trajectory, not just align with a text description of the motion. SANA-WM uses two complementary branches that operate at different temporal rates:

• Coarse branch (UCPE attention): Operates at the latent-frame rate. For each latent token, it computes a ray-local camera basis from the camera-to-world pose and intrinsics, then applies a Unified Camera Positional Encoding (UCPE) to the geometric channels of each attention head. This captures global trajectory structure across the full sequence.
• Fine branch (Plücker mixing): Addresses a compression mismatch. Each latent token summarizes eight raw frames, each with its own distinct camera pose. The fine branch computes pixel-wise Plücker raymaps (a 6D representation: ray direction d and moment o×d) from all eight raw frames within one VAE temporal stride, packs them into a 48-channel tensor, and injects this embedding after each self-attention output via a zero-initialized projection. This restores intra-stride camera motion that the coarse branch cannot see at latent-frame resolution (a minimal raymap sketch follows this list).
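
To make the fine branch concrete, here is a minimal NumPy sketch of a per-frame Plücker raymap. The function name and interface are illustrative, and the zero-initialized injection layer is omitted; this sketches the geometry only.

    import numpy as np

    def plucker_raymap(c2w, K, H, W):
        """Pixel-wise Plücker coordinates (d, o x d) for one raw frame.
        c2w: (4, 4) camera-to-world pose; K: (3, 3) intrinsics."""
        ys, xs = np.meshgrid(np.arange(H) + 0.5, np.arange(W) + 0.5, indexing="ij")
        pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # homogeneous pixel coords
        dirs_cam = pix @ np.linalg.inv(K).T                   # camera-space ray directions
        R, o = c2w[:3, :3], c2w[:3, 3]
        d = dirs_cam @ R.T                                    # world-space directions
        d = d / np.linalg.norm(d, axis=-1, keepdims=True)     # unit ray direction
        m = np.cross(np.broadcast_to(o, d.shape), d)          # moment o x d
        return np.concatenate([d, m], axis=-1)                # (H, W, 6)

    # Stacking the 8 raw frames inside one VAE temporal stride yields the
    # (H, W, 48) conditioning tensor described above:
    # raymaps = np.concatenate([plucker_raymap(p, K, H, W) for p in poses], axis=-1)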

Ablations on OmniWorld show that neither branch alone matches the dual approach: UCPE-only achieves a Camera Motion Consistency (CamMC) of 0.2453, while UCPE + Plücker mixing reaches 0.2047.

3. Two-Stage Generation Pipeline

Stage-1 SANA-WM outputs, while spatiotemporally consistent, can contain structural artifacts over long sequences. A second-stage refiner, initialized from the 17B LTX-2 model with rank-384 LoRA adapters fine-tuned on paired synthetic and real video data, corrects these artifacts. It uses truncated-σ flow matching: stage-1 latents are perturbed with a large starting noise (σ_start = 0.9), and the refiner learns to map this noisy input toward the high-fidelity target. Only three Euler denoising steps are needed at inference. The refiner reduces long-horizon visual drift (ΔIQ) from 3.79 to 1.17 on the Easy-Trajectory split, and from 3.09 to 0.31 on the Hard-Trajectory split.
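
A minimal sketch of the truncated-σ inference loop, assuming a rectified-flow-style velocity model; the model interface and the exact noise interpolation are assumptions, not the released code. Starting the Euler integration from σ_start instead of pure noise is what makes three steps sufficient.

    import torch

    SIGMA_START = 0.9  # large starting noise level from the paper

    def refine(stage1_latents, velocity_model, num_steps=3):
        """Truncated-sigma flow matching at inference (conceptual sketch)."""
        x = (1 - SIGMA_START) * stage1_latents \
            + SIGMA_START * torch.randn_like(stage1_latents)
        sigmas = torch.linspace(SIGMA_START, 0.0, num_steps + 1)
        for s, s_next in zip(sigmas[:-1], sigmas[1:]):
            v = velocity_model(x, s)       # predicted velocity toward the clean target
            x = x + (s_next - s) * v       # one Euler step
        return x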

4. Robust Data Annotation Pipeline

Training camera-controlled video generation requires metric-scale 6-DoF pose annotations, information not available in standard video datasets. The research team modified VIPE (a camera-pose annotation engine) by replacing its depth backend with Pi3X (for long-sequence-consistent depth) fused with MoGe-2 (for accurate per-frame metric scale). They also extended the bundle-adjustment stage to treat focal lengths and principal points as per-frame variables rather than shared global intrinsics, enabling more robust annotation of web video with varying focal lengths.

The resulting pipeline covers seven training corpora drawn from multiple open-source sources: SpatialVID-HQ (real, 10s clips), DL3DV real clips (10s), DL3DV GS Refined synthetic clips (60s, rendered via 3D Gaussian Splatting), OmniWorld (synthetic, 60s), Sekai Game (synthetic, 60s), Sekai Walking-HQ (real, 60s), and MiraData (real, 60s). This yields a total of 212,975 clips with metric-scale pose annotations. The LTX2-VAE used for compression is 2.0× smaller than ST-DC-AE and 8.0× smaller than Wan2.1-VAE, which directly improves training and inference efficiency.

For DL3DV, which contains static 3D scene captures rather than native one-minute videos, the research team fit one FCGS 3D Gaussian Splatting reconstruction per scene, designed diverse one-minute camera paths, rendered long videos with known intrinsics and extrinsics, and then refined the rendered outputs with DiFix3D to reduce splatting artifacts.

Training Strategy and Infrastructure

SANA-WM's compute involves two phases on 64 H100 GPUs. First, before DiT training, the team adapts the LTX2 VAE to the SANA-Video SFT training data over roughly 50K steps, taking roughly 3.5 days. The main DiT training then follows a four-stage progressive schedule lasting roughly 15 days:

• Stage 1 (~2.75 days): Adapt the pre-trained SANA-Video model to the frame-wise GDN architecture on short (5s) video clips. This replaces cumulative linear attention with the recurrent GDN blocks in a cheaper, short-horizon training regime where failure modes are easier to diagnose.
• Stage 2 (~2 days): Introduce hybrid attention by replacing every fourth GDN block with a standard softmax attention block in the same short-clip setting, improving the efficiency-quality trade-off.
• Stage 3 (~8 days): Extend training to 961-frame (60-second) sequences and incorporate Dual-Branch Camera Control. Context-Parallel (CP=2) sharding distributes the latent sequence across GPUs using prefix-sum composition of GDN transition matrices, a mathematically exact parallelization strategy with minimal communication overhead (sketched after this list).
• Stage 4 (~2.5 days): Fine-tune a chunk-causal variant for autoregressive rollout, then apply self-forcing distillation to reduce sampling to four denoising steps. Attention-sink tokens and local temporal windows are added to the softmax attention layers to keep memory and per-chunk latency constant across long rollouts.
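
The prefix-sum trick works because each GDN chunk acts on the recurrent state as an affine map s → A·s + B, and affine maps compose associatively. Below is a minimal serial sketch of that composition; the distributed tree reduction and the model's actual transition matrices are not reproduced here.

    import numpy as np

    def compose(f, g):
        """Compose two affine recurrences s -> A s + B (f applied first)."""
        A1, B1 = f
        A2, B2 = g
        return A2 @ A1, A2 @ B1 + B2

    def prefix_scan(chunk_transitions):
        """Inclusive prefix scan over per-chunk (A, B) transitions.
        Because compose() is associative, chunks can live on different GPUs
        and be combined in a log-depth tree; each rank learns the exact
        state entering its chunk after exchanging only these small matrices.
        Shown serially here for clarity."""
        acc = chunk_transitions[0]
        out = [acc]
        for t in chunk_transitions[1:]:
            acc = compose(acc, t)
            out.append(acc)
        return out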

Custom fused Triton kernels for the GDN scan and gate operations contribute roughly 1.5× to 2× efficiency gains throughout training.

Benchmark Results

The research team introduces a purpose-built 60-second world-model benchmark with 80 initial scenes generated by Nano Banana Pro across four scene categories (game, indoor, outdoor-city, and outdoor-nature; 20 per category), each paired with Easy and Hard camera-trajectory splits. The main evaluation uses each model's multi-step, undistilled autoregressive setting.

Source: https://arxiv.org/pdf/2605.15178

On this benchmark, SANA-WM with the second-stage refiner achieves the following across both splits:

• Camera accuracy (Easy / Hard): Rotation error (RotErr) of 4.50° / 8.34°; translation error (TransErr) of 1.39 / 1.39; CamMC of 1.41 / 1.44, the best among all compared methods, including LingBot-World (14B+14B parameters, 8 GPUs) and HY-WorldPlay (8B parameters, 8 GPUs).
• Visual quality: 80.62 / 81.89 VBench Overall on the Easy / Hard splits, comparable to LingBot-World (81.82 / 81.89) while producing 720p outputs on a single GPU per clip.
• Throughput: 22.0 videos/hour on 8 H100s with the full pipeline (refiner included), compared to 0.6 videos/hour for LingBot-World, a 36× throughput advantage.
• Memory: The full pipeline fits in 74.7 GB, within the 80 GB H100 budget. Stage-1-only inference fits in 51.1 GB.
• Temporal stability: After refinement, ΔIQ (imaging-quality degradation from the first to the last 10-second window) drops to 1.17 on Easy and 0.31 on Hard, compared to 23.59 and 25.88 for HY-WorldPlay.

Marktechpost's Visual Explainer

NVIDIA SANA-WM Guide

arXiv:2605.15178

01 / 09  •  Overview

    What Is SANA-WM?

SANA-WM is an open-source world model from NVIDIA that takes a single image and a camera trajectory as input, then synthesizes a realistic 60-second, 720p video that faithfully follows that trajectory. Think of it as: one image, infinite explorable worlds.

Most world models either require large multi-GPU inference clusters or sacrifice resolution to stay within budget. SANA-WM makes minute-scale, 720p, camera-controlled generation practical: training on 64 H100 GPUs and inference on a single GPU.

    2.6B
    Parameters (open-source)

    720p
Native output resolution

    60s
Native generation length

Key insight: SANA-WM treats efficiency as a first-class objective, not an afterthought. Its distilled variant denoises a full 60-second 720p clip in 34 seconds on a single RTX 5090 with NVFP4 quantization.

02 / 09  •  The Problem

Why Existing World Models Fall Short

Generating a 60-second video at 720p means modeling 961 latent frames. Standard softmax attention, the default in most video diffusion models, has memory and compute that grow quadratically with sequence length. At minute scale, this runs out of memory on any single GPU; the back-of-envelope sketch below illustrates the gap.
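
A quick back-of-envelope comparison. The spatial token count S and head dimension D below are illustrative assumptions, not values from the paper.

    # Quadratic attention vs constant-size GDN state (illustrative numbers)
    T = 961      # latent frames in a 60-second 720p video (from the paper)
    S = 2048     # assumed spatial tokens per latent frame
    D = 128      # assumed attention head dimension

    softmax_entries = (T * S) ** 2   # attention score matrix grows quadratically
    gdn_entries = D * D              # GDN recurrent state is constant-size

    print(f"softmax scores: {softmax_entries:.3e} entries")
    print(f"GDN state: {gdn_entries} entries, regardless of video length")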

Model Params Res GPUs Throughput
LingBot-World 14B+14B 480p 8 0.6 vids/hr
HY-WorldPlay 8B 480p 8 1.1 vids/hr
Matrix-Game 3.0 5B 720p 8 3.1 vids/hr
SANA-WM 2.6B 720p 1 24.1 vids/hr

SANA-WM addresses this with four architectural designs working together: hybrid linear attention, dual-branch camera control, a two-stage refinement pipeline, and a robust data annotation pipeline.

03 / 09  •  Architecture

Design 1: Hybrid Linear Attention with Gated DeltaNet (GDN)

Standard softmax attention grows quadratically with context length. SANA-Video (the predecessor) used cumulative ReLU-based linear attention: constant memory, but no decay mechanism, so all past frames accumulate with equal weight, causing drift at minute scale.

SANA-WM introduces frame-wise Gated DeltaNet (GDN). Unlike token-wise GDN (used in LLMs), each recurrent step processes an entire latent frame. It adds two corrections to the recurrent state:

• γ Decay gate: forgets stale past-frame content by multiplying the previous state by a learned decay scalar.
• β Delta-rule correction: updates only the residual between the target value and the current state prediction, not the full state.

The state stays D×D regardless of video length. To prevent gradient instability, keys are scaled by 1/√(D·S), where D is the head dimension and S is the number of spatial tokens per frame. Without this, NaN events appear as early as training step 1.

Final backbone: 20 transformer blocks in total, with 15 frame-wise GDN blocks and 5 softmax attention blocks at layers {3, 7, 11, 15, 19}. The softmax blocks anchor long-range spatial consistency where GDN alone is insufficient.

04 / 09  •  Architecture

Design 2: Dual-Branch Camera Control

Camera-controlled world modeling requires faithful adherence to a continuous 6-DoF trajectory, not just text-described motion. SANA-WM uses two complementary branches operating at different temporal rates:

🌎 Coarse Branch: UCPE

Operates at the latent-frame rate. Computes a ray-local camera basis from the camera-to-world pose and intrinsics, then applies a Unified Camera Positional Encoding (UCPE) to the geometric channels of each attention head. Captures global 6-DoF trajectory structure across the full sequence.

📷 Fine Branch: Plücker Mixing

Addresses a compression mismatch: each latent token summarizes 8 raw frames, each with a distinct camera pose. Computes pixel-wise Plücker raymaps (a 6D representation: ray direction d and moment o×d) from all 8 raw frames per VAE temporal stride, packs them into a 48-channel tensor, and injects this after each self-attention output via a zero-initialized projection.

Camera Encoding RotErr ↓ TransErr ↓ CamMC ↓
No control 16.93 0.2347 0.4937
Plücker only 16.02 0.2340 0.4742
UCPE only 7.73 0.1350 0.2453
UCPE + Plücker 6.21 0.1162 0.2047

05 / 09  •  Architecture

Design 3: Two-Stage Generation Pipeline

Stage-1 SANA-WM outputs are spatiotemporally consistent but can contain structural artifacts over long sequences. A dedicated second-stage refiner corrects these.

• 1. Initialization: The refiner starts from the 17B LTX-2 model with rank-384 LoRA adapters applied to the attention (Q/K/V/O) and feed-forward projections. LoRA-only fine-tuning keeps it lightweight versus full 17B optimization.

• 2. Truncated-σ flow matching: Stage-1 latents are perturbed with large starting noise (σ_start = 0.9). The refiner learns to map this noisy input toward the high-fidelity target: refinement rather than full reconstruction.

• 3. Inference: Only 3 Euler denoising steps are needed. The LoRA adapters are merged into the distilled LTX-2 base, with minimal impact on end-to-end throughput.

1.17
ΔIQ after refiner (Easy split) vs 3.79 before

0.31
ΔIQ after refiner (Hard split) vs 3.09 before

22.0
Videos/hr on 8 H100s (full pipeline)

ΔIQ = imaging-quality score in the first 10s window minus the last 10s window. Lower = less degradation over the minute.

06 / 09  •  Architecture

Design 4: Robust Data Annotation Pipeline

Training camera-controlled generation requires metric-scale 6-DoF pose annotations, information not available in standard video datasets. The team modified the VIPE pose-annotation engine:

Depth backend upgrade

Replaced single-frame Metric3D-Small with Pi3X (long-sequence-consistent 3D structure) fused with MoGe-2 (accurate per-frame metric scale). The fusion solves for a per-frame scale factor minimizing a weighted depth error, smoothed via an exponential moving average (momentum 0.99).
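
A minimal sketch of that fusion step, assuming a closed-form weighted least-squares scale per frame; the paper's exact objective and weighting are not reproduced here.

    import numpy as np

    def fuse_scale(pi3x_depth, moge2_depth, weights, prev_scale=None, momentum=0.99):
        """Align Pi3X's sequence-consistent depth to MoGe-2's metric depth
        with one weighted least-squares scale per frame, then smooth the
        scale across frames with an exponential moving average."""
        num = float((weights * pi3x_depth * moge2_depth).sum())
        den = float((weights * pi3x_depth ** 2).sum())
        scale = num / den                                  # closed-form WLS solution
        if prev_scale is not None:
            scale = momentum * prev_scale + (1 - momentum) * scale  # EMA, momentum 0.99
        return scale * pi3x_depth, scale                   # metric depth, running scale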

    Per-frame intrinsics

Extended bundle adjustment to treat focal lengths and principal points as per-frame variables rather than shared global intrinsics, enabling robust annotation of web video with varying focal lengths.

Source Type Duration Clips
SpatialVID-HQ Real 10s 158,369
DL3DV (real) Real 10s 5,691
DL3DV (GS Refined) Synthetic 60s 14,881
OmniWorld Synthetic 60s 1,720
Sekai Game Synthetic 60s 3,560
Sekai Walking-HQ Real 60s 9,767
MiraData Real 60s 18,987
Total — — 212,975

07 / 09  •  Training

Progressive Training Pipeline

Training has two phases on 64 H100 GPUs. First, a VAE pre-adaptation step (~3.5 days, 50K steps) adapts the LTX2 VAE to the SANA-Video SFT data. Then the main DiT training proceeds in four progressive stages (~15 days):

• 1. Frame-wise GDN (~2.75 days): Adapt SANA-Video to the GDN recurrent architecture on short 5s clips. The LTX2-VAE is 2.0× smaller than ST-DC-AE and 8.0× smaller than Wan2.1-VAE, cutting token count before any attention is computed.

• 2. Hybrid Attention (~2 days): Replace every 4th GDN block with softmax attention in the same 5s short-clip setting to improve the efficiency-quality trade-off before scaling up.

• 3. Minute-Scale + CamCtrl (~8 days): Extend to 961-frame (60s) sequences with Dual-Branch Camera Control. Context-Parallel (CP=2) sharding uses prefix-sum composition of GDN transition matrices: mathematically exact, minimal communication overhead.

• 4. SFT + Distillation (~2.5 days): Fine-tune a chunk-causal autoregressive variant on ~50K high-quality clips. Apply self-forcing distillation to reduce sampling to 4 denoising steps. Add attention-sink tokens and local temporal windows to keep softmax memory constant across long rollouts (a sketch of such a mask follows this list).
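
A minimal sketch of an attention mask combining sink tokens with a local temporal window. The num_sinks and window values are illustrative; the paper's settings are not quoted here.

    import numpy as np

    def sink_plus_local_mask(seq_len, num_sinks=4, window=256):
        """Boolean causal attention mask: every query attends to a few
        global 'sink' tokens plus a local temporal window, so softmax
        memory and per-chunk latency stay constant during long rollouts."""
        q = np.arange(seq_len)[:, None]
        k = np.arange(seq_len)[None, :]
        causal = k <= q
        local = (q - k) < window
        sinks = k < num_sinks
        return causal & (local | sinks)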

Efficiency: Custom fused Triton kernels for the GDN scan and gate operations contribute ~1.5× to 2× throughput gains throughout all stages.

08 / 09  •  Results

Benchmark Results on the 60-Second World-Model Benchmark

Evaluated on 80 scenes (game, indoor, outdoor-city, outdoor-nature) across Easy and Hard camera-trajectory splits. The main table uses the multi-step, undistilled autoregressive setting.

Method Res GPUs RotErr↓ TransErr↓ CamMC↓ VBench↑ Tput↑
LingBot-World 480p 8 10.47/18.99 2.01/1.65 2.05/1.81 81.82/81.89 0.6
HY-WorldPlay 480p 8 17.89/35.46 2.36/2.34 2.45/2.64 68.82/70.46 1.1
Matrix-Game 3.0 720p 8 12.96/18.79 1.83/1.67 1.92/1.82 78.53/78.79 3.1
SANA-WM+refiner 720p 1 4.50/8.34 1.39/1.39 1.41/1.44 80.62/81.89 22.0

Values shown as Easy/Hard split. RotErr in degrees. Tput = videos/hour on 8 H100s. Full-pipeline memory: 74.7 GB, within the 80 GB H100 budget.

Best Camera Accuracy
36× Higher Throughput vs LingBot-World
720p on 1 GPU
Comparable Visual Quality

09 / 09  •  Access

How to Access SANA-WM

SANA-WM is open-source and available through the NVlabs/Sana GitHub repository (Apache 2.0 license for code; individual dataset and weight licenses vary; see Table 11 of the paper). The repo also hosts SANA, SANA-1.5, SANA-Sprint, and SANA-Video.

    # Clone the repo
    git clone https://github.com/NVlabs/Sana.git
    cd Sana && ./environment_setup.sh sana
    Three Inference Variants

▶ Bidirectional: high-quality offline synthesis (best quality, 49.2 GB)

▶ Chunk-causal AR: sequential rollout for streaming (51.1 GB)

▶ Distilled AR + NVFP4: 34s per 60s clip on an RTX 5090

Resources

    📄 Paper: arXiv:2605.15178

🌎 Project page: nvlabs.github.io/Sana/WM/

    📊 GitHub: github.com/NVlabs/Sana

🤔 Limitations: no explicit 3D scene memory; can drift in dynamic scenes or rare viewpoints

Practical workflow suggested by the authors: search trajectories efficiently with the stage-1 model, then selectively refine promising rollouts with the second-stage refiner for higher fidelity.

    Key Takeaways

• NVIDIA's SANA-WM generates 60-second, 720p, camera-controlled videos on a single GPU, trained in ~18.5 days on 64 H100s with only 212,975 public video clips.
• A hybrid Gated DeltaNet + softmax attention backbone keeps the recurrent state at a constant D×D size regardless of video length, solving the memory explosion that makes minute-scale generation impractical with standard softmax attention.
• Dual-branch camera control (UCPE at the latent-frame rate and Plücker mixing at the raw-frame rate) brings CamMC down to 0.2047, the best among all compared methods, including models 5× larger.
• A second-stage refiner initialized from 17B LTX-2 with rank-384 LoRA cuts long-horizon visual drift (ΔIQ) from 3.09 to 0.31 on Hard trajectories using just 3 Euler denoising steps.
• At 22.0 videos/hour on 8 H100s, SANA-WM + refiner delivers 36× higher throughput than LingBot-World (14B+14B, 8 GPUs) at comparable VBench visual-quality scores.


    Naveed Ahmad

Naveed Ahmad is a technology journalist and AI writer at ArticlesStock, covering artificial intelligence, machine learning, and emerging tech policy.
