Most text-to-video models generate a single clip from a prompt and then stop. They do not maintain an internal world state that persists as actions arrive over time. PAN, a new model from MBZUAI's Institute of Foundation Models, is designed to fill that gap by acting as a general world model that predicts future world states as video, conditioned on history and natural language actions.
From video generator to interactive world simulator
PAN is defined as a general, interactable, long-horizon world model. It maintains an internal latent state that represents the current world, then updates that state when it receives a natural language action such as 'turn left and speed up' or 'move the robot arm to the red block.' The model then decodes the updated state into a short video segment that shows the consequence of that action. This cycle repeats, so the same world state evolves across many steps.
This design allows PAN to support open-domain, action-conditioned simulation. It can roll out counterfactual futures for different action sequences. An external agent can query PAN as a simulator, compare predicted futures, and choose actions based on those predictions.
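A minimal sketch of this interaction pattern is shown below, using a hypothetical `PANSimulator` interface (the class and method names are illustrative, not PAN's released API); the point is that one persistent state object is threaded through every step.

```python
# Hypothetical sketch of the observe -> act -> predict loop described above.
# All names and internals are illustrative stand-ins, not PAN's actual API.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WorldState:
    """Latent world state that persists across simulation steps."""
    latent: list  # placeholder for the latent embedding produced by the backbone


class PANSimulator:
    def encode(self, frames: list) -> WorldState:
        """Vision encoder: map observed frames to an initial latent world state."""
        return WorldState(latent=list(frames))  # stub

    def step(self, state: WorldState, action: str) -> Tuple[WorldState, list]:
        """Predict the next latent state from history + action, then decode a video chunk."""
        next_state = WorldState(latent=state.latent + [action])   # stub dynamics update
        video_chunk = [f"frame conditioned on '{action}'"]        # stub decoder output
        return next_state, video_chunk


def rollout(sim: PANSimulator, initial_frames: list, actions: List[str]) -> list:
    """Roll the same persistent world state forward through a sequence of actions."""
    state = sim.encode(initial_frames)
    video = []
    for action in actions:
        state, chunk = sim.step(state, action)
        video.extend(chunk)
    return video


if __name__ == "__main__":
    sim = PANSimulator()
    clip = rollout(sim, ["obs_0"], ["turn left and speed up", "move the robot arm to the red block"])
    print(clip)
```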
GLP architecture: separating what happens from how it looks
The base of PAN is the Generative Latent Prediction (GLP) architecture. GLP separates world dynamics from visual rendering. First, a vision encoder maps images or video frames into a latent world state. Second, an autoregressive latent dynamics backbone based on a large language model predicts the next latent state, conditioned on history and the current action. Third, a video diffusion decoder reconstructs the corresponding video segment from that latent state.
In PAN, the vision encoder and backbone are built on Qwen2.5-VL-7B-Instruct. The vision tower tokenizes frames into patches and produces structured embeddings. The language backbone runs over a history of world states and actions, plus learned query tokens, and outputs the latent representation of the next world state. These latents live in the shared multimodal space of the VLM, which helps ground the dynamics in both text and vision.
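The shape of this three-part forward pass can be sketched in PyTorch; the modules and dimensions below are small stand-ins for the Qwen2.5-VL encoder and backbone and the Wan2.1 decoder, not the real components.

```python
# Minimal GLP-style forward pass: encode observations, predict the next latent state
# via learned query tokens, decode it. Sizes and modules are illustrative stand-ins.
import torch
import torch.nn as nn


class GLPSketch(nn.Module):
    def __init__(self, dim: int = 64, num_queries: int = 8):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 16 * 16, dim)                 # patch embed stand-in
        self.query_tokens = nn.Parameter(torch.randn(num_queries, dim))   # learned query tokens
        self.dynamics_backbone = nn.TransformerEncoder(                   # LLM backbone stand-in
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2
        )
        self.video_decoder = nn.Linear(dim, 3 * 16 * 16)                  # diffusion decoder stand-in

    def forward(self, patches: torch.Tensor, action_emb: torch.Tensor) -> torch.Tensor:
        # 1) Encode observed frames into latent world-state tokens.
        obs_tokens = self.vision_encoder(patches)                         # (B, T, dim)
        # 2) Predict the next latent state from history + action via the learned queries.
        queries = self.query_tokens.unsqueeze(0).expand(patches.size(0), -1, -1)
        seq = torch.cat([obs_tokens, action_emb, queries], dim=1)
        next_latent = self.dynamics_backbone(seq)[:, -queries.size(1):]
        # 3) Decode the predicted latent into the next video chunk (raw pixels here for brevity).
        return self.video_decoder(next_latent)


model = GLPSketch()
patches = torch.randn(2, 4, 3 * 16 * 16)     # batch of 2 histories, 4 patch tokens each
action = torch.randn(2, 1, 64)               # one pooled action-text embedding per sample
print(model(patches, action).shape)          # torch.Size([2, 8, 768])
```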
The video diffusion decoder is adapted from Wan2.1-T2V-14B, a diffusion transformer for high-fidelity video generation. The research team trains this decoder with a flow matching objective, using one thousand denoising steps and a Rectified Flow formulation. The decoder conditions on both the predicted latent world state and the current natural language action, with a dedicated cross attention stream for the world state and another for the action text.
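A generic rectified flow training step looks roughly like the following; this is one plausible reading of a flow matching objective, not the team's exact implementation, and the velocity network here is a placeholder for the conditioned diffusion transformer.

```python
# Generic rectified-flow / flow-matching training step (schematic, not PAN's code).
import torch
import torch.nn as nn

velocity_net = nn.Linear(16, 16)   # stand-in for the video diffusion transformer
opt = torch.optim.AdamW(velocity_net.parameters(), lr=1e-4)

def flow_matching_step(x1: torch.Tensor) -> torch.Tensor:
    """One step: regress the straight-line velocity between a noise sample and the data."""
    x0 = torch.randn_like(x1)                 # pure noise sample
    t = torch.rand(x1.size(0), 1)             # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                # linear interpolation along the rectified-flow path
    target_velocity = x1 - x0                 # constant velocity of the straight path
    pred = velocity_net(xt)                   # the real decoder would also take latent state + action
    loss = ((pred - target_velocity) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss

print(float(flow_matching_step(torch.randn(8, 16))))
```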
Causal Swin DPM and sliding window diffusion
Naively chaining single-shot video models by conditioning only on the last frame leads to local discontinuities and rapid quality degradation over long rollouts. PAN addresses this with Causal Swin DPM, which augments the Shift Window Denoising Process Model with chunk-wise causal attention.
The decoder operates on a sliding temporal window that holds two chunks of video frames at different noise levels. During denoising, one chunk moves from high noise to clean frames and then leaves the window. A new noisy chunk enters at the other end. Chunk-wise causal attention ensures that the later chunk can only attend to the earlier one, not to unseen future actions. This keeps transitions between chunks smooth and reduces error accumulation over long horizons.
PAN also adds controlled noise to the conditioning frame, rather than using a perfectly sharp frame. This suppresses incidental pixel details that do not matter for dynamics and encourages the model to focus on stable structure such as objects and layout.
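The two core ideas, a chunk-wise causal attention mask and a lightly noised conditioning frame, can be illustrated with a short sketch; the chunk sizes and noise level below are arbitrary choices, not PAN's.

```python
# Sketch in the spirit of Causal Swin DPM: a chunk-wise causal mask plus a noised
# conditioning frame. Chunk sizes, noise scale, and layout are illustrative.
import torch

def chunkwise_causal_mask(num_chunks: int, chunk_len: int) -> torch.Tensor:
    """Tokens in a later chunk may attend to earlier chunks (and their own), never to future chunks."""
    chunk_ids = torch.arange(num_chunks).repeat_interleave(chunk_len)   # e.g. [0,0,0,1,1,1]
    return chunk_ids.unsqueeze(0) <= chunk_ids.unsqueeze(1)             # (T, T) boolean mask

def noise_conditioning_frame(frame: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Lightly noise the conditioning frame so the model keys on stable structure, not pixel detail."""
    return frame + sigma * torch.randn_like(frame)

# Two chunks share the sliding window at different noise levels: the earlier chunk is
# nearly clean and about to exit, the newly entered chunk is still highly noised.
mask = chunkwise_causal_mask(num_chunks=2, chunk_len=3)
print(mask.int())
print(noise_conditioning_frame(torch.ones(1, 3, 4, 4)).shape)
```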
Training stack and data construction
PAN is trained in two stages. In the first stage, the research team adapts Wan2.1-T2V-14B into the Causal Swin DPM architecture. They train the decoder in BFloat16 with AdamW, a cosine schedule, gradient clipping, FlashAttention-3 and FlexAttention kernels, and a hybrid sharded data parallel scheme across 960 NVIDIA H200 GPUs.
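A schematic of that optimizer setup is below; hyperparameter values are placeholders, and the sharding scheme and attention kernels are omitted.

```python
# Stage-1-style training step: AdamW, cosine decay, gradient clipping, bf16 autocast.
# Hyperparameters and the tiny stand-in module are placeholders, not reported values.
import torch
import torch.nn as nn

decoder = nn.Linear(128, 128)                           # stand-in for the adapted video decoder
opt = torch.optim.AdamW(decoder.parameters(), lr=1e-4, weight_decay=0.01)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10_000)

def train_step(batch: torch.Tensor) -> float:
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):   # BFloat16 forward pass
        loss = decoder(batch).pow(2).mean()                         # placeholder objective
    loss.backward()
    torch.nn.utils.clip_grad_norm_(decoder.parameters(), max_norm=1.0)  # gradient clipping
    opt.step()
    sched.step()
    opt.zero_grad()
    return float(loss)

print(train_step(torch.randn(4, 128)))
```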
In the second stage, they integrate the frozen Qwen2.5-VL-7B-Instruct backbone with the video diffusion decoder under the GLP objective. The vision-language model stays frozen. The model learns query embeddings and the decoder so that predicted latents and reconstructed videos stay consistent. This joint training also uses sequence parallelism and Ulysses-style attention sharding to handle long-context sequences. Early stopping ends training after 1 epoch once validation converges, although the schedule allows 5 epochs.
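The freezing pattern itself is straightforward to express; the module names below are stand-ins for the real components.

```python
# Stage-2-style parameter freezing: the VLM backbone is frozen while the learned
# query embeddings and the video decoder are updated. Modules are illustrative.
import torch
import torch.nn as nn

vlm_backbone = nn.Linear(64, 64)               # stand-in for the frozen Qwen2.5-VL backbone
query_embeddings = nn.Parameter(torch.randn(8, 64))   # learned query tokens
video_decoder = nn.Linear(64, 64)              # stand-in for the diffusion decoder

for p in vlm_backbone.parameters():
    p.requires_grad = False                    # backbone stays frozen during joint GLP training

trainable = [query_embeddings] + list(video_decoder.parameters())
opt = torch.optim.AdamW(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable), "trainable parameters")
```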
Training data comes from widely used, publicly available video sources that cover everyday activities, human-object interactions, natural environments, and multi-agent scenarios. Long-form videos are segmented into coherent clips using shot boundary detection. A filtering pipeline removes static or overly dynamic clips, low aesthetic quality, heavy text overlays, and screen recordings using rule-based metrics, pretrained detectors, and a custom VLM filter. The research team then re-captions clips with dense, temporally grounded descriptions that emphasize motion and causal events.
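A toy version of such a rule-based filter is shown below; the thresholds are hypothetical, and the precomputed scores stand in for the pretrained detectors and the VLM filter.

```python
# Toy clip filter mirroring the described criteria. Thresholds and score fields are
# hypothetical; the real pipeline relies on detectors and a custom VLM filter.
from dataclasses import dataclass

@dataclass
class Clip:
    motion_score: float        # rule-based motion estimate in [0, 1]
    aesthetic_score: float     # pretrained aesthetic model output in [0, 1]
    text_overlay_ratio: float  # fraction of frame area covered by text
    is_screen_recording: bool

def keep_clip(c: Clip) -> bool:
    if not (0.05 < c.motion_score < 0.9):   # drop static or overly dynamic clips
        return False
    if c.aesthetic_score < 0.4:             # drop low aesthetic quality
        return False
    if c.text_overlay_ratio > 0.2:          # drop heavy text overlays
        return False
    return not c.is_screen_recording        # drop screen recordings

print(keep_clip(Clip(0.3, 0.7, 0.05, False)))   # True
```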
Benchmarks: action fidelity, long-horizon stability, planning
The research team evaluates the model along three axes, action simulation fidelity, long-horizon forecasting, and simulative reasoning and planning, against both open source and commercial video generators and world models. Baselines include WAN 2.1 and 2.2, Cosmos 1 and 2, V-JEPA 2, and commercial systems such as KLING, MiniMax Hailuo, and Gen 3.
For action simulation fidelity, a VLM-based judge scores how well the model executes language-specified actions while maintaining a stable background. PAN reaches 70.3% accuracy on agent simulation and 47% on environment simulation, for an overall score of 58.6%. It achieves the highest fidelity among open source models and surpasses most commercial baselines.
For long-horizon forecasting, the research team measures Transition Smoothness and Simulation Consistency. Transition Smoothness uses optical flow acceleration to quantify how smooth motion is across action boundaries. Simulation Consistency uses metrics inspired by WorldScore to track degradation over extended sequences. PAN scores 53.6% on Transition Smoothness and 64.1% on Simulation Consistency and exceeds all baselines, including KLING and MiniMax, on these metrics.
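One simple reading of an optical-flow-acceleration score is the mean second temporal difference of per-frame flow fields; the sketch below is schematic and is not the benchmark's exact definition.

```python
# Schematic flow-acceleration smoothness score: smaller second differences of the
# per-frame optical flow mean smoother motion across action boundaries.
import numpy as np

def transition_smoothness(flows: np.ndarray) -> float:
    """flows: (T, H, W, 2) per-frame optical flow fields; returns a higher-is-smoother score."""
    acceleration = np.diff(flows, n=2, axis=0)    # second temporal difference of the flow
    mean_accel = np.abs(acceleration).mean()
    return 1.0 / (1.0 + mean_accel)

flows = np.random.randn(10, 8, 8, 2) * 0.1        # toy flow fields
print(transition_smoothness(flows))
```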
For simulative reasoning and planning, PAN is used as an internal simulator inside an OpenAI o3-based agent loop. In step-wise simulation, PAN achieves 56.1% accuracy, the best among open source world models.
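The agent-side loop is conceptually simple; the sketch below uses stub functions in place of PAN and the judge model.

```python
# Hypothetical planning loop: an agent proposes candidate actions, asks the world model
# for predicted futures, scores them, and commits to the best one. All names are stubs.
from typing import Callable, List, Tuple

def plan_one_step(
    simulate: Callable[[str, str], str],   # (state, action) -> predicted future (e.g. a rollout summary)
    score: Callable[[str], float],         # task-specific scorer, e.g. a VLM/LLM judge
    state: str,
    candidate_actions: List[str],
) -> Tuple[str, str]:
    scored = [(score(simulate(state, a)), a) for a in candidate_actions]
    best_score, best_action = max(scored)
    return best_action, f"chose '{best_action}' with score {best_score:.2f}"

# Toy usage with stand-ins for the simulator and the judge.
best, info = plan_one_step(
    simulate=lambda s, a: f"{s} after {a}",
    score=lambda future: float(len(future) % 7),
    state="robot at table",
    candidate_actions=["push the red block left", "grasp the red block"],
)
print(info)
```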
Key Takeaways
- PAN implements the Generative Latent Prediction architecture, combining a Qwen2.5-VL-7B-based latent dynamics backbone with a Wan2.1-T2V-14B-based video diffusion decoder, to unify latent world reasoning and realistic video generation.
- The Causal Swin DPM mechanism introduces a sliding-window, chunk-wise causal denoising process that conditions on partially noised previous chunks, which stabilizes long-horizon video rollouts and reduces temporal drift compared with naive last-frame conditioning.
- PAN is trained in two stages, first adapting the Wan2.1 decoder to Causal Swin DPM on 960 NVIDIA H200 GPUs with a flow matching objective, then jointly training the GLP stack with a frozen Qwen2.5-VL backbone, learned query embeddings, and the decoder.
- The training corpus consists of large-scale video-action pairs from diverse domains, processed with segmentation, filtering, and dense temporal recaptioning, enabling PAN to learn action-conditioned, long-range dynamics instead of isolated short clips.
- PAN achieves state-of-the-art open source results on action simulation fidelity, long-horizon forecasting, and simulative planning, with reported scores such as 70.3% agent simulation, 47% environment simulation, 53.6% transition smoothness, and 64.1% simulation consistency, while remaining competitive with leading commercial systems.
Comparison Table
| Dimension | PAN | Cosmos video2world WFM | Wan2.1 T2V 14B | V-JEPA 2 |
|---|---|---|---|---|
| Organization | MBZUAI Institute of Foundation Models | NVIDIA Research | Wan AI and Open Laboratory | Meta AI |
| Primary purpose | General world model for interactive, long-horizon world simulation with natural language actions | World foundation model platform for Physical AI with video-to-world generation for control and navigation | High-quality text-to-video and image-to-video generator for general content creation and editing | Self-supervised video model for understanding, prediction, and planning tasks |
| World model framing | Explicit GLP world model with latent state, action, and next observation defined; focuses on simulative reasoning and planning | Described as a world foundation model that generates future video worlds from past video and a control prompt, aimed at Physical AI, robotics, driving, navigation | Framed as a video generation model, not primarily a world model; no persistent internal world state described in the docs | Joint embedding predictive architecture for video, focused on latent prediction rather than explicit generative supervision in observation space |
| Core architecture | GLP stack: vision encoder from Qwen2.5-VL-7B, LLM-based latent dynamics backbone, video diffusion decoder with Causal Swin DPM | Family of diffusion-based and autoregressive world models with video2world generation, plus a diffusion decoder and prompt upsampler based on a language model | Spatio-temporal variational autoencoder and diffusion transformer T2V model at 14 billion parameters, supporting multiple generative tasks and resolutions | JEPA-style encoder plus predictor architecture that matches latent representations of consecutive video observations |
| Backbone and latent space | Multimodal latent space from Qwen2.5-VL-7B, used both for encoding observations and for autoregressive latent prediction under actions | Token-based video2world model with text prompt conditioning and an optional diffusion decoder for refinement; latent space details depend on the model variant | Latent space from a VAE plus diffusion transformer, driven mainly by text or image prompts, with no explicit agent action sequence interface | Latent space built from a self-supervised video encoder with a predictive loss in representation space, not a generative reconstruction loss |
| Action or control input | Natural language actions in dialogue format, applied at every simulation step; the model predicts the next latent state and decodes video conditioned on action and history | Control input as a text prompt and optionally camera pose for navigation and downstream tasks such as humanoid control and autonomous driving | Text prompts and image inputs for content control, with no explicit multi-step agent action interface described as world model control | Does not handle natural language actions; used more as a visual representation and predictor module inside larger agents or planners |
| Long horizon design | Causal Swin DPM sliding window diffusion, chunk-wise causal attention, conditioning on a slightly noised last frame to reduce drift and maintain stable long-horizon rollouts | Video2world model generates future video given a past window and prompt, supports navigation and long sequences, but the paper does not describe a Causal Swin DPM-style mechanism | Can generate several seconds at 480P and 720P, focuses on visual quality and motion; long-horizon stability is evaluated via Wan Bench but without an explicit world state mechanism | Long temporal reasoning comes from predictive latent modeling and self-supervised training, not from generative video rollouts with explicit diffusion windows |
| Training data focus | Large-scale video-action pairs across diverse physical and embodied domains, with segmentation, filtering, and dense temporal recaptioning for action-conditioned dynamics | Mix of proprietary and public Internet videos focused on Physical AI categories such as driving, manipulation, human activity, navigation, and nature dynamics, with a dedicated curation pipeline | Large open-domain video and image corpora for general visual generation, with Wan Bench evaluation prompts, not targeted specifically at agent-environment rollouts | Large-scale unlabelled video data for self-supervised representation learning and prediction; details in the V-JEPA 2 paper |
PAN is an important step because it operationalizes Generative Latent Prediction with production-scale components such as Qwen2.5-VL-7B and Wan2.1-T2V-14B, then validates this stack on well-defined benchmarks for action simulation, long-horizon forecasting, and simulative planning. The training and evaluation pipeline is clearly documented by the research team, the metrics are reproducible, and the model is released within a transparent world modeling framework rather than as an opaque video demo. Overall, PAN shows how a vision-language backbone plus a diffusion video decoder can function as a practical world model rather than a pure generative toy.
Check out the Paper, Technical details, and Project page.
Max is an AI analyst at MarkTechPost, based in Silicon Valley, who actively engages with the future of technology. He teaches robotics at Brainvyne, combats spam with ComplyEmail, and uses AI daily to translate complex tech developments into clear, understandable insights.