Moonshot AI Researchers Introduce Seer: An Online Context Learning System for Fast Synchronous Reinforcement Learning RL Rollouts

How do you keep reinforcement learning for large reasoning models from stalling on a few very long, very slow rollouts while GPUs sit underused? A team of researchers from Moonshot AI and Tsinghua University introduces ‘Seer’, a new online context learning system that targets a specific systems bottleneck in reinforcement learning for large language models. In synchronous on-policy setups, the rollout phase dominates the cost of each iteration. Seer restructures this phase and reports rollout throughput gains of 74% to 97% and tail latency reductions of 75% to 93% compared with a strong synchronous baseline called veRL.

https://arxiv.org/pdf/2511.14617

Why is synchronous rollout slow for reasoning models?

Modern reasoning RL workloads use long chain-of-thought style outputs. In the Seer experiments, the researchers apply GRPO to three different models: Moonlight, Qwen2 VL 72B and Kimi K2. These workloads run on 32 compute nodes with 8 H800 GPUs per node. The three tasks use 32, 128 and 256 GPUs respectively, with 400, 600 and 800 prompts per iteration and 8 or 16 responses per prompt.

Maximum generation length is large. Moonlight is configured for 65,536 tokens, Qwen2 VL 72B for 40,960 tokens and Kimi K2 for 98,304 tokens. A single long chain-of-thought request can grow from a few hundred megabytes of KVCache to tens of gigabytes as decoding progresses. This memory growth forces instances to reduce concurrency or to preempt requests, which triggers expensive re-decoding.
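
For a rough sense of scale, per-request KVCache size grows linearly with the number of decoded tokens. The sketch below uses the standard estimate for a transformer with grouped-query attention (keys plus values, for every layer and token). The layer count, KV head count, head dimension and dtype size are hypothetical placeholders chosen for illustration, not the actual configurations of Moonlight, Qwen2 VL 72B or Kimi K2.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate the KV cache size of one request.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical model shape (assumption, not taken from the paper).
SHAPE = dict(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2)


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


early = kv_cache_bytes(1_000, **SHAPE)    # shortly after the prompt
late = kv_cache_bytes(98_304, **SHAPE)    # at a 98,304-token generation limit

print(f"{gib(early):.2f} GiB -> {gib(late):.2f} GiB")  # prints "0.31 GiB -> 30.00 GiB"
```

Under these assumed dimensions a request starts at a few hundred megabytes and ends at tens of gigabytes, which is the growth pattern the paragraph above describes.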

The research team defines tail requests as the last 10% of requests to finish in a rollout. For Moonlight and Qwen2 VL 72B, this tail alone can consume up to 50% of the total rollout time in the baseline system. Rollout already dominates iteration time, so this tail effect directly slows RL.
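
One way to read that definition is as the share of wall-clock time that elapses after the first 90% of requests have finished. The sketch below computes that share from a list of per-request finish times. The function name and the sample timings are made up for illustration and are not measurements from the paper.

```python
def tail_time_share(finish_times, tail_percent=10):
    """Fraction of rollout wall time spent waiting on the tail.

    The tail is the last `tail_percent` percent of requests to finish
    (at least one request). All requests are assumed to start at time 0.
    """
    if not finish_times:
        raise ValueError("finish_times must not be empty")
    times = sorted(finish_times)
    n = len(times)
    n_tail = max(1, (n * tail_percent) // 100)
    # Moment at which the last non-tail request finished (0 if every request is tail).
    head_done = times[n - n_tail - 1] if n > n_tail else 0.0
    total = times[-1]
    return (total - head_done) / total


# Hypothetical finish times in seconds for ten requests: one straggler.
sample = [10, 12, 15, 18, 20, 22, 25, 28, 30, 60]
share = tail_time_share(sample)  # 0.5: half the rollout is spent on the last request
```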

Seer architecture on top of Mooncake and vLLM

Seer keeps the RL algorithm identical to synchronous veRL. Each training iteration uses only data from the current rollout iteration, so the system preserves on-policy behavior. The training phase uses Megatron for distributed optimization. The rollout phase uses an in-house implementation of vLLM as the inference engine.

To support aggressive request scheduling, Seer relies on a Global KVCache Pool built on the Mooncake disaggregated KVCache architecture used in production for Kimi. Mooncake provides a two-tier DRAM and SSD KV cache store shared across inference nodes, which allows Seer to migrate requests without recomputing prefills.

On top of this substrate, Seer introduces three key mechanisms:

Divided Rollout
Context-Aware Scheduling
Adaptive Grouped Speculative Decoding

These are orchestrated by a Request Buffer, a Context Manager and an Inference Engine Pool connected to the Global KVCache Pool.

Divided Rollout, fine-grained scheduling and migration

Conventional synchronous rollout assigns entire GRPO groups to inference instances. A group is a set of requests that share one prompt. Once assigned, a group stays on the same instance until all responses finish. Because output lengths vary widely, this leads to load imbalance and long-running stragglers.

Seer breaks groups down in two steps. It first decomposes each group into individual requests. It then divides each request into multiple chunks based on generation length. When the scheduler dispatches a request from the Request Buffer, it sets a small max tokens value, such as 8,000 tokens, for that chunk. After each chunk, the request is re-enqueued until it reaches an end-of-sequence token or its original max tokens limit.
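
The chunk-and-re-enqueue loop can be sketched as a small simulation. This is a toy model under stated assumptions, not Seer's implementation: `Request`, `run_divided_rollout` and the `true_length` field (which stands in for the point where the model would emit an end-of-sequence token) are hypothetical names, and the inference engine is replaced by simple arithmetic.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    true_length: int   # hidden from the scheduler; stands in for hitting EOS
    max_tokens: int    # the request's original generation limit
    generated: int = 0
    chunks: int = 0


def run_divided_rollout(requests, chunk_tokens=8000):
    """Serve requests chunk by chunk, re-enqueueing any that are unfinished."""
    buffer = deque(requests)
    finished = []
    while buffer:
        req = buffer.popleft()
        # The per-chunk budget never exceeds what is left of the original limit.
        budget = min(chunk_tokens, req.max_tokens - req.generated)
        # Stand-in for the engine: decode until EOS or until the budget runs out.
        produced = min(budget, req.true_length - req.generated)
        req.generated += produced
        req.chunks += 1
        hit_eos = req.generated >= req.true_length
        hit_limit = req.generated >= req.max_tokens
        if hit_eos or hit_limit:
            finished.append(req)
        else:
            buffer.append(req)  # back into the Request Buffer for another chunk
    return finished


done = run_divided_rollout([
    Request("a", true_length=3_000, max_tokens=65_536),
    Request("b", true_length=20_000, max_tokens=65_536),
    Request("c", true_length=100_000, max_tokens=65_536),
])
```

In this run request `a` finishes in one chunk, `b` in three, and `c` is cut off at its original 65,536-token limit after nine chunks, so short requests leave the system early instead of waiting behind long ones.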

Because KVCache is stored in the Global KVCache Pool, divided requests can move between instances at chunk boundaries without re-running the prefill. The scheduler maintains a concurrency level that keeps memory utilization high while avoiding preemption. This reduces waste and smooths KVCache utilization across the iteration.
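
As an illustration of placement at a chunk boundary, the sketch below picks the instance with the most free KVCache capacity that can still fit the next chunk, and holds the request back if none can. This policy, the block-based accounting and the name `pick_instance` are assumptions made for the example. The article does not specify Seer's actual placement rule.

```python
def pick_instance(free_kv_blocks, needed_blocks):
    """Choose where to run the next chunk of a request.

    free_kv_blocks: {instance_name: free KVCache blocks} (hypothetical accounting).
    Returns the instance with the most free blocks that can fit the chunk,
    or None when no instance has room.
    """
    candidates = {
        name: free for name, free in free_kv_blocks.items() if free >= needed_blocks
    }
    if not candidates:
        # Leave the request in the buffer rather than risk a preemption later.
        return None
    return max(candidates, key=candidates.get)


target = pick_instance({"inst-0": 10, "inst-1": 40, "inst-2": 25}, needed_blocks=20)
blocked = pick_instance({"inst-0": 10}, needed_blocks=20)
```

Because the request's KVCache lives in a shared pool, choosing a different instance for the next chunk does not require recomputing the prefill.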

Context-Aware Scheduling using group length statistics

The research team observes that different requests in the same group tend to have correlated output lengths. Seer uses this structure as online context. For each prompt group, it designates one request as the speculative request. The scheduler keeps speculative requests in a high-priority queue and serves them with a smallest-first policy based on generated tokens so far. Short requests complete quickly and exit. Long requests remain and identify groups that are potential tail candidates.
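
A smallest-first queue over speculative requests can be sketched with a binary heap keyed on tokens generated so far. This is a simplified single-worker simulation: `serve_speculative` and the hidden `true_lengths` mapping are hypothetical, and each pop serves one chunk of at most `chunk_tokens` tokens.

```python
import heapq


def serve_speculative(true_lengths, chunk_tokens=8000):
    """Serve one speculative request per group, smallest-first.

    true_lengths: {group_id: hidden output length of that group's speculative request}.
    Returns group ids in the order their speculative requests complete.
    """
    # Heap entries are (generated_tokens_so_far, group_id).
    heap = [(0, gid) for gid in sorted(true_lengths)]
    heapq.heapify(heap)
    completed = []
    while heap:
        generated, gid = heapq.heappop(heap)
        generated = min(generated + chunk_tokens, true_lengths[gid])
        if generated >= true_lengths[gid]:
            completed.append(gid)  # short probes exit early
        else:
            heapq.heappush(heap, (generated, gid))  # long probes stay in the queue
    return completed


order = serve_speculative({"g1": 30_000, "g2": 5_000, "g3": 12_000})
```

Here the probes for `g2` and `g3` finish first, while `g1` keeps cycling through the queue. A group whose probe is still running after several chunks is the kind of group the scheduler can flag as a likely tail candidate.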

The Context Manager maintains a length estimate for each group. It updates this estimate to the maximum generated length among completed requests in the group. If no request has finished, it uses the original max tokens as a conservative bound. Once speculative requests are in flight or completed, Seer schedules the remaining requests with an approximate longest-first policy at group level. This design achieves throughput and tail behavior close to an oracle scheduler that knows all output lengths in advance.
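
The estimate-and-sort logic described above fits in a few lines. The sketch below is a minimal model of it: the class and method names are hypothetical, and the real system also has to account for in-flight requests and instance capacity.

```python
class ContextManager:
    """Toy per-group length estimator (names are illustrative assumptions)."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.completed = {}  # group_id -> lengths of completed requests

    def record_completion(self, group_id, length):
        self.completed.setdefault(group_id, []).append(length)

    def estimate(self, group_id):
        # Max completed length in the group, or the conservative bound
        # (the original max tokens) if nothing in the group has finished yet.
        lengths = self.completed.get(group_id)
        return max(lengths) if lengths else self.max_tokens

    def longest_first(self, group_ids):
        """Order groups so those expected to run longest are scheduled first."""
        return sorted(group_ids, key=self.estimate, reverse=True)


cm = ContextManager(max_tokens=65_536)
cm.record_completion("g1", 4_000)
cm.record_completion("g1", 9_000)
cm.record_completion("g2", 30_000)

schedule = cm.longest_first(["g1", "g2", "g3"])  # ["g3", "g2", "g1"]
```

Group `g3` has no completed request, so it falls back to the 65,536-token bound and is scheduled first. Starting likely-long groups early is what keeps them from becoming the tail.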

Adaptive Grouped Speculative Decoding

Seer adds Adaptive Grouped Speculative Decoding on top of the previous two components to accelerate decoding, especially for long requests in the tail. It introduces a Distributed Grouped Draft Server, or DGDS. DGDS maintains a Compressed Suffix Tree for each group and aggregates token sequences from all requests in that group. Instances asynchronously append generated tokens to DGDS, periodically fetch updated suffix trees and perform local speculative decoding based on the shared pattern statistics.
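
To make the idea of drafting from sibling responses concrete, the sketch below swaps the Compressed Suffix Tree for a much simpler fixed-length n-gram table. It is a toy stand-in, not DGDS: `GroupDraftIndex` and its methods are hypothetical names, there is no distribution or asynchrony, and the drafted tokens would still have to be verified by the target model before being accepted.

```python
from collections import defaultdict


class GroupDraftIndex:
    """Per-group n-gram table: maps a context of `context_len` tokens
    to counts of the tokens that followed it in sibling responses."""

    def __init__(self, context_len=2):
        self.context_len = context_len
        self.continuations = defaultdict(lambda: defaultdict(int))

    def append(self, tokens):
        """Index one (partial) response from the group."""
        n = self.context_len
        for i in range(len(tokens) - n):
            key = tuple(tokens[i:i + n])
            self.continuations[key][tokens[i + n]] += 1

    def draft(self, prefix, max_draft_len):
        """Propose up to `max_draft_len` tokens by following the most frequent continuation."""
        n = self.context_len
        context = list(prefix[-n:])
        drafted = []
        while len(drafted) < max_draft_len and len(context) == n:
            options = self.continuations.get(tuple(context))
            if not options:
                break  # pattern not seen in this group yet
            next_token = max(options, key=options.get)
            drafted.append(next_token)
            context = context[1:] + [next_token]
        return drafted


index = GroupDraftIndex(context_len=2)
index.append([1, 2, 3, 4, 5])
index.append([9, 2, 3, 4, 6, 7])
index.append([8, 3, 4, 6, 7])

proposal = index.draft([0, 2, 3], max_draft_len=3)  # [4, 6, 7]
```

The more responses a group has produced, the more contexts the table covers, which is why sharing statistics across the group helps the long requests that are still decoding.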

The system adjusts draft length and the number of paths according to model architecture, batch size and measured acceptance length. For dense and Mixture of Experts models, it pre-computes different speculation thresholds and uses them to bound draft depth for each batch. In late tail stages, concurrency is low, so Seer increases draft depth and enables multi-path drafting to raise accepted tokens per step.
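
The article does not give the thresholds or the exact rule, so the following is only a hypothetical heuristic that reproduces the qualitative behavior: a fixed per-step verification budget is shared across the batch, draft depth is capped near the recently observed acceptance length, and leftover budget is spent on extra draft paths. The function name, the `token_budget` parameter and all numbers are assumptions.

```python
def choose_draft_config(batch_size, mean_accept_len, token_budget, max_paths=4):
    """Pick (draft_depth, num_paths) for the current batch.

    token_budget: hypothetical precomputed cap on draft tokens verified per
    step, which would differ between dense and Mixture of Experts models.
    """
    per_request = max(1, token_budget // max(1, batch_size))
    # Do not draft much deeper than what has recently been accepted.
    depth = max(1, min(per_request, int(mean_accept_len) + 1))
    # Spend any remaining per-request budget on alternative draft paths.
    paths = max(1, min(max_paths, per_request // depth))
    return depth, paths


busy = choose_draft_config(batch_size=128, mean_accept_len=3.2, token_budget=256)
tail = choose_draft_config(batch_size=4, mean_accept_len=3.2, token_budget=256)
```

With these made-up numbers a full batch gets a shallow single-path draft, `(2, 1)`, while a nearly empty tail-stage batch gets a deeper multi-path draft, `(4, 4)`.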

Ablation results show that divided rollout yields up to 35% throughput improvement over the baseline. Adding Context-Aware Scheduling increases this to up to 47% over the baseline. Enabling grouped speculative decoding raises the total speedup to 77% to 87% over the baseline in the evaluated iteration.

End-to-end impact on RL training

The research team evaluates Seer on three RL tasks built on Moonlight, Qwen2 VL 72B and Kimi K2. They run 10 rollout iterations per task and measure output tokens per second and completion time for each rollout. Seer improves rollout throughput by 74% to 97% across these workloads relative to veRL with the same RL algorithm and vLLM-based inference engine.
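
To translate those throughput gains into wall-clock terms, assume the rollout produces the same number of output tokens in both systems. That is a simplifying assumption for this back-of-the-envelope calculation, not a claim from the paper. Rollout time then scales as the inverse of throughput.

```python
def time_fraction(throughput_gain_pct):
    """Rollout time relative to baseline for a given throughput gain,
    assuming the same total number of output tokens."""
    return 1.0 / (1.0 + throughput_gain_pct / 100.0)


low = time_fraction(74)   # about 0.575 of the baseline rollout time
high = time_fraction(97)  # about 0.508 of the baseline rollout time
```

Under that assumption, a 74% to 97% throughput gain corresponds to rollouts that take roughly 43% to 49% less time than the baseline.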

Tail latency is reduced by 75% to 93%. For memory-constrained tasks, the baseline system spends up to half of its time on the last 10% of requests. Seer removes most of this tail by combining Divided Rollout, Context-Aware Scheduling and Adaptive Grouped Speculative Decoding on top of the Mooncake-based Global KVCache Pool.

Key Takeaways

Rollout bottleneck: Seer targets the rollout phase of synchronous RL, which accounts for about 63% to 87% of iteration time and is dominated by long-tail requests and KV cache fragmentation.
Three core mechanisms: Seer combines divided rollout, context-aware scheduling and adaptive grouped speculative decoding to exploit output length and pattern similarity among GRPO responses that share a prompt.
Fine-grained scheduling on a global KV cache: Requests are split into chunks and migrated across a Mooncake-style Global KVCache Pool, which preserves synchronous on-policy RL while keeping GPU memory utilization high and reducing preemptions.
Online context for tail latency reduction: Group-level length statistics from speculative requests drive context-aware scheduling that approximates an oracle longest-first scheduler and sharply reduces the time spent on the last 10% of requests.
Measured end-to-end gains: On production-grade RL workloads with Moonlight, Qwen2 VL 72B and Kimi K2, Seer improves rollout throughput by 74% to 97% and reduces long-tail latency by 75% to 93% relative to a state-of-the-art synchronous vLLM-based baseline.

Seer is an important systems contribution because it optimizes the rollout phase in synchronous RL without changing the underlying GRPO algorithm, so it preserves on-policy guarantees and reproducibility while fixing a real infrastructure bottleneck. The combination of divided rollout, context-aware scheduling and adaptive grouped speculative decoding offers a practical template for other RL stacks that rely on long chain-of-thought reasoning models and large KVCache footprints. Overall, Seer shows that online context learning at the systems level is now as important as model architecture for scaling reasoning RL efficiently.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
