How do you keep reinforcement learning for large reasoning models from stalling on a few very long, very slow rollouts while GPU utilization collapses? A team of researchers from Moonshot AI and Tsinghua University has introduced 'Seer', a new online context learning system that targets a specific system bottleneck in reinforcement learning for large language models. In a synchronous on-policy setup, the rollout phase dominates the cost of each iteration. Seer re-engineers this phase and reports rollout throughput gains of 74 percent to 97 percent and tail latency reductions of 75 percent to 93 percent compared to a strong synchronous baseline, veRL.

Why is synchronous rollout slow for reasoning models?
Modern reasoning RL workloads generate long chain-of-thought style outputs. In Seer's experiments, the researchers apply GRPO to three tasks built on Moonlight, Qwen2-VL-72B, and Kimi K2. These workloads run on 32 compute nodes with 8 H800 GPUs per node. The three tasks use 32, 128, and 256 GPUs respectively, with 400, 600, and 800 prompts per iteration and 8 or 16 responses per prompt.
The maximum generation lengths are large. Moonlight is configured for 65,536 tokens, Qwen2-VL-72B for 40,960 tokens, and Kimi K2 for 98,304 tokens. As decoding progresses, a long chain-of-thought request can grow from a few hundred megabytes to tens of gigabytes of KVCache. This memory growth forces instances to reduce concurrency or preempt requests, which triggers costly re-decoding.
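To see why a single long request balloons like this, a back-of-the-envelope calculation helps. The layer count, KV-head count, head dimension, and fp16 precision below are illustrative assumptions for a large model, not the actual configurations of Moonlight or Kimi K2:

```python
# Back-of-the-envelope KVCache growth for one long chain-of-thought request.
# Model dimensions here are illustrative assumptions, not real configs.
def kvcache_bytes(tokens, layers=61, kv_heads=8, head_dim=128, bytes_per_val=2):
    # Each token stores one K and one V vector per layer:
    # 2 * kv_heads * head_dim values, each `bytes_per_val` bytes in fp16.
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per_val

for tokens in (1_024, 16_384, 98_304):
    gib = kvcache_bytes(tokens) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.1f} GiB")
```

Under these assumed dimensions, a request grows from roughly a quarter of a GiB at 1K tokens to over 20 GiB at the 98K-token limit, which matches the hundreds-of-megabytes-to-tens-of-gigabytes range described above.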
The research team defines tail requests as the last 10 percent of requests to finish in a rollout. For Moonlight and Qwen2-VL-72B, this tail alone can consume up to 50 percent of total rollout time in the baseline system. Rollout already dominates iteration time, so this tail effect directly slows down RL.

Seer's architecture on top of Mooncake and vLLM
Seer keeps the RL algorithm identical to synchronous veRL. Each training iteration uses only data from the current rollout iteration, so the system preserves on-policy behavior. The training phase uses Megatron for distributed optimization. The rollout phase uses an in-house implementation of vLLM as the inference engine.
To support aggressive request scheduling, Seer relies on a global KVCache pool built on Mooncake, the KVCache-centric architecture used in production for Kimi. Mooncake provides a two-tier DRAM and SSD KVCache store shared across inference nodes, which lets Seer migrate requests between instances without recomputing the prefill.
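The key property of this shared store is the lookup order: a migrated request checks the fast DRAM tier, then the SSD tier, and only recomputes the prefill on a full miss. The toy sketch below shows just that lookup path; real Mooncake adds transfer scheduling, replication, and RDMA, and all names here are illustrative:

```python
# Toy sketch of a two-tier (DRAM + SSD) shared KVCache pool, as described
# above. Illustrative only: real Mooncake handles transfer scheduling,
# replication, and RDMA transport, none of which is modeled here.

class TieredKVPool:
    def __init__(self, dram_capacity):
        self.dram = {}                 # fast tier: request_id -> KV blocks
        self.ssd = {}                  # large capacity tier
        self.dram_capacity = dram_capacity

    def put(self, request_id, kv_blocks):
        if len(self.dram) >= self.dram_capacity:
            # Demote an arbitrary entry to SSD (no real eviction policy).
            victim, blocks = self.dram.popitem()
            self.ssd[victim] = blocks
        self.dram[request_id] = kv_blocks

    def get(self, request_id):
        # Any instance can fetch the KVCache: DRAM first, then SSD.
        if request_id in self.dram:
            return self.dram[request_id]
        if request_id in self.ssd:
            blocks = self.ssd.pop(request_id)
            self.put(request_id, blocks)   # promote back to DRAM
            return blocks
        return None                        # cache miss -> must re-run prefill
```

Because `get` succeeds from either tier, a request that moves to a different inference node still finds its prefix KVCache and skips the prefill, which is what makes the fine-grained migration described below affordable.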
On top of this substrate, Seer introduces three main mechanisms:
- split rollout
- context-aware scheduling
- adaptive grouped speculative decoding
These are coordinated through a request buffer, a context manager, and a pool of inference engines connected to the global KVCache pool.

Split rollout, fine-grained scheduling, and migration
Traditional synchronous rollout assigns entire GRPO groups to inference instances. A group is a set of requests that share a prompt. Once assigned, a group stays on the same instance until all of its responses finish. Because output lengths vary widely, this causes load imbalance and long-running stragglers.
Seer breaks this down at two levels. It first decomposes each group into individual requests. It then divides each request into chunks based on generation length. When the scheduler dispatches a request from the request buffer, it sets a small maximum token budget for that chunk, for example 8,000 tokens. After each chunk, the request is re-queued until it emits an end-of-sequence token or reaches its original maximum token limit.
Because the KVCache lives in the global pool, split requests can migrate between instances at chunk boundaries without re-running the prefill. The scheduler maintains a concurrency level that keeps memory utilization high while avoiding preemption. This reduces wasted work and smooths KVCache usage across the iteration.
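The chunked re-queueing loop can be sketched as follows. This is a minimal single-threaded sketch, assuming a hypothetical `generate_chunk` engine call; it ignores migration, concurrency control, and the KVCache pool:

```python
import heapq
from dataclasses import dataclass, field

CHUNK_TOKENS = 8_000  # per-dispatch token budget, as described above

@dataclass(order=True)
class Request:
    generated: int = 0          # tokens produced so far (heap key)
    max_tokens: int = field(default=65_536, compare=False)

def run_rollout(requests, generate_chunk):
    """Re-queue each request after every chunk until EOS or its token limit.

    `generate_chunk(req, budget)` is a hypothetical engine call that decodes
    up to `budget` tokens and returns (tokens_emitted, hit_eos).
    """
    buffer = list(requests)     # the request buffer, shortest-so-far first
    heapq.heapify(buffer)
    finished = []
    while buffer:
        req = heapq.heappop(buffer)
        budget = min(CHUNK_TOKENS, req.max_tokens - req.generated)
        emitted, hit_eos = generate_chunk(req, budget)
        req.generated += emitted
        if hit_eos or req.generated >= req.max_tokens:
            finished.append(req)         # request complete
        else:
            heapq.heappush(buffer, req)  # re-queue for its next chunk
    return finished
```

The point of the small per-chunk budget is that no single dispatch can monopolize an instance: between chunks the scheduler regains control, so it can rebalance load or migrate the request through the global KVCache pool.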
Context-Aware Scheduling Using Group Length Statistics
The research team observed that output lengths of requests within the same group are correlated. Seer exploits this structure as an online signal. For each prompt group, it designates one request as a speculative request. The scheduler places speculative requests in a higher-priority queue and serves them with a smallest-first policy based on tokens generated so far. Short requests finish quickly and free capacity. Long requests remain and identify the groups that are likely tail candidates.
The context manager maintains a length estimate for each group. It updates this estimate to the maximum generated length among the group's completed requests. If no request in a group has finished, it uses the original maximum token limit as a conservative bound. Once speculative requests are in flight or done, Seer schedules the remaining requests with an estimated-longest-first policy at the group level. This design achieves throughput and tail behavior close to an oracle scheduler that knows all output lengths in advance.
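The estimate update and the estimated-longest-first ordering can be sketched directly from that description. Class and method names here are illustrative, not Seer's actual API:

```python
# Sketch of group-level length estimation and estimated-longest-first
# scheduling, following the description above.

class ContextManager:
    def __init__(self, group_max_tokens):
        # Before any request in a group finishes, fall back to the group's
        # configured max_tokens as a conservative length estimate.
        self.estimates = dict(group_max_tokens)
        self.seen_completion = {g: False for g in group_max_tokens}

    def on_request_complete(self, group, generated_len):
        # Estimate = max generated length among completed requests in a group.
        if not self.seen_completion[group]:
            self.seen_completion[group] = True
            self.estimates[group] = generated_len
        else:
            self.estimates[group] = max(self.estimates[group], generated_len)

    def schedule_order(self, pending):
        # pending: list of (group, request_id) pairs.
        # Serve groups with the largest estimated length first.
        return sorted(pending, key=lambda gr: -self.estimates[gr[0]])
```

A group whose speculative request finished short is immediately demoted behind groups that still carry their conservative max-token estimate, which is how long-tail groups get started early.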

Adaptive Grouped Speculative Decoding
Seer adds adaptive grouped speculative decoding on top of the previous two components to speed up decoding, especially for long requests in the tail. The core component is a Distributed Grouped Draft Server, or DGDS. DGDS maintains a compressed suffix tree per group and aggregates token sequences from all requests in that group. Instances append generated tokens to DGDS asynchronously, periodically fetch updated suffix trees, and perform local speculative decoding based on the shared pattern statistics.
The system adapts the draft length and number of draft paths to the model architecture, batch size, and measured acceptance length. For dense and mixture-of-experts models, it pre-computes different speculation thresholds and uses them to cap draft depth per batch. In late rollout stages, concurrency is low, so Seer increases draft depth and enables multi-path drafting to raise the number of accepted tokens per step.
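The core idea, drafting continuations from patterns that sibling requests in the same group have already produced, can be illustrated with a toy stand-in. The sketch below uses a shared n-gram table rather than DGDS's compressed suffix tree, and omits verification against the target model; all names are illustrative:

```python
from collections import defaultdict

# Toy stand-in for DGDS: requests in one group publish their tokens, and a
# drafter proposes the most frequent continuation of the current suffix.
# Real DGDS uses a compressed suffix tree and verifies drafts against the
# target model; neither is modeled here.

class GroupDraftServer:
    def __init__(self, context=4):
        self.context = context
        self.next_counts = defaultdict(lambda: defaultdict(int))

    def append(self, tokens):
        # Record which token followed each length-`context` suffix.
        for i in range(self.context, len(tokens)):
            key = tuple(tokens[i - self.context:i])
            self.next_counts[key][tokens[i]] += 1

    def draft(self, suffix, depth):
        # Greedily extend `suffix` by up to `depth` tokens using the shared
        # statistics; stop as soon as the current suffix is unseen.
        out = []
        cur = list(suffix[-self.context:])
        for _ in range(depth):
            cands = self.next_counts.get(tuple(cur))
            if not cands:
                break
            tok = max(cands, key=cands.get)
            out.append(tok)
            cur = cur[1:] + [tok]
        return out
```

Because responses in a GRPO group share a prompt and often repeat phrasing, drafts built from a sibling's output have a high chance of being accepted, which is what makes group-level drafting pay off for the long tail.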
Ablation results show that split rollout alone improves throughput by 35 percent over the baseline. Adding context-aware scheduling raises this to 47 percent. Enabling grouped speculative decoding lifts the overall speedup to between 77 percent and 87 percent over the baseline across the evaluated iterations.
End-to-end impact on RL training
The research team evaluates Seer on three RL tasks built on Moonlight, Qwen2-VL-72B, and Kimi K2. They run 10 rollout iterations per task and measure output tokens per second and completion time for each rollout. Seer improves rollout throughput on these workloads by 74 percent to 97 percent relative to veRL with the same RL algorithm and vLLM-based inference engines.
Tail latency is reduced by 75 percent to 93 percent. For memory-constrained tasks, the baseline system spends about half its time on the last 10 percent of requests. Seer removes most of this tail by combining split rollout, context-aware scheduling, and adaptive grouped speculative decoding on top of a Mooncake-based global KVCache pool.
Key takeaways
- Rollout bottleneck: Seer targets the rollout phase of synchronous RL, which accounts for approximately 63 percent to 87 percent of iteration time and is dominated by long-tail requests and KVCache fragmentation.
- Three main mechanisms: Seer combines split rollout, context-aware scheduling, and adaptive grouped speculative decoding to exploit output-length and pattern similarity among GRPO responses that share a prompt.
- Fine-grained scheduling on a global KVCache: Requests are split into chunks and migrated through a Mooncake-style global KVCache pool, preserving synchronous on-policy RL while keeping GPU memory utilization high and reducing preemption.
- Online signals for tail latency reduction: Group-level length statistics from speculative requests drive context-aware scheduling that approximates an oracle longest-first scheduler and sharply reduces the time spent on the last 10 percent of requests.
- Measured end-to-end gains: On production-grade RL workloads with Moonlight, Qwen2-VL-72B, and Kimi K2, Seer improves rollout throughput by 74 percent to 97 percent and reduces long-tail latency by 75 percent to 93 percent relative to a state-of-the-art synchronous vLLM-based baseline.
Seer is a notable systems contribution because it optimizes the rollout phase of synchronous RL without changing the underlying GRPO algorithm, preserving on-policy guarantees and reproducibility while attacking the real infrastructure bottleneck. The combination of split rollout, context-aware scheduling, and adaptive grouped speculative decoding offers a practical template for other RL stacks that rely on long chain-of-thought reasoning models and large KVCache footprints. Overall, Seer shows that online context learning at the systems level is now as important as model architecture for scaling reasoning RL efficiently.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of Marktechpost, an Artificial Intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views.