Accelerating Reinforcement Learning: NVIDIA’s Speculative Decoding Delivers Double-Digit Speedups

NVIDIA researchers have successfully embedded speculative decoding directly into the reinforcement learning (RL) post-training pipeline, significantly cutting down rollout generation times without compromising model fidelity. The breakthrough is now integrated into NeMo RL v0.6.0, offering lossless acceleration for large language models tasked with complex reasoning and code generation.

The Rollout Generation Bottleneck

In synchronous RL workflows, the rollout generation phase typically consumes 65% to 72% of the total training step duration. Log-probability recalculation and policy optimization make up the remainder, so rollout generation is the primary constraint. NVIDIA’s analysis of Qwen3-8B across two distinct workloads (RL-Zero, which builds reasoning capabilities from the ground up, and RL-Think, which extends existing reasoning skills) confirmed that accelerating this specific stage is the most effective way to improve overall training throughput.

How Speculative Decoding Works in NeMo RL

Rather than relying on off-policy data or lower-precision calculations, the research team leveraged speculative decoding to maintain the exact autoregressive output distribution of the target model. A compact draft model, powered by the EAGLE-3 framework, proposes multiple tokens simultaneously. The primary verifier model then validates these tokens through rejection sampling. Because the sampling process is mathematically designed to mirror the target model’s native generation, the training signal remains completely unaltered.
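The verification step described above can be illustrated with the standard speculative-sampling acceptance rule. The following is a minimal sketch, not NeMo RL's implementation: the function names, the dictionary-based probability tables, and the single-sequence setup are all assumptions made for illustration. Each drafted token is accepted with probability min(1, p/q), where p is the verifier's probability and q is the draft's; on the first rejection a replacement is drawn from the normalized residual max(0, p - q), which is what keeps the emitted tokens distributed exactly as the verifier would have generated them.

```python
import random


def sample(dist, rng):
    """Draw one token from a dict of token -> (possibly unnormalized) weight."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights)[0]


def verify_draft(draft_tokens, draft_probs, target_probs, rng=random):
    """Verify one block of k drafted tokens (illustrative helper).

    draft_tokens: k token ids proposed by the draft model.
    draft_probs:  k dicts (token -> prob) the draft sampled from (q).
    target_probs: k + 1 dicts (token -> prob) from the verifier (p);
                  the last one supplies a bonus token if every draft is accepted.
    Returns the emitted tokens: between 1 and k + 1 of them.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i].get(tok, 0.0)
        q = draft_probs[i][tok]  # > 0, since the draft actually sampled tok
        if rng.random() < min(1.0, p / q):
            emitted.append(tok)  # accepted: keep the drafted token
            continue
        # Rejected: resample from the residual max(0, p - q), then stop.
        residual = {
            t: max(0.0, pt - draft_probs[i].get(t, 0.0))
            for t, pt in target_probs[i].items()
        }
        emitted.append(sample(residual, rng))
        return emitted
    # Every drafted token was accepted: draw one extra token from the verifier.
    emitted.append(sample(target_probs[len(draft_tokens)], rng))
    return emitted
```

In this sketch a poor draft only lowers how many tokens are accepted per verification pass; it does not change which tokens are ultimately emitted in distribution, which is why the article can describe the approach as lossless.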

To manage the complexity of continuously updating the draft model alongside a shifting policy, NeMo RL employs a dual-path architecture. It supports both the EAGLE-3 drafting framework for standard models and native multi-token prediction heads. When online adaptation is active, gradient-detached pathways reuse hidden states from the verifier to train the draft head, ensuring policy gradients remain untouched. The latest software release also ships with support for the SGLang backend, the Muon optimizer, and YaRN long-context training.

Performance Gains at the 8B Scale

Benchmarks conducted across 32 GB200 GPUs (organized into eight NVL72 nodes with four GPUs each) demonstrated substantial improvements. For the RL-Zero workload, generation latency dropped from 100 seconds to 56.6 seconds, a speedup of roughly 1.77×. The RL-Think workload saw a reduction from 133.6 seconds to 87.0 seconds, yielding a 1.54× speedup. Since downstream computation times remained constant, these generation gains translated to overall step accelerations of 1.41× and 1.35×, respectively.
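The relationship between generation speedup and whole-step speedup is an instance of Amdahl's law. The sketch below is an illustration only; the function names are hypothetical, and it assumes (as the article states) that the non-generation part of each step is unchanged. It also backs out the generation share implied by the reported latencies and step speedups.

```python
def step_speedup(gen_fraction, gen_speedup):
    """Amdahl's law: only the generation share of the step gets faster."""
    return 1.0 / ((1.0 - gen_fraction) + gen_fraction / gen_speedup)


def implied_gen_fraction(gen_before, gen_after, reported_step_speedup):
    """Generation share of the original step implied by the reported numbers.

    Assumes everything outside generation takes the same time before and after,
    so step_before / (step_before - saved) == reported_step_speedup.
    """
    saved = gen_before - gen_after
    step_before = saved * reported_step_speedup / (reported_step_speedup - 1.0)
    return gen_before / step_before


if __name__ == "__main__":
    # Latencies (seconds) and step speedups as reported in the article.
    for name, before, after, step in [
        ("RL-Zero", 100.0, 56.6, 1.41),
        ("RL-Think", 133.6, 87.0, 1.35),
    ]:
        frac = implied_gen_fraction(before, after, step)
        print(f"{name}: generation is about {frac:.0%} of the baseline step")
```

Because the reported speedups are rounded, the implied shares are approximate; they land at roughly two thirds to three quarters of the step, in the neighborhood of the 65% to 72% range quoted earlier.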

Crucially, validation accuracy on the AIME-2024 benchmark followed the same trajectory under speculative and standard autoregressive decoding, supporting the lossless claim. By contrast, a model-free n-gram drafting baseline proved counterproductive: it ran at only 0.7× and 0.5× of baseline generation speed (a net slowdown) despite achieving acceptance lengths of 2.47 and 2.05. An acceptance length above one, in other words, is not enough on its own to overcome verification overhead.
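One way to read the n-gram numbers is to ask how expensive each speculative step must have been. The sketch below is back-of-the-envelope arithmetic, not a figure from the study: it assumes that "acceptance length" means the average number of tokens emitted per verification step, and the function name is hypothetical. Under that assumption, speedup equals acceptance length divided by the cost of one speculative step measured in plain decode steps, so the cost can be recovered by division.

```python
def relative_step_cost(acceptance_length, speedup):
    """Cost of one speculative step, in units of one plain decode step.

    Follows from: speedup = acceptance_length / relative_step_cost
    (assumes acceptance_length is tokens emitted per verification step).
    """
    return acceptance_length / speedup


if __name__ == "__main__":
    # (acceptance length, speedup) pairs reported for the n-gram baseline.
    for tau, s in [(2.47, 0.7), (2.05, 0.5)]:
        cost = relative_step_cost(tau, s)
        print(f"acceptance length {tau}: each step costs about {cost:.1f} decode steps")
```

Speculation only pays off when this relative cost stays below the acceptance length; under the stated assumption, the n-gram baseline's steps cost roughly 3.5 to 4 plain decode steps while yielding only about 2 to 2.5 tokens each.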

Critical Configuration Variables

The study identifies three operational parameters that dictate real-world performance. First, how well the draft's initialization matches the training domain matters more than its general drafting capability. A draft model fine-tuned on the DAPO post-training dataset achieved a 1.77× generation speedup on RL-Zero, outperforming a draft initialized on broad conversational datasets like UltraChat and Magpie, which only reached 1.51×.

Second, draft length requires careful calibration. At a length of k=3, the system achieved 1.77× speedup on RL-Zero and 1.53× on RL-Think. Extending the draft to k=5 reduced performance to 1.44× and 0.84×, while k=7 further dropped results to 1.21× and 0.71×. Harder reasoning tasks generate longer, more complex traces that are difficult for drafts to predict, causing verification overhead to outweigh benefits.
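The diminishing returns of longer drafts can be seen in the textbook cost model for speculative decoding. The sketch below is not the study's model: it assumes each drafted token is accepted independently with probability alpha, that one draft step costs a fixed fraction c of a verifier step, and that verifying k + 1 tokens costs the same as one verifier step. The names alpha, c, and both functions are introduced here for illustration. The last assumption is optimistic for large RL batches, which is one reason measured speedups can fall below 1× while this simple model cannot show that for a good draft.

```python
def expected_tokens(alpha, k):
    """Expected tokens emitted per verification step with draft length k.

    Geometric acceptance model: 1 + alpha + ... + alpha**k.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def modeled_speedup(alpha, k, c):
    """Tokens per step divided by step cost (k draft steps plus one verify)."""
    return expected_tokens(alpha, k) / (k * c + 1.0)


if __name__ == "__main__":
    alpha, c = 0.7, 0.15  # illustrative values, not measurements from the study
    for k in range(1, 9):
        print(f"k={k}: modeled speedup {modeled_speedup(alpha, k, c):.2f}x")
```

Each extra drafted token adds a fixed cost but contributes only alpha**k expected tokens, so the modeled speedup peaks at a moderate k and then declines. A lower alpha, as one would expect on harder reasoning traces, moves that peak toward shorter drafts.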

Third, online draft adaptation proves most valuable when starting with a weaker initialization. While DAPO-initialized drafts saw negligible differences between offline and online updates (1.77× vs. 1.78×), UltraChat-initialized drafts improved from 1.51× to 1.63× when updated online.

Projected Scaling to 235B Parameters

Testing speculative decoding alongside asynchronous execution on a 16-node setup (12 generation, 4 training nodes) with a policy lag of 1 revealed complementary benefits. Asynchronous execution already masks much of the rollout cost, but speculation shrank the exposed generation time from 10.4 seconds to 0.6 seconds per step, lowering total step duration from 75.0 seconds to 60.5 seconds (a 1.24× gain).

Simulator projections for Qwen3-235B-A22B running on 512 GB200 GPUs indicate a 2.72× rollout speedup and 1.70× end-to-end acceleration when using a draft length of k=3. Under optimal asynchronous conditions across 2048 GB200 GPUs with a policy lag of 2, rollout speeds could reach approximately 3.5×, driving a projected 2.5× improvement in total training time. The research team notes that speculation reduces individual rollout costs while asynchronous overlap conceals the remaining generation time behind training and log-probability computation.

Availability and Implementation

The methodology is now accessible in NeMo RL v0.6.0 under the Apache 2.0 license. Researchers can review the full technical breakdown in the associated arXiv paper (2604.26779) and access the code repository to implement these optimizations across their own reinforcement learning workflows.

