What would you build if you could run reinforcement learning (RL) post-training on a 32B LLM in 4-bit NVFP4, on a single H100, with BF16-level accuracy and a 1.2-1.5× step speedup? NVIDIA researchers (with collaborators from MIT, HKU, and Tsinghua) have open-sourced QeRL (Quantization-enhanced Reinforcement Learning), a training framework that runs RL post-training with 4-bit FP4 (NVFP4) weights while keeping the gradient math in higher precision through LoRA. The research team reports a >1.5× speedup in the rollout phase, ~1.8× end-to-end vs QLoRA in one setting, and the first demonstration of RL training of a 32B policy on a single H100-80GB GPU.

What does QeRL change in the RL loop?
Most RLHF/GRPO/DAPO pipelines spend the bulk of their wall-clock time in the rollout phase (token generation). QeRL moves the policy's weight path to NVFP4 (FP4) with dual-level scaling and keeps logits and gradients in higher precision via LoRA, so backpropagation remains stable while the sampling path runs through hardware-efficient FP4×BF16 kernels (Marlin). The result is faster prefill and decoding during rollouts without maintaining a separate full-precision policy.
Mechanically, the research team integrates Marlin-based FP4 kernels into both rollout and prefill, while LoRA keeps the set of trainable parameters small. This directly targets the phase that dominates RL cost and latency for long reasoning traces.
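To make that weight-path/update-path split concrete, here is a minimal sketch assuming a PyTorch-style module: the frozen base weight is stored in a 4-bit fake-quantized form (a stand-in for NVFP4, since the real pipeline dequantizes inside fused Marlin kernels), while a small LoRA branch stays in higher precision and receives all gradients. The class name, group size, and rank are illustrative, not QeRL's API.

```python
# Minimal sketch (not QeRL's actual code): a weight-only "fake-quantized" linear
# layer with a trainable higher-precision LoRA branch. A real QeRL setup would
# dequantize NVFP4 weights inside a fused Marlin kernel; here symmetric 4-bit
# integer fake-quantization stands in for FP4 so the example stays self-contained.
import torch
import torch.nn as nn

class QuantLinearWithLoRA(nn.Module):
    def __init__(self, in_features, out_features, rank=16, group_size=128):
        super().__init__()
        w = torch.randn(out_features, in_features) * 0.02
        # Frozen base weight, stored here as a per-group 4-bit fake quantization.
        self.register_buffer("w_q", torch.empty_like(w))
        self._fake_quantize_(w, group_size)
        # Trainable low-rank adapters kept in higher precision (BF16/FP32).
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))

    def _fake_quantize_(self, w, group_size):
        # Symmetric per-group quantization to 4-bit levels (stand-in for NVFP4).
        out_f, in_f = w.shape
        w_groups = w.view(out_f, in_f // group_size, group_size)
        scale = (w_groups.abs().amax(dim=-1, keepdim=True) / 7.0).clamp_min(1e-8)
        q = torch.clamp(torch.round(w_groups / scale), -8, 7)
        self.w_q.copy_((q * scale).view(out_f, in_f))

    def forward(self, x):
        # Quantized (frozen) path + high-precision LoRA path.
        base = x @ self.w_q.t()
        lora = (x @ self.lora_a.t()) @ self.lora_b.t()
        return base + lora

layer = QuantLinearWithLoRA(1024, 1024)
y = layer(torch.randn(2, 1024))
print(y.shape)  # torch.Size([2, 1024])
```

During RL, only the LoRA matrices receive gradient updates; the quantized base weights are what the rollout engine samples from, which is where the throughput and memory savings come from.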

Quantization as schedulable exploration
A notable empirical finding: deterministic FP4 quantization raises policy entropy, flattening the token distribution early in training and improving exploration versus 16-bit LoRA and NF4-based QLoRA baselines. To control that effect over time, QeRL introduces Adaptive Quantization Noise (AQN): channel-wise Gaussian perturbations mapped into the LayerNorm scale parameters and annealed with an exponential schedule. Because the noise lives in the normalization scales rather than in extra weight tensors, kernel fusion is preserved during the transition from exploration to exploitation.
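One way to realize the AQN idea is to fold a resampled, annealed Gaussian vector into the normalization scale that feeds the next quantized linear layer. The sketch below, with hypothetical names and an RMSNorm standing in for whatever norm a given model uses, shows the exponential schedule and the multiplicative merge; it is an illustration of the mechanism, not QeRL's implementation.

```python
# A minimal sketch of the AQN idea (hypothetical parameter names, not QeRL's API):
# channel-wise Gaussian noise is folded into an RMSNorm scale vector, so the
# perturbation reaches the following quantized linear layer without creating
# any extra weight tensors, and its magnitude is annealed exponentially.
import torch
import torch.nn as nn

class RMSNormWithAQN(nn.Module):
    def __init__(self, hidden_size, eps=1e-6,
                 sigma_start=1e-2, sigma_end=1e-4, total_steps=1000):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))  # the usual learned scale
        self.eps = eps
        self.sigma_start, self.sigma_end = sigma_start, sigma_end
        self.total_steps = total_steps
        self.register_buffer("noise", torch.zeros(hidden_size))

    def resample_noise(self, step):
        # Exponential decay of the noise std from sigma_start to sigma_end.
        frac = min(step, self.total_steps) / self.total_steps
        sigma = self.sigma_start * (self.sigma_end / self.sigma_start) ** frac
        self.noise = torch.randn_like(self.noise) * sigma

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        # Noise is merged multiplicatively into the scale, i.e. an effective
        # per-channel perturbation of the input to the downstream (quantized) layer.
        return x * rms * (self.weight * (1.0 + self.noise))

norm = RMSNormWithAQN(hidden_size=64)
norm.resample_noise(step=0)      # large noise early -> more exploration
early = norm(torch.randn(2, 64))
norm.resample_noise(step=1000)   # tiny noise late -> exploitation
late = norm(torch.randn(2, 64))
```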

In ablations, QeRL shows faster reward growth and higher final scores on math-reasoning tasks under both GRPO and DAPO, supporting the hypothesis that structured parameter-space noise can be a useful exploration driver in RL, even though such noise is usually harmful in supervised fine-tuning.
Reported results
On Qwen2.5 backbone models, the research team shows that NVFP4 + LoRA beats vanilla LoRA and QLoRA on rollout throughput and overall training time, with >2× rollout throughput over QLoRA at 14B/32B and ~1.8× end-to-end speedup vs QLoRA in a representative setup. They also demonstrate training of a 32B policy with GRPO on a single H100-80GB, enabled by the lower memory footprint of weight-only FP4.
Accuracy remains competitive with higher-precision baselines. For a 7B model, the research team reports GSM8K = 90.8% and MATH500 = 77.4%, surpassing 16-bit LoRA and QLoRA under their setup and matching full-parameter fine-tuning. On broader math benchmarks (e.g., BigMath), QeRL maintains parity or gains while converging faster thanks to better exploration.

What it is, and what it isn't
QeRL is weight-only FP4 with LoRA updates; it makes no claim of FP4 precision for logits or gradients. The benefits center on rollout/prefill throughput and memory footprint, with empirical evidence that quantization-induced entropy helps RL exploration when AQN controls it over the course of training. How well this generalizes beyond math-reasoning tasks, for example to safety or tool-use settings, depends on reward design and sequence length.
Key takeaways
- QeRL combines NVFP4 4-bit weight quantization with LoRA to accelerate the rollout phase and reduce memory footprint, enabling RL for a 32B LLM on a single H100-80GB.
- Quantization serves as exploration: FP4 raises policy entropy, while Adaptive Quantization Noise (AQN) schedules channel-wise noise through the LayerNorm scales.
- Reported efficiency: >1.5× rollout speedup vs 16-bit LoRA and ~1.8× end-to-end vs QLoRA; >2× rollout throughput vs QLoRA in 14B/32B setups.
- Accuracy is maintained: Qwen2.5-7B reaches 90.8% on GSM8K and 77.4% on MATH500, matching full-parameter fine-tuning under the paper's setup.
- NVFP4 is a hardware-optimized 4-bit floating-point format with two-level scaling (FP8 E4M3 block scalers + FP32 tensor scalers), enabling efficient Marlin-based kernels (see the sketch after this list).
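For intuition on how the two scale levels compose, the following rough numeric sketch encodes a tensor with per-16-element-block scales cast to FP8 E4M3 plus a single FP32 tensor scale, then dequantizes it. The block size and the E2M1 value grid follow NVIDIA's published description of NVFP4, but this is an illustration of the format's arithmetic with made-up helper names, not a kernel implementation.

```python
# A rough numeric illustration (not a kernel) of NVFP4's two-level scaling:
# 4-bit E2M1 elements, one FP8 (E4M3) scale per 16-element block, and one
# FP32 scale per tensor. The encode/decode below shows how the scales compose.
import torch

E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def quantize_nvfp4_like(w: torch.Tensor, block_size: int = 16):
    # Assumes w.numel() is divisible by block_size.
    w_blocks = w.reshape(-1, block_size)
    # Second-level scale: one FP32 scalar for the whole tensor
    # (448 = max normal value of FP8 E4M3, 6 = max E2M1 magnitude).
    tensor_scale = w.abs().max() / (6.0 * 448.0)
    # First-level scales: one per block, stored in FP8 E4M3.
    block_scale = (w_blocks.abs().amax(dim=1, keepdim=True) / 6.0) / tensor_scale
    block_scale = block_scale.to(torch.float8_e4m3fn).to(torch.float32)
    # Snap each element to the nearest E2M1 magnitude (sign handled separately).
    normed = w_blocks / (block_scale * tensor_scale + 1e-12)
    idx = (normed.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    q = E2M1_GRID[idx] * normed.sign()
    return q, block_scale, tensor_scale

def dequantize(q, block_scale, tensor_scale, shape):
    return (q * block_scale * tensor_scale).reshape(shape)

w = torch.randn(4, 64)
q, bs, ts = quantize_nvfp4_like(w)
w_hat = dequantize(q, bs, ts, w.shape)
print((w - w_hat).abs().max())  # reconstruction error of the two-level scheme
```

The point of the two levels is that cheap FP8 block scalers track local dynamic range while a single FP32 scalar anchors the whole tensor, which is what lets fused kernels such as Marlin dequantize on the fly at low cost.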
QeRL accelerates the RL rollout phase. It quantizes the weights to NVFP4 and keeps updates and logits in higher precision using LoRA. It reports >1.5× rollout speedups and can train 32B policies on a single H100-80GB GPU. It adds adaptive quantization noise as a controllable exploration signal during training. Results are shown primarily on math-reasoning tasks using GRPO and DAPO, and the benefit depends on NVFP4 kernel support such as Marlin.
Check out the Paper and the full code. Feel free to visit our GitHub Page for tutorials, code, and notebooks. Also, follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our Newsletter. Wait! Are you on Telegram? Now you can connect with us on Telegram as well.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of Marktechpost, an Artificial Intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is technically sound yet easily understandable to a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.