NVIDIA has released Nemotron-Nano-3-30B-A3B-NVFP4, a production checkpoint that runs a 30B parameter reasoning model in the 4-bit NVFP4 format while keeping accuracy close to the BF16 baseline. The model combines a hybrid Mamba2-Transformer mixture-of-experts architecture with a Quantization Aware Distillation (QAD) recipe designed specifically for NVFP4 deployment. Overall, this is a highly efficient NVFP4-precision version of Nemotron-3-Nano that delivers up to 4x higher throughput on the Blackwell B200.

What is Nemotron-Nano-3-30B-A3B-NVFP4?
Nemotron-Nano-3-30B-A3B-NVFP4 is a quantized version of Nemotron-3-Nano-30B-A3B-BF16, trained from the ground up by the NVIDIA team as a unified reasoning and chat model. It is designed as a hybrid Mamba2-Transformer MoE network:
- 30B total parameters
- 52 layers deep
- 23 Mamba2 and MoE layers
- 6 grouped query attention layers with 2 groups
- each MoE layer has 128 routed experts and 1 shared expert
- 6 experts are active per token, which gives approximately 3.5B active parameters per token (a minimal routing sketch follows this list)
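To make the routing concrete, here is a minimal PyTorch sketch of the top-6-of-128 routed experts plus one shared expert pattern described above. The hidden sizes, module names and gating details are illustrative assumptions, not the released model's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    # Illustrative dimensions only; the real checkpoint's sizes differ.
    def __init__(self, d_model=256, d_ff=512, n_experts=128, top_k=6):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # One shared expert processes every token in addition to the routed ones.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )
        self.top_k = top_k

    def forward(self, x):                                  # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)       # top-6 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            tok, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if tok.numel():
                routed[tok] += weights[tok, slot, None] * expert(x[tok])
        return self.shared_expert(x) + routed

moe = SparseMoELayer()
print(moe(torch.randn(4, 256)).shape)                      # torch.Size([4, 256])
```

Only the selected experts run for each token, which is why roughly 3.5B of the 30B parameters are active per token.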
The model is pre-trained on 25T tokens using a warmup-stable-decay learning rate schedule with a batch size of 3072, a peak learning rate of 1e-3 and a minimum learning rate of 1e-5.
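As a quick illustration, a warmup-stable-decay schedule with the reported peak and minimum learning rates looks like the sketch below. The phase lengths are made-up placeholders, since they are not specified here.

```python
def wsd_lr(step, warmup=2_000, stable=100_000, decay=20_000,
           peak=1e-3, minimum=1e-5):
    # Warmup-stable-decay: ramp up, hold at the peak, then decay to the minimum.
    if step < warmup:
        return peak * step / warmup                  # linear warmup
    if step < warmup + stable:
        return peak                                  # stable phase at the peak LR
    t = min((step - warmup - stable) / decay, 1.0)
    return peak + (minimum - peak) * t               # linear decay to the minimum LR

print(wsd_lr(1_000), wsd_lr(50_000), wsd_lr(122_000))
```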
After pre-training, a 3-stage post-training pipeline follows:
- Supervised fine-tuning on synthetic and curated data for code, math, science, tool calling, instruction following, and structured output.
- Reinforcement learning with multi-step tool usage, multi-turn chat and RLHF, using synchronous GRPO and a generative reward model in a structured environment.
- Post-training quantization to NVFP4 with an FP8 KV cache and a selective high-precision layout, followed by QAD.
The NVFP4 checkpoint keeps the attention layers and the Mamba2 layers that feed them in BF16, quantizes the remaining layers to NVFP4, and uses FP8 for the KV cache.
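A rough way to picture this selective layout is a per-layer precision map like the sketch below. The layer names and the helper function are hypothetical, purely to illustrate the rule described above.

```python
# Hypothetical per-layer precision map; module names are illustrative,
# not the checkpoint's real ones.
def precision_for(layer_name: str, feeds_attention: bool) -> str:
    if "attention" in layer_name:
        return "bf16"                     # attention layers stay in BF16
    if "mamba" in layer_name and feeds_attention:
        return "bf16"                     # Mamba2 layers feeding attention stay in BF16
    return "nvfp4"                        # everything else is quantized to NVFP4

KV_CACHE_DTYPE = "fp8"                    # KV cache is stored in FP8

layout = {
    "layers.4.mamba2": precision_for("layers.4.mamba2", feeds_attention=True),
    "layers.5.attention": precision_for("layers.5.attention", feeds_attention=False),
    "layers.6.moe": precision_for("layers.6.moe", feeds_attention=False),
}
print(layout, "| kv_cache:", KV_CACHE_DTYPE)
```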
NVFP4 format and why it matters
NVFP4 is a 4-bit floating point format designed for both training and inference on recent NVIDIA GPUs. Main features of NVFP4:
- Compared to FP8, NVFP4 delivers 2 to 3 times higher arithmetic throughput.
- It reduces memory usage by roughly 1.8 times for weights and activations.
- It extends MXFP4 by reducing the block size from 32 to 16 and introducing two-level scaling.
The two-level scaling uses an E4M3 FP8 scale per block and an FP32 scale per tensor. The small block size lets the quantizer adapt to local statistics, and the dual scaling increases dynamic range while keeping quantization error low.
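The snippet below is a minimal NumPy sketch of this two-level scaling idea: 16-element blocks, a per-block scale (which NVFP4 would store in E4M3 FP8), a per-tensor FP32 scale, and values snapped to a 4-bit E2M1 grid. It is an illustration of the scheme, not NVIDIA's kernel implementation, and the E4M3 rounding of block scales is omitted for brevity.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
BLOCK = 16                                                      # NVFP4 block size
E4M3_MAX = 448.0                                                # max value of an E4M3 scale

def fake_quantize_nvfp4(x: np.ndarray) -> np.ndarray:
    x = x.astype(np.float32).ravel()
    pad = (-len(x)) % BLOCK
    blocks = np.pad(x, (0, pad)).reshape(-1, BLOCK)

    # Level 1: one FP32 scale per tensor, chosen so block scales fit in the E4M3 range.
    tensor_scale = np.abs(blocks).max() / (FP4_GRID[-1] * E4M3_MAX) + 1e-12
    # Level 2: one scale per 16-element block (stored as E4M3 FP8 in real NVFP4).
    block_scale = np.abs(blocks).max(axis=1, keepdims=True) / (FP4_GRID[-1] * tensor_scale) + 1e-12

    # Snap each scaled value to the nearest point on the 4-bit E2M1 grid.
    scaled = blocks / (block_scale * tensor_scale)
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(-1)
    q = np.sign(scaled) * FP4_GRID[idx]

    # Dequantize so we can compare against the original values.
    return (q * block_scale * tensor_scale).ravel()[: len(x)]

x = np.random.randn(64).astype(np.float32)
print("mean abs error:", np.abs(fake_quantize_nvfp4(x) - x).mean())
```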
For very large LLMs, simple post-training quantization (PTQ) to NVFP4 already gives good accuracy across benchmarks. For smaller models, especially those with heavy post-training pipelines, the research team notes that PTQ causes non-negligible accuracy degradation, which motivates a training-based recovery method.
From QAT to QAD
Standard Quantization Aware Training (QAT) inserts fake quantization into the forward pass and reuses the original loss function, such as next-token cross entropy (a minimal fake-quantization sketch follows the list below). This works well for convolutional networks, but the research team lists 2 main issues for modern LLMs:
- Complex multi-stage post-training pipelines with SFT, RL and model merging are hard to reproduce.
- The original training data for open models is often unavailable.
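For reference, QAT-style fake quantization with a straight-through estimator looks roughly like this. The uniform 4-bit grid below is a simplified placeholder rather than the exact NVFP4 quantizer.

```python
import torch

def fake_quantize(w: torch.Tensor, n_levels: int = 16) -> torch.Tensor:
    # Simplified uniform 4-bit fake quantization with a straight-through estimator:
    # the forward pass sees quantized weights, gradients flow through unchanged.
    scale = w.abs().max() / (n_levels // 2 - 1) + 1e-12
    w_q = torch.round(w / scale).clamp(-(n_levels // 2), n_levels // 2 - 1) * scale
    return w + (w_q - w).detach()

# In QAT the model is then trained with its usual loss, e.g. next-token cross
# entropy, using these fake-quantized weights in every forward pass.
```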
Quantization Aware Distillation (QAD) changes the objective instead of reproducing the full pipeline. A frozen BF16 model acts as the teacher and the NVFP4 model is the student. Training minimizes the KL divergence between their output token distributions, not the original supervised or RL objective.
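In code, the QAD objective reduces to a distillation loss between the two models' token distributions, roughly as in the sketch below. Here `teacher` and `student` are hypothetical model handles and the temperature is an assumption; this is not NVIDIA's training code.

```python
import torch
import torch.nn.functional as F

def qad_loss(student_logits, teacher_logits, temperature: float = 1.0):
    # KL(teacher || student) over the vocabulary, averaged over tokens.
    vocab = student_logits.size(-1)
    s = F.log_softmax(student_logits.reshape(-1, vocab) / temperature, dim=-1)
    t = F.log_softmax(teacher_logits.reshape(-1, vocab) / temperature, dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature ** 2

# Illustrative training step:
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits   # frozen BF16 teacher
# student_logits = student(input_ids).logits       # forward pass with fake NVFP4 quant
# loss = qad_loss(student_logits, teacher_logits)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```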
The research team highlighted 3 properties of QAD:
- It aligns the quantized model with the higher precision teacher more accurately than QAT.
- It remains stable even when the teacher has already gone through multiple stages, such as supervised fine tuning, reinforcement learning, and model merging, because QAD only tries to match the final teacher behavior.
- It works with partial, synthetic or filtered data, as it requires only the input text to query the teacher and student, not the original labels or reward models.
Benchmarks on Nemotron-3-Nano-30B
Nemotron-3-Nano-30B-A3B is one of the RL-heavy models evaluated in the QAD research. The table below shows accuracy on AA-LCR, AIME25, GPQA-D, LiveCodeBench-v5 and SciCode for BF16, NVFP4-PTQ, NVFP4-QAT and NVFP4-QAD.

Key takeaways
- Nemotron-3-Nano-30B-A3B-NVFP4 is a 30B parameter hybrid Mamba2-Transformer MoE model that runs in 4-bit NVFP4 with an FP8 KV cache and a small set of BF16 layers preserved for stability, while keeping about 3.5B active parameters per token and supporting context windows up to 1M tokens.
- NVFP4 is a 4-bit floating point format with block size 16 and two levels of scaling, using one E4M3 FP8 scale per block and one FP32 scale per tensor, which gives about 2 to 3 times higher arithmetic throughput and about 1.8 times lower memory cost than FP8 for weights and activations.
- Quantization Aware Distillation (QAD) replaces the original loss function with KL divergence against a frozen BF16 teacher, so the NVFP4 student directly matches the teacher's output distribution without re-running the full SFT, RL and model-merging pipelines or requiring the original reward model.
- Using the new Quantization Aware Distillation method, the NVFP4 version achieves 99.4% of the BF16 model's accuracy.
- On AA-LCR, AIME25, GPQA-D, LiveCodeBench and SciCode, NVFP4-PTQ shows noticeable accuracy loss and NVFP4-QAT is even worse, while NVFP4-QAD brings performance back close to BF16 levels, narrowing the gap to just a few points on these reasoning and coding benchmarks.
Check out the paper and model weights.