NVIDIA has released Nemotron-Labs-TwoTower, a diffusion language model built on a pre-trained autoregressive backbone. It ships with open weights under the NVIDIA Nemotron Open Model License. The release targets the throughput bottleneck in text generation.
Autoregressive (AR) models decode one token at a time. That sequential process limits generation throughput. Discrete diffusion language models take another approach: they generate tokens in parallel and refine them iteratively.
Most diffusion language models use one network for two jobs: representing clean tokens and denoising corrupted tokens at every step. TwoTower separates these jobs into two towers. NVIDIA reports that it retains 98.7% of the AR baseline's overall benchmark quality at 2.42× higher wall-clock generation throughput.
TL;DR
- TwoTower splits diffusion into a frozen AR reference tower and a trained denoiser tower.
- It retains 98.7% of AR quality at 2.42× throughput (γ=0.8, S=16, 2×H100).
- The denoiser was trained on ~2.1T tokens; the backbone was pretrained on 25T.
- One checkpoint runs diffusion, mock-AR, and AR decoding modes.
Nemotron-Labs-TwoTower
TwoTower is a block-wise autoregressive diffusion model. It is built on Nemotron-3-Nano-30B-A3B, an open-weight hybrid backbone. That backbone combines Mamba-2, self-attention, and Mixture-of-Experts (MoE) layers.
Each tower has 52 layers: 23 Mamba-2, 6 self-attention, and 23 MoE. The released checkpoint ships both towers, for approximately 60B total parameters. Active parameters per token are approximately 3B per tower. The MoE uses 128 routed experts, of which 6 are active, plus 2 shared experts.
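The figures above can be cross-checked with a few lines of arithmetic. This is a back-of-envelope sketch: the per-tower total of 30B parameters is an assumption inferred from the "30B" in the backbone name, and the memory estimate counts BF16 weights only (2 bytes per parameter), not caches or activations.

```python
# Layer counts as quoted in the article.
LAYERS = {"mamba2": 23, "self_attention": 6, "moe": 23}

PARAMS_PER_TOWER = 30e9    # assumption: approximate, taken from the backbone name
BYTES_PER_PARAM_BF16 = 2   # BF16 stores each weight in 16 bits

layers_per_tower = sum(LAYERS.values())
total_params = 2 * PARAMS_PER_TOWER
weights_gb_per_tower = PARAMS_PER_TOWER * BYTES_PER_PARAM_BF16 / 1e9

print(layers_per_tower)        # 52
print(total_params / 1e9)      # 60.0
print(weights_gb_per_tower)    # 60.0
```

Under these assumptions, one tower's weights come to roughly 60 GB, which is in the same range as the ~59 GB per GPU the article reports for the two-GPU diffusion setup.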
Both towers start out as copies of the same backbone checkpoint. Only the denoiser tower is trained; the AR reference tower remains frozen. The denoiser was trained on ~2.1T tokens, a fraction of the backbone's 25T-token pretraining.
How the two towers work
The AR reference tower processes the committed tokens. This produces the per-layer KV cache and the final Mamba-2 state, and it preserves the autoregressive ability of the backbone.
The diffusion denoiser tower refines noisy blocks. Within a block, it uses bidirectional attention. It remains causal with respect to previous clean blocks.
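The attention pattern just described (bidirectional inside a block, causal across blocks) can be written as a mask over positions. The sketch below is illustrative only: the function name is hypothetical, and it lays all blocks out in one sequence, which may differ from how the released model organizes clean and noisy tokens.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    A query sees every position in its own block (bidirectional) and
    every position in earlier blocks (causal), but nothing in later blocks.
    """
    return [
        [(k // block_size) <= (q // block_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = block_causal_mask(seq_len=4, block_size=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# 1100
# 1100
# 1111
# 1111
```

Position 0 can see position 1 (same block, even though it comes later) but not positions 2 and 3, which belong to the next block.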
The towers are connected layer by layer. Denoiser layer i cross-attends to reference tower layer i. This layer-aligned cross-attention gives the denoiser multi-level access to the backbone's representations. Prior approaches broadcast only the last hidden state.
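To make the contrast concrete, here is a toy sketch of layer-aligned wiring versus reading only the final layer. It uses single-head dot-product attention with no learned projections; all function names are hypothetical, and the real model attends over the reference tower's cached keys and values rather than raw vectors.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, memory):
    """Each query vector takes a softmax-weighted average of the memory vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, mem)) / math.sqrt(dim) for mem in memory]
        weights = softmax(scores)
        outputs.append([sum(w * mem[d] for w, mem in zip(weights, memory)) for d in range(dim)])
    return outputs


def layer_aligned_context(denoiser_states, reference_states):
    """Denoiser layer i reads reference layer i."""
    assert len(denoiser_states) == len(reference_states)
    return [cross_attend(d, r) for d, r in zip(denoiser_states, reference_states)]


def last_layer_context(denoiser_states, reference_states):
    """Contrast case: every denoiser layer reads only the final reference layer."""
    return [cross_attend(d, reference_states[-1]) for d in denoiser_states]


# Two layers, one query token and one memory token per layer.
denoiser = [[[0.5, 0.5]], [[0.5, 0.5]]]
reference = [[[1.0, 0.0]], [[0.0, 1.0]]]
print(layer_aligned_context(denoiser, reference))  # [[[1.0, 0.0]], [[0.0, 1.0]]]
print(last_layer_context(denoiser, reference))     # [[[0.0, 1.0]], [[0.0, 1.0]]]
```

With layer-aligned wiring, each denoiser layer receives a different signal; with last-layer broadcasting, all layers receive the same one.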
Two more denoiser modifications matter. The denoiser's Mamba-2 layers take their initial state from the reference tower's Mamba state. The diffusion timestep conditions each layer through adaLN-single time conditioning. That adaLN module adds only ~1.5M parameters.
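The article does not describe the adaLN module beyond its name and size. As a rough illustration of the general adaLN-single idea (one shared timestep signal, computed once per step, plus a small per-layer offset), here is a minimal sketch. Every function here is hypothetical, and the sinusoid stands in for what would be a learned projection.

```python
import math


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def shared_time_signal(t, dim):
    """Hypothetical stand-in for the single shared timestep network.

    Returns (shift, scale) vectors, computed once and reused by all layers.
    """
    shift = [math.sin(t * (d + 1)) for d in range(dim)]
    scale = [math.cos(t * (d + 1)) - 1.0 for d in range(dim)]
    return shift, scale


def adaln_modulate(x, shift, scale, layer_offset):
    """Normalize x, then apply the shared signal plus this layer's own offset."""
    layer_shift, layer_scale = layer_offset
    normed = layer_norm(x)
    return [
        n * (1.0 + s + ls) + b + lb
        for n, s, ls, b, lb in zip(normed, scale, layer_scale, shift, layer_shift)
    ]


x = [1.0, 2.0, 3.0, 4.0]
shift, scale = shared_time_signal(t=0.5, dim=len(x))
layer_0 = ([0.0] * 4, [0.0] * 4)   # per-layer offsets would be small learned tables
layer_1 = ([0.1] * 4, [0.0] * 4)
print(adaln_modulate(x, shift, scale, layer_0))
print(adaln_modulate(x, shift, scale, layer_1))
```

Because the timestep signal is shared and each layer adds only a small offset, the parameter cost stays low, which is consistent with the ~1.5M figure quoted above.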
Generation proceeds block by block. Each block starts as S [MASK] tokens. The denoiser refines it over T steps, then commits it. The reference tower then processes the committed tokens to update its cache.
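The block-by-block procedure can be sketched as a toy loop with confidence-threshold unmasking. This is not the released implementation: the denoiser is replaced by a random stand-in, and two behaviors are assumptions made for the sketch (committing at least one token per step, and committing everything that remains on the final step).

```python
import random

MASK = -1  # hypothetical mask id for this toy example


def toy_denoiser(block, rng):
    """Stand-in for the denoiser tower: a (token, confidence) proposal per position."""
    return [(rng.randrange(100), rng.random()) for _ in block]


def denoise_block(block_size, max_steps, threshold, rng):
    block = [MASK] * block_size
    steps = 0
    while MASK in block and steps < max_steps:
        proposals = toy_denoiser(block, rng)
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if steps == max_steps - 1:
            chosen = masked  # assumption: last step commits whatever is left
        else:
            chosen = [i for i in masked if proposals[i][1] >= threshold]
            if not chosen:
                # assumption: always commit the most confident position
                chosen = [max(masked, key=lambda i: proposals[i][1])]
        for i in chosen:
            block[i] = proposals[i][0]
        steps += 1
    return block, steps


def generate(num_blocks, block_size=16, max_steps=16, threshold=0.8, seed=0):
    rng = random.Random(seed)
    committed, total_steps = [], 0
    for _ in range(num_blocks):
        block, steps = denoise_block(block_size, max_steps, threshold, rng)
        committed.extend(block)  # the reference tower would update its cache here
        total_steps += steps
    return committed, total_steps


tokens, steps = generate(num_blocks=4)
print(len(tokens), steps)
```

Lowering `threshold` lets more positions clear the bar per step, so each block finishes in fewer steps.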
This explains why multiple denoising steps can still outpace one-token-at-a-time decoding. Autoregressive decoding yields exactly one token per step. TwoTower can commit several tokens per step, particularly early in refinement.
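A quick way to see the break-even point is to count tokens per step. The helper below is a hypothetical illustration that counts steps only; it ignores that a denoiser step and an AR step do not cost the same, so it is not a wall-clock prediction.

```python
def tokens_per_step(block_size: int, denoise_steps: int) -> float:
    """Average tokens committed per denoiser step for one block."""
    return block_size / denoise_steps


AR_TOKENS_PER_STEP = 1.0  # autoregressive decoding emits one token per step

for steps in (16, 8, 4):
    print(steps, tokens_per_step(16, steps))
# 16 1.0
# 8 2.0
# 4 4.0
```

With S=16, a block that needs all 16 steps is no better than AR on this count; any block that finishes sooner commits more than one token per step.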
Benchmarks
Evaluations use BF16 on 2×H100 GPUs. The default operating point uses confidence-based unmasking with threshold γ=0.8 and block size S=16. The table compares the AR baseline with TwoTower diffusion decoding.
| Task | Nemotron-3-Nano-30B-A3B (AR) | Nemotron-Labs-TwoTower (Diffusion) |
|---|---|---|
| MMLU (5-shot, acc) | 78.56 | 78.24 |
| MMLU-Pro (5-shot, CoT EM) | 62.59 | 60.93 |
| ARC-Challenge (25-shot, acc_norm) | 91.72 | 92.66 |
| Winogrande (5-shot, acc) | 76.09 | 76.09 |
| RACE (0-shot, acc) | 88.90 | 88.90 |
| HumanEval (0-shot) | 79.27 | 75.58 |
| MBPP-Sanitized (3-shot) | 74.71 | 74.28 |
| GSM8K (8-shot, acc) | 92.49 | 90.14 |
| MATH-500 (4-shot) | 84.40 | 80.60 |
| MMLU Global Lite (5-shot) | 73.97 | 73.94 |
| MGSM (8-shot, average acc) | 80.80 | 80.40 |
| Quality retained | 100% | 98.7% |
| Generation throughput (×AR) | 1.0× | 2.42× |
Commonsense results stay within about one point of the AR baseline. Code and math show modest degradation. General knowledge and multilingual scores are largely preserved, and in some cases slightly improved. Lowering γ commits more tokens per step, which raises throughput at the cost of quality.
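The aggregate "quality retained" figure can be sanity-checked against the per-task rows. The article does not say how the aggregate is computed, so the sketch below tries two plausible definitions; both are assumptions, not NVIDIA's stated method.

```python
# (AR baseline, TwoTower diffusion) scores copied from the table.
SCORES = {
    "MMLU": (78.56, 78.24),
    "MMLU-Pro": (62.59, 60.93),
    "ARC-Challenge": (91.72, 92.66),
    "Winogrande": (76.09, 76.09),
    "RACE": (88.90, 88.90),
    "HumanEval": (79.27, 75.58),
    "MBPP": (74.71, 74.28),
    "GSM8K": (92.49, 90.14),
    "MATH-500": (84.40, 80.60),
    "MMLU Global Lite": (73.97, 73.94),
    "MGSM": (80.80, 80.40),
}

# Definition 1: average the per-task ratios.
mean_of_ratios = 100 * sum(d / ar for ar, d in SCORES.values()) / len(SCORES)

# Definition 2: ratio of the summed scores.
ratio_of_sums = 100 * sum(d for _, d in SCORES.values()) / sum(ar for ar, _ in SCORES.values())

print(f"mean of per-task ratios: {mean_of_ratios:.2f}%")
print(f"ratio of summed scores:  {ratio_of_sums:.2f}%")

# Largest per-task drops, in points.
drops = sorted(((ar - d, name) for name, (ar, d) in SCORES.items()), reverse=True)
print(drops[:3])
```

Both definitions land close to the reported 98.7%, and the three largest drops are MATH-500, HumanEval, and GSM8K, matching the observation that code and math degrade most.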
Running it: three generation modes
The checkpoint exposes three inference paths. Full two-tower diffusion uses 2 GPUs, at approximately 59 GB per GPU in BF16. The AR-only mode runs on a single 80 GB GPU.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True,
)

# context tower -> GPU 0, denoiser tower -> GPU 1
model.place_towers_on_devices("cuda:0", "cuda:1")
model.eval()

prompt = "France is a country "
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate_mask_diffusion(
    inputs["input_ids"], max_new_tokens=128,
    block_size=16, steps_per_block=16, mask_token_id=3,
    temperature=0.1, confidence_threshold=0.8,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
The three modes are generate_mask_diffusion(), generate_mock_ar(), and generate_ar(). Mask diffusion commits up to block_size tokens per step. Mock-AR and AR emit one token per step.
Where it fits: use cases
The most direct use case is fast batch generation. A data team producing synthetic text may trade a slight drop in quality for throughput. At γ=0.8, that trade is 1.3% of quality for 2.42× throughput.
The second use case is tuning the quality-throughput trade-off. According to NVIDIA's paper, raising γ preserves more quality, while lowering γ commits more tokens per step for speed.
The third use case is drop-in customization. The reference tower keeps its own LM head, which can be used for speculative decoding, verification, or AR scoring. Teams can run AR and diffusion from a single checkpoint.
Strengths and weaknesses
Strengths:
- Open weights under the NVIDIA Nemotron Open Model License; ready for commercial use
- Retains 98.7% of AR quality at 2.42× throughput at the default operating point
- One checkpoint supports diffusion, mock-AR, and AR decoding
- Denoiser trained on ~2.1T tokens, not a full re-pretrain
- Cache memory scales with sequence length like the AR baseline
Weaknesses:
- Full two-tower diffusion in BF16 requires 2 GPUs at ~59 GB per GPU
- Code and math degrade more than general knowledge (HumanEval 79.27 → 75.58)
- Keeping both towers resident increases the fixed model-weight memory footprint
- The released checkpoint is a base model, without instruction tuning or alignment
- Throughput above 3× comes with a larger quality loss