NVIDIA Releases Nemotron-Labs-TwoTower: an Open-Weight Diffusion Language Model Built on a Frozen Autoregressive Nemotron-3-Nano-30B-A3B Backbone

NVIDIA has released Nemotron-Labs-TwoTower, a diffusion language model built on a pre-trained autoregressive backbone. It ships as an open-weight model under the NVIDIA Nemotron Open Model License. The release targets the throughput bottleneck in text generation.

Autoregressive (AR) models decode one token at a time. That sequential generation process limits throughput. Discrete diffusion language models take another approach: they generate tokens in parallel and refine them iteratively.

Most diffusion language models use one network for two tasks: it must represent clean tokens and denoise corrupted tokens at every step. TwoTower separates these jobs into two towers. It retains 98.7% of the AR baseline's aggregate benchmark quality while reporting 2.42× higher wall-clock generation throughput.

TL;DR

TwoTower splits diffusion into a frozen AR reference tower and a trained denoiser tower.
It retains 98.7% of AR quality at 2.42× throughput (γ=0.8, S=16, 2×H100).
The denoiser was trained on ~2.1T tokens; the backbone was pretrained on 25T.
One checkpoint runs diffusion, mock-AR, and AR decoding modes.

Nemotron-Labs-TwoTower

TwoTower is a block-wise autoregressive diffusion model. It is built on Nemotron-3-Nano-30B-A3B, an open-weight hybrid backbone. That backbone combines Mamba-2, self-attention, and Mixture-of-Experts (MoE) layers.

Each tower has 52 layers: 23 Mamba-2, 6 self-attention, and 23 MoE. The released checkpoint ships both towers, for approximately 60B total parameters. Active parameters per token are approximately 3B per tower. Each MoE layer uses 128 routed experts, of which 6 are active per token, plus 2 shared experts.
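
To make the expert counts concrete, here is a minimal sketch of top-k routing for a single token. Only the counts (128 routed, 6 active, 2 shared) come from the article. The router logits are random, and the softmax over the selected experts is an assumption for illustration; the real router may normalize differently.

```python
import math
import random

NUM_ROUTED_EXPERTS = 128  # from the article
TOP_K = 6                 # routed experts active per token (from the article)
NUM_SHARED_EXPERTS = 2    # experts that run for every token (from the article)


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k routed experts for one token and normalize their weights.

    Softmax over the selected logits is an illustrative assumption.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[e] for e in chosen)
    exp_scores = [math.exp(router_logits[e] - peak) for e in chosen]
    total = sum(exp_scores)
    return {e: s / total for e, s in zip(chosen, exp_scores)}


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
weights = route_token(logits)

experts_run = len(weights) + NUM_SHARED_EXPERTS
experts_total = NUM_ROUTED_EXPERTS + NUM_SHARED_EXPERTS
print(f"experts run for this token: {experts_run} of {experts_total}")
```

Only 8 of 130 experts run per token in each MoE layer, which is why the active parameter count (~3B per tower) is so much smaller than the total.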

Both towers start out as copies of the same backbone checkpoint. Only the denoiser tower is trained; the AR reference tower remains frozen. The denoiser was trained on ~2.1T tokens, a fraction of the backbone's 25T-token pretraining.

How the two towers work

The AR reference tower runs causally over committed tokens. It produces the per-layer KV caches and the final Mamba-2 states. This preserves the autoregressive ability of the backbone.

The diffusion denoiser tower refines noisy blocks. Within a block, it uses bidirectional attention. It remains causal with respect to previous clean blocks.
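
The attention pattern described here (bidirectional inside a block, causal across blocks) can be sketched as a boolean mask. This is a simplified single-sequence view: in TwoTower the earlier clean blocks are reached through the reference tower's caches, and the function name below is hypothetical.

```python
def block_causal_mask(seq_len, block_size):
    """Return mask[i][j] == True when position i may attend to position j.

    A position sees every position in its own block (bidirectional) and
    every position in earlier blocks (causal), but nothing in later blocks.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            row.append(j // block_size <= i // block_size)
        mask.append(row)
    return mask


mask = block_causal_mask(seq_len=6, block_size=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xxx...
# xxx...
# xxx...
# xxxxxx
# xxxxxx
# xxxxxx
```

The first block sees only itself, including positions to the right within the block. The second block sees itself and all of the first.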

The towers are connected layer by layer. Denoiser layer i cross-attends to reference tower layer i. This layer-aligned cross-attention provides multi-level access to the backbone's representations. Earlier approaches broadcast only the last hidden state.

Two more denoiser modifications matter. The Mamba-2 layers obtain their initial state from the Mamba state of the reference tower. The diffusion timestep controls each layer through adaLN-single time conditioning. That adaLN module only adds ~1.5M parameters.
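
As a rough illustration of timestep conditioning through adaptive layer norm, the sketch below normalizes a hidden vector and then applies a shift and scale that depend on the diffusion timestep. It assumes the general adaLN-single pattern of one shared timestep projection plus a small per-layer offset. The projection, its constants, and all function names are hypothetical stand-ins, not the model's actual module.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain layer normalization without learned gain or bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def shared_time_projection(t, dim):
    """Hypothetical stand-in for the single shared network that maps a
    diffusion timestep t in [0, 1] to a (shift, scale) pair."""
    return [0.1 * t] * dim, [0.5 * t] * dim


def adaln_single(x, t, layer_offset):
    """Normalize x, then modulate it with timestep-dependent shift and scale.

    layer_offset is a per-layer (shift, scale) correction added to the shared
    projection, so each layer only needs a small table of its own.
    """
    shift, scale = shared_time_projection(t, len(x))
    off_shift, off_scale = layer_offset
    return [
        n * (1.0 + s + o_s) + b + o_b
        for n, s, o_s, b, o_b in zip(layer_norm(x), scale, off_scale, shift, off_shift)
    ]


hidden = [1.0, 2.0, 3.0, 4.0]
zero_offset = ([0.0] * 4, [0.0] * 4)
clean = adaln_single(hidden, t=0.0, layer_offset=zero_offset)
noisy = adaln_single(hidden, t=1.0, layer_offset=zero_offset)
print(clean == layer_norm(hidden))  # True: no modulation at t=0 with zero offsets
```

Sharing one projection across all layers is what keeps the added parameter count small relative to the backbone.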

Generation proceeds block by block. Each block starts as S [MASK] tokens. The denoiser refines it over T steps, then commits it. The reference tower then processes the committed tokens to update its cache.
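
The refinement loop for one block can be sketched with a toy denoiser. Everything below is illustrative: the denoiser returns random confidences, and the rule of always unmasking the single most confident slot when nothing clears the threshold is an assumption added so that each step makes progress.

```python
import random

MASK = None  # stand-in for the [MASK] token


def toy_denoiser(block, rng):
    """Hypothetical denoiser: a (token, confidence) guess for every masked slot."""
    return {i: (f"tok{i}", rng.random())
            for i, tok in enumerate(block) if tok is MASK}


def denoise_block(block_size, threshold, rng):
    """Refine one block of [MASK] tokens until every slot is committed."""
    block = [MASK] * block_size
    steps = 0
    while any(tok is MASK for tok in block):
        guesses = toy_denoiser(block, rng)
        # Confidence unmasking: accept every slot at or above the threshold.
        accepted = [i for i, (_, conf) in guesses.items() if conf >= threshold]
        if not accepted:
            # Assumption: fall back to the single most confident slot.
            accepted = [max(guesses, key=lambda i: guesses[i][1])]
        for i in accepted:
            block[i] = guesses[i][0]
        steps += 1
    return block, steps


rng = random.Random(0)
block, steps = denoise_block(block_size=16, threshold=0.8, rng=rng)
print(f"committed {len(block)} tokens in {steps} denoising steps")
```

Because at least one slot is filled per step, a block of 16 never needs more than 16 steps. Any step that accepts more than one slot is where the saving over one-token decoding comes from.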

This explains why multiple denoising steps can still outpace one-token decoding. Autoregressive decoding yields exactly one token per step. TwoTower can commit several tokens per step, starting early in refinement.
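
A simplified count of forward passes shows the effect. The four denoising steps per block below are a hypothetical figure, not a measurement. Passes through the two towers do not cost the same, so this ratio is not a speedup estimate.

```python
import math


def forward_passes_ar(num_tokens):
    """Autoregressive decoding: one forward pass per generated token."""
    return num_tokens


def forward_passes_two_tower(num_tokens, block_size, denoise_steps_per_block):
    """Block-wise diffusion: denoiser steps per block, plus one
    reference-tower pass to absorb each committed block."""
    blocks = math.ceil(num_tokens / block_size)
    return blocks * (denoise_steps_per_block + 1)


ar = forward_passes_ar(128)
tt = forward_passes_two_tower(128, block_size=16, denoise_steps_per_block=4)
print(ar, tt)  # 128 40
```

With S=16, the block-wise scheme needs fewer passes than AR whenever the average number of denoising steps per block stays below 15 in this model.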

Benchmarks

Evaluations use BF16 on 2×H100 GPUs. The default operating point is confidence unmasking with threshold γ=0.8 and block size S=16. The table compares the AR baseline with TwoTower diffusion decoding.

Task	Nemotron-3-Nano-30B-A3B (AR)	Nemotron-Labs-TwoTower (Diffusion)
MMLU (5-shot, acc)	78.56	78.24
MMLU-Pro (5-shot, CoT EM)	62.59	60.93
ARC-Challenge (25-shot, acc_norm)	91.72	92.66
Winogrande (5-shot, acc)	76.09	76.09
RACE (0-shot, acc)	88.90	88.90
HumanEval (0-shot)	79.27	75.58
MBPP-Sanitized (3-shot)	74.71	74.28
GSM8K (8-shot, acc)	92.49	90.14
MATH-500 (4-shot)	84.40	80.60
Global-MMLU-Lite (5-shot)	73.97	73.94
MGSM (8-shot, avg acc)	80.80	80.40
Quality retained	100%	98.7%
Generation throughput (×AR)	1.0×	2.42×

Commonsense scores match or slightly exceed the AR baseline, with ARC-Challenge up 0.94 points. General knowledge and multilingual scores stay within about two points. Code and math show the largest drops: 3.80 points on MATH-500 and 3.69 on HumanEval. Lowering γ commits more tokens per step, which raises throughput at the cost of quality.
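
The 98.7% figure can be reproduced from the table. The article does not state how the aggregate is computed; the sketch below assumes it is the ratio of the two unweighted averages over the eleven benchmarks, which happens to match the reported number.

```python
# (AR baseline, TwoTower diffusion) scores copied from the table above.
scores = {
    "MMLU": (78.56, 78.24),
    "MMLU-Pro": (62.59, 60.93),
    "ARC-Challenge": (91.72, 92.66),
    "Winogrande": (76.09, 76.09),
    "RACE": (88.90, 88.90),
    "HumanEval": (79.27, 75.58),
    "MBPP": (74.71, 74.28),
    "GSM8K": (92.49, 90.14),
    "MATH-500": (84.40, 80.60),
    "Global MMLU Lite": (73.97, 73.94),
    "MGSM": (80.80, 80.40),
}

ar_mean = sum(ar for ar, _ in scores.values()) / len(scores)
diffusion_mean = sum(d for _, d in scores.values()) / len(scores)

# Assumption: "quality retained" is the ratio of the two averages.
retained = 100 * diffusion_mean / ar_mean
print(f"{retained:.1f}%")  # 98.7%
```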

Running it: three generation modes

The checkpoint exposes three inference paths. Full two-tower diffusion uses 2 GPUs, at approximately 59 GB per GPU in BF16. The AR-only mode runs on a single 80 GB GPU.
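
A back-of-the-envelope check of these memory figures, counting weights only. The parameter counts are the article's approximations (~60B for both towers, so roughly 30B for one), and caches and activations are ignored, so this is an estimate rather than a measurement.

```python
BYTES_PER_BF16 = 2  # BF16 stores each parameter in two bytes
GB = 1e9


def weight_memory_gb(num_params):
    """Memory needed to hold the weights alone, in gigabytes."""
    return num_params * BYTES_PER_BF16 / GB


two_tower_total = weight_memory_gb(60e9)  # ~60B parameters across both towers
per_gpu = two_tower_total / 2             # one tower per GPU
ar_only = weight_memory_gb(30e9)          # assumes ~30B parameters for one tower

print(per_gpu, ar_only)  # 60.0 60.0
```

Roughly 60 GB of weights per tower is in line with the reported ~59 GB per GPU, and it explains why a single tower fits on one 80 GB card while both towers together do not.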

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True,
)
# context tower -> GPU 0, denoiser tower -> GPU 1
model.place_towers_on_devices("cuda:0", "cuda:1")
model.eval()

# Tokenize the prompt and move it to the first tower's device.
prompt = "France is a country "
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

# Block-wise mask diffusion at the default operating point (S=16, γ=0.8).
outputs = model.generate_mask_diffusion(
    inputs["input_ids"], max_new_tokens=128,
    block_size=16, steps_per_block=16, mask_token_id=3,
    temperature=0.1, confidence_threshold=0.8,
    eos_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

The three modes are generate_mask_diffusion(), generate_mock_ar(), and generate_ar(). Mask diffusion commits up to block_size tokens per step. Mock-AR and AR emit one token per step.

Where it fits: Use cases

The most direct use case is fast batch generation. A data team producing synthetic text may trade a slight drop in quality for throughput. At γ=0.8, that trade is 1.3% of quality for 2.42× throughput.

The second use case is tuning the quality-throughput trade-off. According to NVIDIA's paper, increasing γ preserves more quality. Decreasing γ commits more tokens per step for speed.

The third use case is drop-in customization. The reference tower keeps its own LM head, which can be used for speculative decoding, verification, or AR scoring. Teams can run AR and diffusion from a single checkpoint.

Strengths and weaknesses

Strengths:

Open weights under the NVIDIA Nemotron Open Model License; ready for commercial use
98.7% of AR quality is retained at 2.42× throughput at the default operating point
One checkpoint supports diffusion, mock-AR, and AR decoding
Denoiser trained on ~2.1T tokens, not a full re-pretrain
Cache memory scales with sequence length like the AR baseline

Weaknesses:

Full two-tower diffusion in BF16 requires 2 GPUs at ~59 GB per GPU
Code and math degrade more than general knowledge (HumanEval 79.27 → 75.58)
Keeping both towers resident increases the fixed model-weight memory footprint
The released checkpoint is a base model, without instruction tuning or alignment
Throughput above 3× comes with a large quality loss
