Google AI Releases DiffusionGemma, a 26B MoE Open Model Using Text Diffusion for Up to 4x Faster Generation


The Google AI team, including Google DeepMind researchers, recently released DiffusionGemma, an experimental open model for text generation. The model uses text diffusion instead of autoregressive decoding. It is released under the permissive Apache 2.0 license. Google positions it for developers and researchers who need speed-critical, interactive local workflows. Examples include in-line editing, rapid iteration, and generating non-linear text structures.

Most language models in use today are autoregressive. They generate one token at a time from left to right. Each new token depends on the one before it. DiffusionGemma works differently. It generates entire blocks of text simultaneously, in parallel. On dedicated GPUs, this provides up to 4 times faster generation.


What is DiffusionGemma?

DiffusionGemma is a 26B Mixture-of-Experts (MoE) model. It activates only 3.8B parameters during inference. It is built on the Gemma 4 backbone, specifically the 26B-A4B architecture. Google added a diffusion head on top of that base.

The model is multimodal. It processes interleaved text, image and video inputs. It produces text output from those inputs. The context window is 256K tokens, and it supports 140+ languages.

Quantized, the model fits within 18GB of VRAM. This puts it within the high-end consumer GPU range. On an NVIDIA H100, it reaches 1000+ tokens per second. On the NVIDIA GeForce RTX 5090, it reaches 700+ tokens per second.
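
As a rough sanity check on the 18GB figure, the weight footprint can be estimated from the parameter count and the bits stored per weight. This is a back-of-the-envelope sketch, not a measurement: the 4-bit width is an assumption about the quantization format, and real checkpoints add overhead for scaling factors, activations, and the KV cache.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 26e9    # total parameters, from the article
ACTIVE_PARAMS = 3.8e9  # parameters active per token, from the article

weights_4bit = weight_footprint_gb(TOTAL_PARAMS, 4)    # assumed 4-bit weights
weights_16bit = weight_footprint_gb(TOTAL_PARAMS, 16)  # 16-bit weights, for comparison
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"4-bit weights:  {weights_4bit:.1f} GB")
print(f"16-bit weights: {weights_16bit:.1f} GB")
print(f"active fraction per token: {active_fraction:.1%}")
```

Under these assumptions the weights alone take about 13 GB, which leaves some headroom inside an 18GB budget for the cache and activations.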

Google is upfront about the trade-offs. DiffusionGemma prioritizes speed and parallel layout generation. Its overall output quality is lower than that of standard Gemma 4. For production work that needs maximum quality, Google still recommends autoregressive Gemma 4.

How does text diffusion work?

Text diffusion borrows its core idea from AI image generators. Those models start from random static and iteratively refine it into an image. DiffusionGemma applies a similar pattern to text generation.

The process takes place in three conceptual stages. First, the model starts with a canvas of random placeholder tokens. Second, it passes over that canvas several times, locking in high-confidence tokens and using them as references. Third, the text converges into the final output.

Google calls the main mechanism Uniform State Diffusion. High-confidence tokens help resolve neighboring positions during denoising. The entire sequence then comes into focus over several passes.
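
The three stages can be illustrated with a toy loop. This is a conceptual sketch only, not Google's sampler: the stand-in "model" already knows the target sentence and assigns random confidences, the canvas uses a single mask symbol rather than random placeholder tokens, and `toy_denoise` and `tokens_per_pass` are hypothetical names chosen for this illustration.

```python
import random

MASK = "_"

def toy_denoise(target, tokens_per_pass=3, seed=0):
    """Toy confidence-based parallel decoding (not the real sampler).

    A stand-in "model" proposes a token and a confidence for every masked
    position; the highest-confidence proposals are locked in on each pass.
    """
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    passes = 0
    while MASK in canvas:
        # One "forward pass": score every still-masked position in parallel.
        proposals = [
            (rng.random(), i, target[i])  # (confidence, position, token)
            for i, tok in enumerate(canvas) if tok == MASK
        ]
        # Lock only the most confident proposals; the rest stay masked.
        for _, i, tok in sorted(proposals, reverse=True)[:tokens_per_pass]:
            canvas[i] = tok
        passes += 1
        print(f"pass {passes}: {' '.join(canvas)}")
    return canvas, passes

target = "the quick brown fox jumps over the lazy dog".split()
final, passes = toy_denoise(target)
```

Tokens are filled in no fixed order, and the nine-token canvas resolves in three passes instead of nine.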

In practice, the model denoises a 256-token canvas in parallel. It finalizes approximately 15-20 tokens per forward pass. That parallelism is what drives the throughput gain.
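
The effect of those numbers on pass count is simple arithmetic. The sketch below assumes a constant finalization rate per pass, which the real sampler does not guarantee; `passes_needed` is a helper name introduced here.

```python
import math

CANVAS_TOKENS = 256  # canvas size, from the article

def passes_needed(tokens: int, finalized_per_pass: int) -> int:
    """Forward passes to finalize `tokens`, assuming a constant rate per pass."""
    return math.ceil(tokens / finalized_per_pass)

autoregressive = passes_needed(CANVAS_TOKENS, 1)   # one token per pass
diffusion_low = passes_needed(CANVAS_TOKENS, 15)   # slower end of the quoted range
diffusion_high = passes_needed(CANVAS_TOKENS, 20)  # faster end of the quoted range

print(autoregressive, diffusion_low, diffusion_high)
```

Fewer passes do not translate one-to-one into wall-clock speedup, because each diffusion pass processes the whole canvas and does more work than a single autoregressive step. That is consistent with the quoted gain of up to 4 times rather than a much larger factor.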

The model uses bidirectional attention during denoising. Every token on the canvas can attend to every other token. This is a sharp break from autoregressive models, which can only look backward at prior tokens.
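
The difference between the two attention patterns is easiest to see as a mask. The following is a minimal sketch over a four-token sequence; the function names are illustrative and are not taken from any DiffusionGemma code.

```python
def causal_mask(n: int) -> list[list[int]]:
    """mask[i][j] == 1 means position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n: int) -> list[list[int]]:
    """Every position may attend to every other position."""
    return [[1] * n for _ in range(n)]

n = 4
for name, mask in [("causal", causal_mask(n)), ("bidirectional", bidirectional_mask(n))]:
    print(name)
    for row in mask:
        print(" ", row)
```

The causal mask is lower-triangular, so a token never sees what comes after it. The bidirectional mask is all ones, so later tokens can inform earlier ones during refinement.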

That bidirectional context enables self-correction in real time. If the confidence of a token drops, the sampler can re-noise it. The model then replaces that token on subsequent passes. Autoregressive models cannot do this, because they commit each token once.
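
A re-noising step can be sketched as a simple filter over the canvas. This is an illustration under stated assumptions: the confidence values are made up, and `threshold` is a hypothetical knob, not a documented sampler parameter.

```python
MASK = "_"

def renoise(canvas, confidences, threshold=0.5):
    """Re-mask any position whose confidence has fallen below `threshold`.

    `threshold` is a hypothetical knob for this sketch.
    """
    return [tok if conf >= threshold else MASK
            for tok, conf in zip(canvas, confidences)]

canvas = ["the", "cat", "sat", "in", "the", "mat"]
# Hypothetical confidences after a later pass: "in" now looks wrong in context.
confidences = [0.98, 0.95, 0.91, 0.22, 0.97, 0.93]

print(renoise(canvas, confidences))
```

The re-masked position goes back into the pool of unresolved tokens and is predicted again on the next pass, with the surrounding tokens as context.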

Architecture

The technical advance here is hardware utilization. For local GPU inference, the main constraint is memory bandwidth. Autoregressive models reload weights from memory for every token. During single-user serving, the GPU spends most of its time waiting on memory.

DiffusionGemma shifts the bottleneck from memory bandwidth to compute. It drafts and refines the 256-token canvas in parallel. This gives the otherwise idle tensor cores a larger parallel workload.
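
A simplified bandwidth model shows why this shift matters. The sketch assumes each forward pass must stream the active weights from memory once; the 1 TB/s bandwidth is a hypothetical round number, the 4-bit weight size is an assumption, and the model ignores KV cache traffic and the fact that routing 256 tokens through an MoE touches more experts than a single token does.

```python
def max_passes_per_second(active_params: float, bytes_per_weight: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on forward passes/s if each pass streams the active weights."""
    return bandwidth_bytes_per_s / (active_params * bytes_per_weight)

ACTIVE_PARAMS = 3.8e9   # active parameters, from the article
BYTES_PER_WEIGHT = 0.5  # assumed 4-bit weights
BANDWIDTH = 1.0e12      # hypothetical 1 TB/s memory bandwidth, not a measured figure

passes = max_passes_per_second(ACTIVE_PARAMS, BYTES_PER_WEIGHT, BANDWIDTH)
ar_ceiling = passes * 1          # autoregressive: one token per pass
diffusion_ceiling = passes * 15  # diffusion: low end of 15-20 tokens per pass

print(f"bandwidth-bound passes/s:      {passes:,.0f}")
print(f"autoregressive token ceiling:  {ar_ceiling:,.0f} tokens/s")
print(f"diffusion bandwidth ceiling:   {diffusion_ceiling:,.0f} tokens/s")
```

In this model the autoregressive decoder is capped by memory traffic, while the diffusion decoder's bandwidth ceiling sits far above it. In practice, compute becomes the binding limit well before that ceiling is reached.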

The model alternates between two attention modes during inference. Prefill uses causal attention to ingest the prompt and write the KV cache. Denoising uses bidirectional attention to refine the canvas.

For longer outputs, DiffusionGemma uses block-autoregressive diffusion. Once a 256-token block is completely denoised, it is committed to the KV cache. The model then starts a new canvas conditioned on the prior history. This combines the speed of parallel blocks with the stability of sequential autoregressive generation.
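
The control flow can be sketched as an outer loop over blocks. This is a structural illustration only: `denoise_block` is a hypothetical stand-in that fabricates token names, the list used as `kv_cache` is a placeholder for real cached keys and values, and truncating the final block to the remaining length is an assumption of this sketch.

```python
BLOCK = 256  # canvas size, from the article

def denoise_block(history, size):
    """Hypothetical stand-in for the denoiser: returns `size` finished tokens.

    A real model would refine a noisy canvas with bidirectional attention,
    conditioned on `history` through the KV cache.
    """
    start = len(history)
    return [f"tok{start + i}" for i in range(size)]

def generate(prompt_tokens, max_new_tokens):
    kv_cache = list(prompt_tokens)  # prefill: the prompt is written to the cache once
    produced = 0
    while produced < max_new_tokens:
        size = min(BLOCK, max_new_tokens - produced)
        block = denoise_block(kv_cache, size)  # parallel within the block
        kv_cache.extend(block)                 # commit: the block becomes fixed history
        produced += size
    return kv_cache[len(prompt_tokens):]

out = generate(["<prompt>"] * 10, max_new_tokens=600)
print(len(out), out[0], out[-1])
```

Generation is parallel inside each block and sequential across blocks, so committed blocks are never revisited.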

The architecture is similar to that of Gemma 4 26B-A4B. Developers mainly need to implement a denoising step. This simplifies integration into existing serving infrastructure.

A clear example is the Sudoku showcase from Google’s developer guide. Autoregressive models struggle with tightly constrained, multi-variable puzzles. The base DiffusionGemma model solves about 0% of Sudoku puzzles. After a simple JAX supervised fine-tuning recipe, accuracy rises to 80%. The fine-tuned model also stops early, cutting down the number of inference steps.

Interactive Demo: How DiffusionGemma Decodes in Parallel

The interactive visualizer below shows how DiffusionGemma decodes text compared to standard autoregressive models. Toggle between the two modes and press Run. In autoregressive mode, tokens fill in one at a time, strictly left to right, taking one forward pass per token, which is how most LLMs generate today. In diffusion mode, the model starts with a canvas of masked placeholder tokens and resolves many of them in parallel on each pass, in no fixed order, converging in far fewer passes. The animation also shows a brief re-noise phase, where a low-confidence token is reset and refined again. This is a stand-in for the real model's self-correction, which autoregressive decoding cannot do once a token is committed. Note that this is a conceptual animation, not live model output: the actual DiffusionGemma denoises a 256-token canvas and finalizes about 15-20 tokens per forward pass.


Use Cases

DiffusionGemma targets specific workloads, not general production quality. Google and ecosystem partners highlight several practical applications:

In-line Editing and Code Infilling: Bidirectional attention suits non-linear text structures.
Rapid Iteration: Low local latency supports interactive, single-user developer loops.
Long-Context Document Analysis: The 256K context window supports large inputs.
OCR and Document Parsing: Handles multimodal inputs such as images and scanned documents.
Code Generation, Tool Calling, and Agentic Workflows: Unsloth lists these as supported capabilities.
Constrained Generation: Sudoku, mathematical graphs, and amino acid sequences benefit from parallel attention.

One caveat shapes all of this. The speedup is designed for local, low-concurrency inference. In high-QPS cloud serving, autoregressive models already saturate compute efficiently. There, parallel decoding provides diminishing returns and may increase serving costs.

https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/

DiffusionGemma vs Standard Gemma 4

Property	DiffusionGemma (26B-A4B)	Standard Gemma 4 (26B-A4B)
Generation method	Discrete text diffusion (parallel)	Autoregressive (token-by-token)
Decode bottleneck	Compute-bound	Memory-bandwidth-bound
Parallel unit	256-token canvas per pass	One token per step
Attention during decode	Bidirectional	Causal (backward only)
Self-correction	Yes, via re-noising	No, tokens are committed once
Speed on dedicated GPU	Up to 4x faster	Baseline
H100 throughput	1000+ tokens/second	Lower (baseline)
RTX 5090 throughput	700+ tokens/second	Lower (baseline)
Output quality	Below Gemma 4	Higher; recommended for production
Best fit	Local, low-concurrency, interactive	High-quality and high-QPS cloud serving
License	Apache 2.0	Gemma Terms

Key Takeaways

DiffusionGemma is a 26B MoE open model (3.8B active) that generates text through parallel diffusion, not token-by-token.
It runs up to 4x faster on dedicated GPUs: 1000+ tokens/sec on H100, 700+ on RTX 5090.
Bidirectional attention on a 256-token canvas enables self-correction in real time, unlike autoregressive models.
Quantized, it fits within 18GB of VRAM, with day-zero support in vLLM, Transformers, MLX, and Unsloth.
It is experimental and lower in quality than standard Gemma 4; Google recommends Gemma 4 for production.

MarkTechPost’s Visual Explainer

Open Model Apache 2.0

DiffusionGemma: A Visual Guide

Google DeepMind’s 26B Open Text Diffusion Model – What it is and how it works.

What is DiffusionGemma?

An experimental open model that generates text through diffusion, not token-by-token.

26B Mixture-of-Experts (MoE) model that activates only 3.8B parameters during inference.
Built on the Gemma 4 backbone (26B-A4B), paired with a diffusion head.
Multimodal input (text, image, and video), generating text output.
256K context window, 140+ languages, released under Apache 2.0.

The Core Idea

Most LLMs are autoregressive. DiffusionGemma takes a different approach.

Autoregressive models generate tokens one at a time, from left to right.
Each new token depends on the one before it.
DiffusionGemma generates entire blocks of text simultaneously, in parallel.
On a dedicated GPU, this results in up to 4x faster generation.

How does text diffusion work?

It borrows from image diffusion: start with noise, then iteratively refine.

1. Canvas: The model starts with random placeholder tokens.

2. Iterative Refinement: It locks in high-confidence tokens, using them as references.

3. Final Polish: The text converges into the final output.

Google calls the mechanism Uniform State Diffusion.
It finalizes ~15-20 tokens per forward pass on a 256-token canvas.

Architecture

The win is hardware utilization on the local GPU.

Shifts the bottleneck from memory bandwidth to compute.
Prefill uses causal attention to write the KV cache.
Denoising uses bidirectional attention to refine the canvas.
Block-autoregressive diffusion handles sequences longer than 256 tokens.
Bidirectional context enables self-correction in real time through re-noising.

Performance and Footprint

Throughput numbers and hardware limits from Google.

1000+ tokens/second on a single NVIDIA H100.
700+ tokens/second on an NVIDIA GeForce RTX 5090.
Fits within 18 GB of VRAM when quantized.
Native NVFP4 (4-bit floating point) with near-lossless accuracy.
Speedup designed for local, low-concurrency inference.

DiffusionGemma vs Standard Gemma 4

Property	DiffusionGemma	Gemma 4
Generation	Diffusion (parallel)	Autoregressive
Bottleneck	Compute-bound	Memory bandwidth
Attention	Bidirectional	Causal
Self-correction	Yes (re-noising)	No
Speed (GPU)	Up to 4x faster	Baseline
Output quality	Lower	Higher (production)

Use Cases

Built for specific workloads, not for general production quality.

In-line editing and code infilling: suited to non-linear text.
Long-context analysis, OCR, and document parsing.
Code generation, tool calling, and agentic workflows.
Constrained generation: Sudoku accuracy rose from 0% to 80% after fine-tuning.

Availability and Tooling

Open weights with day-zero ecosystem support.

Weights on Hugging Face: google/diffusiongemma-26b-a4b-it.
First diffusion LLM natively supported in vLLM.
Also supported in Transformers, MLX, and Unsloth; NeMo for fine-tuning; llama.cpp support coming soon.
Deploy via Google Cloud Model Garden or NVIDIA NIM.
