The Google AI team, including Google DeepMind researchers, recently released DiffusionGemma, an experimental open model for text generation. This model uses text diffusion instead of autoregressive decoding. It is licensed under the permissive Apache 2.0 license. Google markets it to developers and researchers looking for speed-critical, interactive local workflows. Examples include in-line editing, rapid iteration, and generating non-linear text structures.
Most language models in use today are autoregressive. They generate one token at a time from left to right. Each new token depends on the ones before it. DiffusionGemma works differently. It generates entire blocks of text in parallel. On dedicated GPUs, this provides up to 4 times faster generation.
What is DiffusionGemma?
DiffusionGemma is a 26B Mixture-of-Experts (MoE) model. It activates only 3.8B parameters during inference. It is built on the Gemma 4 backbone, specifically the 26B-A4B architecture. Google integrated a diffusion head on top of that base.
The model is multimodal. It processes interleaved text, image and video inputs. It produces text output from those inputs. The context window is 256K tokens, and it supports 140+ languages.
Quantized, the model fits within 18GB of VRAM. This puts it within the high-end consumer GPU range. On an NVIDIA H100, it reaches 1000+ tokens per second. On the NVIDIA GeForce RTX 5090, it reaches 700+ tokens per second.
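As a rough sanity check on the 18GB figure, weight memory scales with parameter count times bits per weight. The sketch below is back-of-the-envelope arithmetic only: the bit-widths are illustrative assumptions (the announcement does not say which quantization the 18GB figure refers to), and KV cache and activation memory are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count, from the article

# Bit-widths are assumptions chosen for illustration.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

Under these assumptions, only the 4-bit case (about 13 GB of weights) leaves room inside an 18GB budget. Note that the total parameter count is used here, not the 3.8B active count, since all experts of a Mixture-of-Experts model normally need to be resident in memory.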
Google is straightforward about the trade-offs. DiffusionGemma prioritizes speed and parallel layout generation. Its overall output quality is lower than standard Gemma 4. For maximum-quality production work, Google still recommends autoregressive Gemma 4.
How does text diffusion work?
Text diffusion borrows its core idea from AI image generators. They start from random static and iteratively refine it. DiffusionGemma applies a similar pattern to text generation.
This process takes place in three conceptual stages. First, the model starts with a canvas of random placeholder tokens. Second, it passes over that canvas several times. It locks high-confidence tokens and uses them as references. Third, the text converges into the final output.
Google calls the main mechanism Uniform State Diffusion. High-confidence tokens help to resolve adjacent positions during denoising. The entire sequence then comes into focus over several passes.
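The three stages can be illustrated with a toy loop. This is a conceptual sketch, not DiffusionGemma's sampler: the `target` list stands in for what a real model would predict, and the confidence formula, `threshold`, and `max_passes` are all hypothetical values chosen so that locked neighbours make nearby positions easier to resolve.

```python
import random


def toy_denoise(target, threshold=0.6, max_passes=50, seed=0):
    """Toy parallel denoiser; `target` stands in for a real model's predictions."""
    rng = random.Random(seed)
    vocab = sorted(set(target))
    # Stage 1: start from a canvas of random placeholder tokens.
    canvas = [rng.choice(vocab) for _ in target]
    locked = [False] * len(target)
    passes = 0
    # Stage 2: pass over the whole canvas repeatedly.
    while not all(locked) and passes < max_passes:
        passes += 1
        to_lock = []
        for i in range(len(target)):
            if locked[i]:
                continue
            # Hypothetical confidence: random, boosted by already-locked neighbours.
            support = sum(locked[j] for j in (i - 1, i + 1) if 0 <= j < len(target))
            confidence = 0.7 * rng.random() + 0.15 * support
            if confidence >= threshold:
                to_lock.append(i)
            else:
                canvas[i] = rng.choice(vocab)  # position stays noisy
        # Lock after the scan so every position saw the same canvas state.
        for i in to_lock:
            canvas[i] = target[i]
            locked[i] = True
    # Stage 3: the canvas has converged, or the pass budget ran out.
    return canvas, passes


tokens = "the quick brown fox jumps over the lazy dog".split()
result, n_passes = toy_denoise(tokens)
print(n_passes, " ".join(result))
```

Positions are locked in no fixed order, and several can lock in the same pass, which is the behaviour the prose describes.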
In practice, the model denoises a 256-token canvas in parallel. It finalizes approximately 15-20 tokens per forward pass. That parallelism is what drives the throughput gain.
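The figures above imply a simple pass count. The arithmetic below uses only the numbers quoted in the article (a 256-token canvas and 15-20 tokens finalized per pass) and compares them with one pass per token for autoregressive decoding.

```python
import math

CANVAS_TOKENS = 256  # canvas size, from the article


def passes_needed(tokens_per_pass: int) -> int:
    """Forward passes needed to finalize the whole canvas at a fixed rate."""
    return math.ceil(CANVAS_TOKENS / tokens_per_pass)


for rate in (15, 20):
    print(f"{rate} tokens/pass -> {passes_needed(rate)} passes")
print(f"autoregressive -> {CANVAS_TOKENS} passes")
```

So a block takes roughly 13 to 18 passes instead of 256. Each diffusion pass processes the entire canvas and therefore costs more than a single autoregressive step, which is consistent with the quoted end-to-end speedup being up to 4x rather than matching the drop in pass count.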
The model uses bidirectional attention during denoising. Every token on the canvas can attend to every other token. This is a sharp break from autoregressive models. Those models can only look backward at prior tokens.
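The difference between the two attention patterns is easiest to see as a mask, where row i marks which positions token i may attend to. This is a generic illustration of causal versus bidirectional masking, not code from the model.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Token i may attend only to positions 0..i (backward-looking)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[bool]]:
    """Every token may attend to every position on the canvas."""
    return [[True] * n for _ in range(n)]


def render(mask: list[list[bool]]) -> str:
    """Draw the mask with 'x' for allowed and '.' for blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print("causal:")
print(render(causal_mask(4)))
print("bidirectional:")
print(render(bidirectional_mask(4)))
```

The causal mask is lower-triangular, while the bidirectional mask is fully filled in.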
That bidirectional context enables self-correction in real time. If the confidence of a token decreases, the sampler can re-noise it. The model replaces that token on subsequent passes. Autoregressive models cannot do this, because they commit each token once.
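A minimal sketch of the re-noising idea follows. The `PLACEHOLDER` token, the `threshold` value, and the confidence scores are hypothetical; the real sampler's criteria are not described in the article.

```python
PLACEHOLDER = "<noise>"  # hypothetical stand-in for a re-noised position


def renoise(canvas, confidences, threshold=0.5):
    """Reset every token whose confidence fell below the threshold."""
    return [
        PLACEHOLDER if conf < threshold else token
        for token, conf in zip(canvas, confidences)
    ]


canvas = ["The", "cat", "sat", "on", "teh", "mat"]
confidences = [0.98, 0.95, 0.90, 0.97, 0.31, 0.92]  # made-up scores
print(renoise(canvas, confidences))
# ['The', 'cat', 'sat', 'on', '<noise>', 'mat']
```

The reset position would then be refined again on later passes, with the surrounding tokens available as context on both sides.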
Architecture
The technical advance here is hardware utilization. For local GPU inference, the main constraint is memory bandwidth. Autoregressive models repeatedly load weights from memory for every token. During single-user serving, the GPU spends most of its time waiting.
DiffusionGemma shifts the bottleneck from memory bandwidth to compute. It drafts and refines the 256-token canvas in parallel. This gives otherwise idle tensor cores a larger parallel workload.
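The shift from memory-bound to compute-bound can be sketched with a simple roofline estimate: a forward pass takes as long as the slower of moving the weights and doing the arithmetic. Every hardware number below is an illustrative assumption rather than a published spec, the 4-bit weight size is assumed, and the 2 FLOPs per active weight figure is a common rule of thumb.

```python
def step_time(bytes_moved, flops, bandwidth, peak_flops):
    """Return (seconds, limiting_resource) for one forward pass under a roofline model."""
    memory_s = bytes_moved / bandwidth
    compute_s = flops / peak_flops
    if memory_s >= compute_s:
        return memory_s, "memory"
    return compute_s, "compute"


# Illustrative assumptions, not measured or published figures.
BANDWIDTH = 1e12         # bytes/s, hypothetical GPU
PEAK_FLOPS = 1e14        # FLOP/s, hypothetical GPU
BYTES_PER_WEIGHT = 0.5   # assumes 4-bit quantization
ACTIVE_PARAMS = 3.8e9    # active parameters per token, from the article
TOTAL_PARAMS = 26e9      # total parameters, from the article
FLOPS_PER_TOKEN = 2 * ACTIVE_PARAMS  # rule of thumb: ~2 FLOPs per active weight

# Autoregressive step: one token, reads only the active weights.
ar = step_time(ACTIVE_PARAMS * BYTES_PER_WEIGHT, FLOPS_PER_TOKEN, BANDWIDTH, PEAK_FLOPS)
# Diffusion pass: 256 tokens; pessimistically assume every expert's weights are read.
diff = step_time(TOTAL_PARAMS * BYTES_PER_WEIGHT, 256 * FLOPS_PER_TOKEN, BANDWIDTH, PEAK_FLOPS)
print(ar[1], diff[1])  # memory compute
```

With these assumed numbers, the single-token step is limited by memory traffic while the 256-token pass is limited by arithmetic, which matches the qualitative claim in the text. Real figures depend on the GPU, the quantization, and how many experts a batch of tokens actually touches.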
The model alternates between two attention modes during inference. Prefill uses causal attention to ingest the prompt and write the KV cache. Denoising uses bidirectional attention to refine the canvas.
For longer outputs, DiffusionGemma uses block-autoregressive diffusion. Once a 256-token block is completely denoised, it is committed to the KV cache. The model then starts a new canvas conditioned on the prior history. This combines parallel block speed with sequential autoregressive stability.
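The block-by-block control flow can be sketched as follows. The `denoise_block` function is a hypothetical stand-in that fabricates token names, and the "KV cache" is modelled as a plain list of committed tokens; only the outer loop structure reflects the description above.

```python
BLOCK_SIZE = 256  # canvas size, from the article


def denoise_block(history, block_len):
    """Hypothetical stand-in for the parallel denoiser: returns block_len tokens."""
    start = len(history)
    return [f"tok{start + i}" for i in range(block_len)]


def generate(prompt_tokens, num_new_tokens, block_size=BLOCK_SIZE):
    # Prefill: a causal pass writes the prompt into the cache.
    kv_cache = list(prompt_tokens)
    output = []
    while len(output) < num_new_tokens:
        block_len = min(block_size, num_new_tokens - len(output))
        # Bidirectional refinement happens inside the block, conditioned on the cache.
        block = denoise_block(kv_cache, block_len)
        # Commit the finished block; later blocks treat it as fixed history.
        kv_cache.extend(block)
        output.extend(block)
    return output


out = generate(["<bos>", "Hello"], num_new_tokens=600)
print(len(out), out[0], out[-1])  # 600 tok2 tok601
```

Generating 600 tokens here takes three blocks (256, 256, and 88 tokens), each produced in parallel internally but committed in sequence.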
The architecture is similar to that of Gemma 4 26B-A4B. Developers mainly need to implement a denoising step. This simplifies integration into existing serving infrastructure.
A clear example of this is the Sudoku showcase from Google's developer guide. Autoregressive models struggle with tightly constrained, multi-variable puzzles. The base DiffusionGemma model solves about 0% of Sudoku puzzles. Following a simple JAX supervised fine-tuning recipe, accuracy increases to 80%. The fine-tuned model also stops early, cutting the number of inference steps.
Interactive Demo: How DiffusionGemma Decodes in Parallel
The interactive visualizer below shows how DiffusionGemma decodes text compared to standard autoregressive models. Toggle between the two modes and press Run. In autoregressive mode, tokens fill in one at a time, strictly left to right, taking one forward pass per token, which is how most LLMs generate today. In diffusion mode, the model starts with a canvas of masked placeholder tokens and resolves many of them in parallel with each pass, in no fixed order, converging in very few passes. The animation also shows a brief re-noise phase, where a low-confidence token is reset and refined again. This is a stand-in for the real model's self-correction, which autoregressive decoding cannot do once a token is committed. Note that this is a conceptual animation, not live model output: the actual DiffusionGemma denoises a 256-token canvas and finalizes about 15-20 tokens per forward pass.
Use cases
DiffusionGemma targets specific workloads, not general production quality. Google and ecosystem partners highlight several practical applications:
- In-line editing and code infilling: Bidirectional attention suits non-linear text structures.
- Rapid iteration: Low local latency supports interactive, single-user developer loops.
- Long-Context Document Analysis: The 256K context window supports processing large inputs.
- OCR and Document Parsing: Handles multimodal inputs such as images and scanned documents.
- Code Generation, Tool Calling, and Agentic Workflows: Unsloth lists these as supported functions.
- Constrained Generation: Sudoku, mathematical graphs, and amino acid sequences benefit from parallel attention.
One caveat shapes all of this. The speedup is designed for local, low-concurrency inference. In high-QPS cloud serving, autoregressive models already saturate compute efficiently. There, parallel decoding provides diminishing returns and may increase serving costs.

DiffusionGemma vs Standard Gemma 4
| Property | DiffusionGemma (26B-A4B) | Standard Gemma 4 (26B A4B) |
|---|---|---|
| Generation method | Discrete text diffusion (parallel) | Autoregressive (token-by-token) |
| Decode bottleneck | Compute-bound | Memory-bandwidth-bound |
| Parallel unit | 256-token canvas per pass | One token per step |
| Attention during decode | Bidirectional | Causal (backward only) |
| Self-correction | Yes, via re-noising | No, tokens are committed once |
| Speed on dedicated GPU | Up to 4x faster | Baseline |
| H100 throughput | 1000+ tokens/second | Lower (baseline) |
| RTX 5090 throughput | 700+ tokens/second | Lower (baseline) |
| Output quality | Below Gemma 4 | Higher; recommended for production |
| Best fit | Local, low-concurrency, interactive | High-quality and high-QPS cloud serving |
| License | Apache 2.0 | Gemma Terms |
Key takeaways
- DiffusionGemma is a 26B MoE open model (3.8B active) that generates text through parallel diffusion, not token-by-token.
- It runs up to 4x faster on dedicated GPUs: 1000+ tokens/sec on H100, 700+ on RTX 5090.
- Bidirectional attention on a 256-token canvas enables self-correction in real time, unlike autoregressive models.
- Quantized, it fits within 18GB of VRAM, with day-zero support in vLLM, Transformers, MLX, and Unsloth.
- It is experimental and lower in quality than standard Gemma 4; Google recommends Gemma 4 for production.