DiffusionGemma: 4x Faster Text Generation

Recently I had switched to OpenCode to try out many of the Non-US-Frontier-Labs models. My unexpected favorite model to use was Mercury (a diffusion model). Not because it was “smart” but because it was stupid fast. It was more of a pair-programming experience instead of the SOTA agentic experience of prompting and waiting. Honestly, it was also way more fun and brought back some of the pre-AI coding experience while still getting some benefits of AI. It felt less of a slot machine where you prompt, wait, and hope it went in the right direction. It made me even use the tiny models like Gemini Flash Lite and GPT Mini/Nano more too.

Anyways, so excited for an open-weight model and I hope it performs well. I’ll be testing this ASAP.

Some of these comments miss the advantage of diffusion. This will have a big impact on edge devices, such as your phone or the GPU in your computer.

An LLM's decoder computes tokens one at a time because attention has to account for each previous token. Existing LLM decoders scale well when you have enough load to batch many inferences together, so diffusion is of limited benefit there. On edge you have a different problem: your inference accelerator is starved while sloshing GBs of weights back and forth from RAM. That's because consumer RAM like LPDDRx/GDDRx has lower bandwidth than HBM, and the requests are serial, so you can't batch compute over common weights. Diffusion can compute tokens in parallel, which relieves the memory bandwidth bottleneck.
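A rough back-of-the-envelope sketch of that bottleneck, assuming decode is purely memory-bandwidth-bound and every forward pass streams the active weights once. The bandwidth, weight size, and tokens-per-pass figures are made-up placeholders, not measurements. The model also ignores KV-cache traffic, compute limits, and the possibility that a parallel block in an MoE model touches more experts than a single token does.

```python
def bandwidth_bound_tokens_per_s(bandwidth_bytes_per_s, weight_bytes_per_pass, tokens_per_pass=1):
    """Upper bound on decode speed when each forward pass must stream the weights once."""
    passes_per_s = bandwidth_bytes_per_s / weight_bytes_per_pass
    return passes_per_s * tokens_per_pass


# Hypothetical numbers: 1000 GB/s of memory bandwidth, 2 GB of active weights per pass.
one_at_a_time = bandwidth_bound_tokens_per_s(1000e9, 2e9)
# If each pass finalizes 8 tokens on average (an assumption), the weight traffic is amortized.
parallel = bandwidth_bound_tokens_per_s(1000e9, 2e9, tokens_per_pass=8)
print(one_at_a_time, parallel)
```

The point of the sketch is only the ratio: under these assumptions the ceiling scales with how many tokens each weight read produces, until compute becomes the limit instead.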

NVIDIA are hosting a free endpoint for this one, details at https://build.nvidia.com/google/diffusiongemma-26b-a4b-it - you have to create an account and (I think) verify a phone number too.

(I got it to draw a pelican: https://tools.simonwillison.net/markdown-svg-renderer#url=ht... )

A good visual explanation of how text diffusion models like DiffusionGemma work: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...

A few days ago I was just thinking that Google never talked about their diffusion text generation model after demoing it at I/O a year ago. The rumor is that it was too expensive to run, but with the provided chart using the same 1x H100 hardware and comparing DiffusionGemma to regular Gemma, that shouldn't be the case. I'm curious what the downside for this speed is here aside from being slightly weaker than Gemma.

> DiffusionGemma reverses this inefficiency. Instead of predicting words sequentially, it drafts an entire 256-token paragraph simultaneously. By giving the computer's processor a larger chunk of work at once, DiffusionGemma utilizes your hardware to its full potential. It upgrades your model inference from a single, sequential typewriter to a massive printing press that stamps the entire block of text simultaneously.

> Operating as a 26B total Mixture of Experts (MoE) model that activates only 3.8B parameters during inference, DiffusionGemma fits comfortably within 18GB VRAM limits of high-end dedicated consumer GPUs when quantized.

Okay, so Gemma 4 26B is a MoE model that's really fast on my 24 GB GPU using ollama. This sounds like speculative decoding but I don't think that works with MoE models? It's hard to keep up with all this when it's not your job to keep up with it.

What would a diffusing reasoning model look like? Would it have a [thinking] block of pre-defined length that gets diffused over a long time, with the final output block then using what is in that thinking block as part of its input? And how do diffusion models decide the output length in the first place? Is it a pre-set parameter, or does it diffuse an [end] token into the middle somewhere?

Is it just me that thinks it's kinda weird that they conflate speed in tokens/second with latency, when I think of latency as time to first token? Like, it generates an entire paragraph of tokens faster, but wouldn't it still be slower if your reply is only 1 word, because it has to do the entire 256 tokens as a chunk?
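To make the question concrete, here is a toy latency model. It assumes a block diffusion model must always denoise a full block for a fixed number of steps, which is exactly the premise being asked about and not something the announcement confirms. All timings and the step count are hypothetical.

```python
import math


def autoregressive_latency_s(reply_tokens, seconds_per_token):
    """Time to finish a reply when tokens are produced one per forward pass."""
    return reply_tokens * seconds_per_token


def block_diffusion_latency_s(reply_tokens, block_size, steps_per_block, seconds_per_step):
    """Time to finish a reply when whole blocks are denoised for a fixed number of steps."""
    blocks = math.ceil(reply_tokens / block_size)
    return blocks * steps_per_block * seconds_per_step


# Hypothetical timings: 10 ms per autoregressive token, 16 denoising steps of 12 ms each.
for n in (1, 256):
    ar = autoregressive_latency_s(n, 0.010)
    diff = block_diffusion_latency_s(n, block_size=256, steps_per_block=16, seconds_per_step=0.012)
    print(f"{n:>3} tokens: autoregressive {ar:.3f}s, block diffusion {diff:.3f}s")
```

With these made-up numbers the one-word reply finishes later under block diffusion while the full paragraph finishes much sooner. Whether the real model can stop early on short replies would change the first case.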

I think this is the future. The sort of left-field rumble that turns into a quake in 5 years.

Can LoRAs be used to increase the quality of these diffusion models? Nvidia mentions something about this https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B#inf...

It is cool but local models while okay already feel noticeably worse than even the cheapest APIs so I can't see myself sacrificing even a little bit of their quality for speed. I'm sure it's worth it for some usecases, curious to hear specific ones that people are already planning to deploy to production.

Is the diffusion approach any use in Multi-Token Prediction (MTP) drafters? https://blog.google/innovation-and-ai/technology/developers-...

I just *love* the commit message on Github: "Make TPUs go brr"

I'm curious how diffusion models do at tool calling, and what wins there are.

The video demo of the svg sword is a good example of what is so interesting about diffusion models: it's not just putting one token after another to make edits to a file. It's skipping around, it's re-editing previous lines. I feel like forcing it to write tool calls is maybe not its best nature.

I feel like perhaps instead of a monolithic edit file tool call, perhaps the diffusion model would be better suited to posting a change stream, a series of edit ops, across multiple files.

I can’t help but feel like there’s something here that will matter for future LLMs.

The bidirectionality could be a big deal: being able to refine a sentence with both left and right context feels closer to how editing/thinking actually works than committing to each token forever.

Maybe the current models aren’t good enough yet, but the direction feels important.

We need more local open weight models that are performant and just as good (or good enough) as the best frontier ones.

Then you get the Jevons paradox effect and enjoy the same “productivity gains” without paying the extortionate token prices charged by closed model providers, or at least get them as cheaply as possible.

And especially, no silent nerfing of the model.

got one answer by reading the rest of the comments, makes sense that the diffusion process is inherently reasoning-like: https://www.inceptionlabs.ai/blog/introducing-mercury-2

> I'm curious what the downside for this speed is here

"DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs"

Well, with a standard autoregressive model you can generate, for example, 256 tokens at once if you have 256 users. With this approach you can generate 256 tokens for a single user, but you need several forward steps.

So the diffusion process takes more GFLOPs; if you have enough users, you can already balance memory and compute.
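A small sketch of that compute comparison, counting how many token positions go through the network per token that is actually emitted. The 16 refinement passes are an assumed number for illustration, and the count ignores prompt processing and attention over earlier context.

```python
def position_evals_per_output_token(tokens_out, positions_per_pass, passes):
    """Token positions pushed through the network per token actually emitted."""
    return positions_per_pass * passes / tokens_out


# Autoregressive, 256 users batched: one pass covers 256 positions and emits 256 tokens.
batched_ar = position_evals_per_output_token(tokens_out=256, positions_per_pass=256, passes=1)
# Diffusion, one user: a 256-token block refined over an assumed 16 passes emits 256 tokens.
single_user_diffusion = position_evals_per_output_token(
    tokens_out=256, positions_per_pass=256, passes=16
)
print(batched_ar, single_user_diffusion)
```

Under this simplification the extra compute per emitted token is just the number of refinement passes, which is why the trade only pays off when the hardware would otherwise sit idle.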

Yes, DFlash is currently a SOTA speculative decoding method that Xiaomi just used in their MiMo model for >1000tkps

MTP is a training optimization. Drafting requires verification, and verification is a full-model inference pass. Speculative decoding is the name for the inference-time optimization, where a smaller model drafts and the full model verifies.

I registered a few weeks ago and the account is still not verified, despite following the procedure. You can't use the API if the account is not verified.

Maybe writing / bootstrapping unit tests?

That does not need Opus-level quality to write, and it's easy to iterate on.

This is a different model with, confusingly, approximately the same number of params as the existing gemma4 MoE. Unclear from a quick scan whether one was trained somehow from the other.

The mechanism isn't the same as speculative decoding. Speculative decoding happens sequentially and (usually) a couple of tokens at a time; diffusion doesn't, and does blocks of text at once. I haven't read the collateral yet but my assumption would be that it's trained to keep the specific experts stable across a diffusion block.

If you can run your tests fast and cheaply, and have metrics that show what bad/sloppy code is that are cheap & fast to generate, a worse fast model can outperform a far better far slower model if you value time...

I've had pretty good success with LLMs after putting in place metrics to measure true complexity (not cyclomatic), and automatically pushing back everything until the added complexity is within reason for the feature.

I wonder how much this will impact locally used models for coding. I can imagine using diffusion models that are x-times faster than Qwen or Gemma 4 - where I have to do more "pre-ai" work which is a good thing and can have a very fast, very cheap model to work with locally. I assume since it doesn't do heavy computing for a long time that it's cheaper to run on local hardware as well?

Could you say more about how you use it? What does your workflow look like?

So you're making smaller edits?

This may be the future of local models.

The thing is, diffusion models perform somewhat worse than autoregressive on text. So you lose some performance.

Speed is the big advantage. Autoregressive when doing local inference is mostly memory bound; you're doing one token at a time, for each token you need to load all weights. MTP helps a bit by allowing you to draft tokens in a smaller model and then verify them in parallel with the larger model, allowing you to do a few computations for every memory load, but because you're still doing tokens sequentially and need to discard invalid drafted tokens, you can only get so much speedup.
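The "only so much speedup" part can be sketched with the usual simplification from the speculative decoding literature: assume each drafted token is accepted independently with the same probability, and that the verifier contributes one token of its own per pass. The 0.8 acceptance rate is a hypothetical value, and real acceptance is neither constant nor independent.

```python
def expected_tokens_per_verify(acceptance_rate, draft_length):
    """Expected tokens emitted per full-model pass, counting the token the verifier adds itself."""
    if acceptance_rate >= 1.0:
        return float(draft_length + 1)
    return (1 - acceptance_rate ** (draft_length + 1)) / (1 - acceptance_rate)


# Hypothetical acceptance rate of 0.8; watch the gain flatten as drafts get longer.
for k in (2, 4, 8, 16):
    print(k, round(expected_tokens_per_verify(0.8, k), 2))
```

Under this model the value can never exceed 1 / (1 - acceptance_rate), which is 5 for the assumed rate, no matter how many tokens are drafted.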

For hosted models, however, you can batch many token generations together, fully utilizing all of the compute while no longer being bottlenecked on memory bandwidth. So they are already operating at close to max efficiency.

So, diffusion kind of loses its benefit in hosted models. Sure, maybe you could pay more to have slightly lower latency responses by doing diffusion for one user at a time instead of autoregressive for many in parallel. But given that it also reduces accuracy, it's hard to see where you'd really want that. Unless they're able to bring it up to par with autoregressive, it seems like it's a bit of a dead end outside of local models where you're generally just doing one thing at a time.

Almost certainly not if things remain as they are. The reason there's been little traction is that the quality gap between diffusion and autoregressive models is pretty stark. I mean, just look at the benchmarks here. Large dropoffs, with the hardest benchmarks seeing the largest drops. On top of that, almost all the speed benefits of diffusion models are negated at scale. So this is only attractive for local model development, and almost everyone training local models still cares about pound-for-pound quality and inference efficiency at scale.

We have this though, right? Compare SOTA local models to where the frontier was last year. There weren't many people complaining that last year's frontier models were incapable.

Next year, and the year after, Fable, GPT 5.5 and Gemini 3.5 will feel quite ordinary. And perhaps even within reach of a prosumer running models locally.

I can see it but even if I do that for something like tests I'd still eat the time cost of the normal Gemma for 10% extra performance. And further, if you switch between the fast and normal Gemma for different tasks you eat the big time cost of loading the other model (and maintaining both in the first place).

Batching is a fair counterpoint.

Thanks. I found this other comment that links to a very thorough explanation: https://news.ycombinator.com/item?id=48479042

Imagine you’re entirely pre-AI… to do some work, you read code, think, then write some code across a number of files. Usually then a small dance with compilation/unit tests to address anything broken. Along the way, you use your human judgement on style and quality, and midway through your change you might refactor something based on learned best practices (eg, when to use a static method, or helper class).

Today, even the dumbest AI agents can trivially loop through the final dance to get compilation, and often unit tests (depending on scope of failure). Big SOTA agents have OK code quality, but if left unattended or unchecked will still generate pretty sloppy repos after a while. That’s true even when using models like Opus which is ridiculously expensive in comparison.

When using the models in this fast “pair programming” style, I find that I (the human) mostly do all the “plan and think” work, and usually point the smaller agent towards specific files/directories, with specific targeted changes. It’s slower than 1-shot prompting an entire feature, but slightly faster than doing it manually, and I find the code is less “slop” because the changes are smaller and more human. It retains the agentic benefits of handling imports, compilation iteration, etc. and can do basic cross-file plumbing. It’s also cheap and fast to do iterations like “wait make that method static” or “let’s update this to use <other util class>” and things like that. When the agent is slow to make localized edits, I find I’m less likely to push for minor nit-picks and style updates.

It's fast enough that "ask it twice and pick the best" should still come out ahead performance-wise. I don't know how much that would close the quality gap by, but it's worth a play.
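For a feel of how much a second attempt could buy, a tiny sketch. It assumes attempts succeed independently with the same probability and that you have a reliable way to pick the good one (passing tests, say). Both assumptions are optimistic, and the success rates are hypothetical.

```python
def chance_at_least_one_good(p_single, attempts):
    """Probability that at least one of several independent attempts is acceptable."""
    return 1 - (1 - p_single) ** attempts


# Hypothetical single-attempt success rates for a weaker, faster model.
for p in (0.5, 0.7, 0.9):
    print(p, round(chance_at_least_one_good(p, attempts=2), 2))
```

In practice two samples from the same model tend to fail on the same hard cases, so the real gain is likely smaller than this independent-attempts estimate.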

I'm particularly curious to know how this plays out, and I seriously hope that more labs focus on diffusion models for text usage.

My immediate thought - this performs slightly worse than the autoregressive gemma equivalent, but it may also let me functionally run better models in diffusion variants.

Ex - I can run 70b-120b autoregressive models locally right now, but I get ~5-15t/s, which just isn't fast enough for serious work.

Which caps me down in the 20-36b models (ex - gemma4) where I can get 100+t/s on the same hardware.

So the question becomes - does the quality drop from a diffusion model outweigh the quality bump from using a larger model?

Because if not... sounds like diffusion models have a lot of space to thrive.

---

Sadly - if they can't be hosted profitably, I question whether this space will actually be explored.

Jun 10, 2026

Our newest open experimental model delivers up to 4x faster inference on dedicated GPUs and opens the door to exploring speed-critical, interactive local workflows.

Brendan O'Donoghue

Research Scientist

Sebastian Flennerhag

Research Scientist

DiffusionGemma

Today, we’re introducing DiffusionGemma, an experimental open model that explores text diffusion, an exceptionally fast approach to text generation. Released under an Apache 2.0 license, this 26B Mixture of Experts (MoE) model moves beyond the sequential token-by-token processing of typical autoregressive Large Language Models (LLMs). Instead, it generates entire blocks of text simultaneously, delivering up to 4x faster text generation on GPUs.

Intelligence vs Latency

Built upon the industry-leading intelligence-per-parameter of our Gemma 4 family and cutting-edge Gemini Diffusion research, DiffusionGemma integrates a novel diffusion head designed to maximize generation speed. While autoregressive Gemma 4 models remain the standard for high-quality production outputs, DiffusionGemma is designed for researchers and developers exploring speed-critical, interactive local workflows such as in-line editing, rapid iteration, and generating non-linear text structures.

Unlocking new value for developers

Developers building real-time interactive AI applications often struggle with the latency bottlenecks of local inference. DiffusionGemma addresses these challenges directly, with some key trade-offs:

Blazing fast inference: By shifting the decode bottleneck from memory bandwidth to compute, DiffusionGemma generates up to 4x faster token output on dedicated GPUs (1000+ tokens per second on a single NVIDIA H100, 700+ tokens per second on NVIDIA GeForce RTX 5090).
Accessible hardware footprint: Operating as a 26B total Mixture of Experts (MoE) model that activates only 3.8B parameters during inference, DiffusionGemma fits comfortably within 18GB VRAM limits of high-end dedicated consumer GPUs when quantized.
Bi-directional attention: Generating 256 tokens in parallel with each forward pass allows every token to attend to all others. This provides significant advantages for non-linear domains such as in-line editing, code infilling, amino acid sequences or mathematical graphs.
Intelligent self-correction: The model iteratively refines its own output, allowing it to evaluate the entire text block at once to fix mistakes in real-time.
Experimental status & production recommendations: Because it prioritizes speed and parallel layout generation, DiffusionGemma’s overall output quality is lower than standard Gemma 4. For applications that demand maximum quality, we recommend deploying standard Gemma 4.
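As a quick sanity check on the hardware footprint item above, the weight storage alone can be estimated from the parameter count and the bits per weight. This is a sketch, assuming all 26B parameters stay resident in VRAM even though only 3.8B are active per token. The bit widths are illustrative, since the 18GB figure is not tied to a specific quantization in the text, and the estimate leaves out quantization scales, the KV cache, and activations.

```python
def weight_gigabytes(total_params, bits_per_weight):
    """Storage for the weights alone, in GB (10**9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


# 26B total parameters comes from the announcement; the bit widths are assumed examples.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gigabytes(26e9, bits):.1f} GB")
```

Of the three assumed widths, only the 4-bit case leaves headroom under an 18GB budget for everything the estimate leaves out.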

DiffusionGemma Benchmark

You can improve DiffusionGemma's performance on specific tasks through fine-tuning. In the example below, Unsloth fine-tuned DiffusionGemma to play Sudoku — a task autoregressive models struggle with because each token depends on future tokens. DiffusionGemma's bi-directional attention makes this much easier.

Fine-tuned DiffusionGemma solving Sudoku.

Why diffusion for text?

While the AI research community has explored diffusion-based text generation for years, applying it to large models has remained a challenge. DiffusionGemma changes this by shifting how models use hardware.

The trade-off with traditional models

Most language models act like a typewriter, generating one token at a time from left to right. In the cloud, this is efficient because servers can batch thousands of user requests together to share the hardware load. But when run locally for a single user, this word-by-word process leaves your dedicated GPU or TPU underutilized — it spends most of its time simply waiting for the next "keystroke."

DiffusionGemma reverses this inefficiency. Instead of predicting words sequentially, it drafts an entire 256-token paragraph simultaneously. By giving the computer's processor a larger chunk of work at once, DiffusionGemma utilizes your hardware to its full potential. It upgrades your model inference from a single, sequential typewriter to a massive printing press that stamps the entire block of text simultaneously.

DiffusionGemma text-to-3D SVG demo by Hugging Face. Step-by-step generation.

This means DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs. The throughput advantage is strongest at low-to-medium batch sizes on a single accelerator.

How text diffusion works

Similar to AI image generators that start with visual static and iteratively refine it into a clear picture, DiffusionGemma applies this to text:

The canvas: The model starts with a canvas of random placeholder tokens.
Iterative refinement: The model makes multiple passes, locking in correct tokens and using them as context clues to refine the rest.
Final polish: The text converges into high-quality output.

Because the model can process the whole paragraph while generating, it unlocks new patterns of model behavior, like perfectly closing complex markdown formatting or generating and rendering code in near real-time.
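The three steps above (canvas, iterative refinement, final polish) can be sketched as a toy loop. This is not DiffusionGemma's algorithm: `toy_denoiser` is a hypothetical stand-in that already knows the answer and attaches a random confidence to each proposal, the canvas uses a single `_` placeholder instead of random tokens, and the "lock in the most confident slots each pass" schedule is an assumption made for illustration. Unlike the real model, the toy never revises a token once it is locked in.

```python
import random

MASK = "_"  # placeholder token standing in for the not-yet-decided canvas slots


def toy_denoiser(canvas, target, rng):
    """Hypothetical stand-in for the model: a proposed token and confidence per open slot."""
    proposals = {}
    for i, tok in enumerate(canvas):
        if tok == MASK:
            proposals[i] = (target[i], rng.random())
    return proposals


def diffuse(target, steps, seed=0):
    """Fill a fully masked canvas over several passes, most confident slots first."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = [list(canvas)]
    for step in range(steps):
        proposals = toy_denoiser(canvas, target, rng)
        if not proposals:
            break
        # Spread the remaining open slots evenly over the remaining passes.
        quota = -(-len(proposals) // (steps - step))  # ceiling division
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:quota]:
            canvas[i] = proposals[i][0]  # lock the slot in
        history.append(list(canvas))
    return canvas, history


final, history = diffuse("the quick brown fox jumps over the lazy dog".split(), steps=3)
for snapshot in history:
    print(" ".join(snapshot))
```

Each printed line is one pass over the whole canvas. Tokens appear out of order, which is the visible difference from left-to-right decoding.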

Get started today

Download the weights: Access the experimental model weights (released under a permissive Apache 2.0 license) right now on Hugging Face.
Integrate & learn: Learn more in our DiffusionGemma developer guide. Or deep dive into A Visual Guide to DiffusionGemma to understand the mechanics under the hood.
Use your favorite development tools: Serve the model efficiently using MLX, vLLM (with integration supported by Red Hat), and Hugging Face Transformers. For rapid experimentation, we are releasing a fine-tuning tutorial using Hackable Diffusion, a modular JAX toolbox designed for composability. You can also explore fine-tuning with Unsloth and NVIDIA NeMo. Additionally, official support for llama.cpp is arriving soon.
Experience optimized performance: We worked with NVIDIA to optimize across their hardware stack, ensuring compatibility with consumer setups (quantized for GeForce RTX 5090 and 4090 GPUs) alongside high performance on enterprise systems (Hopper and Blackwell using advanced NVFP4 kernels), including NVIDIA DGX Spark and DGX Station for local deskside deployment, and RTX PRO for AI professionals. Native support for NVFP4 (4-bit floating-point) accelerates compute throughput, allowing the model to run at faster speeds with near-lossless accuracy.
Try your way: Run on your desktop dedicated GPU or in the cloud through Gemini Enterprise Agent Platform Model Garden or NVIDIA NIM.

Note: Because this speedup relies on exploiting the high arithmetic intensity of accelerators, unified-memory architectures like those in Apple Silicon Macs — which are often memory-bandwidth-bound rather than compute-bound during inference — may not see the same acceleration over autoregressive models like Gemma 4.
