Gemma:31b was more accurate but speed was horrendous.
You can try it out with Ollama 0.23.1 by running `ollama run gemma4:31b-coding-mtp-bf16`.
Credit for the MTP technique is due to https://arxiv.org/abs/2404.19737 from 2024:
Better & Faster Large Language Models via Multi-token Prediction. Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve.
https://github.com/vllm-project/vllm/pull/41745
("Add Gemma4 MTP speculative decoding support")
Focusing more on performance-per-compute efficiency than on pure performance. And maybe that's why Gemini is (seemingly) lagging behind?
Other providers are hitting capacity and hitting the limits of subsidising their inference.
Google's strategy seems to be about scaling and distributing these models to their existing billions of users.
google/gemma-4-31B-it-assistant
google/gemma-4-26B-A4B-it-assistant
google/gemma-4-E4B-it-assistant
google/gemma-4-E2B-it-assistant
I always encourage folks who are interested in LLM internals to read up on speculative decoding (both the basic version and the more advanced MTP), and if you have time, try and implement your own version of it (writing the core without a coding agent, to begin with!)
However, it is a little painful to try to fit the best possible version into 24GB of VRAM with vision plus this drafter. My build doesn't support any more GPUs, and I believe I would want another 4090 (overpriced) for best performance, or otherwise just replace it altogether.
It's not uncommon to see a gemma vs qwen comparison where qwen does a bit better but spent 22 minutes on the task, while gemma aligned the buttons wrong but only spent 4 minutes on the same prompt. So taken at face value, gemma is now underperforming leading open models by 5-10%, but doing it in 1/10th the time.
The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.
Gemma 4 26B-A4B is much quicker on my setup vs Qwen3.6-35B-A3B (by about 3x), so the thought of a 1.5x speedup is tantalizing.
Have tried draft models with limited success (a smaller 3B draft model on top of a dense 14B Ministral model already introduced too much overhead).
They serve gemma-4-26b-a4b-it.
You could pair a big and a small model, like qwen 32b with qwen 4b, and have that same dynamic of the small model generating tokens and the big one "certifying" them.
The blog says something about re-using the big model's data?
I tried first with Qwen but it was unstable and had ridiculously long thinking traces!
Any idea how much worse they will be? Or is the issue that their error will really diverge as you accept more of their tokens?
I'm not seeing any update to the app on my android phone... maybe later today?
>We've published an in-depth technical explainer
I was expecting a pdf link, but this goes to a brief article on twitter/X. lol, okay...
Is gemma-4-E4B-it-assistant a model I can use stand-alone or a model I need to use in combination with gemma-4-E4B-it?
Beta but useable
https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...
Yeah, part of that is installing a model in Chrome for millions of users without consent.
https://www.youtube.com/watch?v=sXgZhGzqPmU
As for why they offer it in the cloud - I think it's just an effort to promote the brand. The Gemmas are pretty small, so they can host them without it being a major drain on the company. They have the infra anyway.
If a GGUF file is generated for MTP, it must include both the big model and the small model. Another comment referenced a llama.cpp PR that also updates the Python script used for converting from safetensors files, which presumably can handle combining the two paired Gemma 4 models.
Might be easier to chuck it over the fence and let other providers handle it, as it'll run on almost any commercial-grade card?
Also speculating, but I wonder if it might also create a bit of a pricing problem relative to Gemini Flash-Lite, depending on serving cost and quality of outputs?
As a comparison, despite being SotA for their size, the smallest qwen models on openrouter (27b and 35b) are not at all worth using, as there are way bigger and better models for a lower price on a per-token basis.
If Gemma 4 is less lucrative than Claude to the Google Cloud kingdom, the Cloud kingdom will want you using Claude.
This is an oversimplification, but tl;dr you need both, yes.
E4B = 4B effective parameters (using per-layer embeddings)
E2B = 2B (like above)
it = instruction tuned (rlhf and all that jazz)
assistant = Multi-token drafters (the new 2x speed up)
Not sure about this implementation, but conceptually it only works well on very capable GPUs for very predictable output. Typical speedup is about 30%; not sure how Google is claiming 250%, which is ridiculous.
And if you don't have enough compute, then you get negative speedup from all the extra overhead.
GPT 5.5/5.4 are the smartest models, but at great token / code bloat cost. Qwen 3.6 Max strikes a good balance. But Gemma 4 26B writes some really efficient code, with great results considering the model size. Things do start falling apart under higher contexts.
Maybe after Google I/O, more people will catch on to how good it is.
That could explain the token usage difference, because larger models usually use fewer tokens per the same unit of intelligence.
Modem vs Claude according to Claude:
300 @ 2368 characters - 1m 19s
1200 @ 2368 characters - 19.7s
2400 @ 2368 characters - 9.9s
14.4K @ 2368 characters - 1.6s
33.6K @ 2368 characters - 705 ms
56K @ 2368 characters - 447 ms
Claude @ 2368 characters - 7.9s
I'm using Gemma 4 31B in my app with 5 agents, 1.5k requests per day, each.
I'm curious where my understanding is wrong, but I didn't think you necessarily got the exact same output with how I understand speculative decoding to be used. I thought that if the small model produces tokens that are "good enough", meaning within the top few tokens the larger model produces, they're accepted.
I thought it doesn't necessarily have to produce the exact same token the larger model would have produced to be accepted (and that requiring this would reduce the hit rate by a lot.) Just one the top model could have produced with whatever top-k and temperature settings.
But I think the key is that in the standard autoregressive case we get memory bandwidth bound, so there are tons of idle compute resources. And so checking multiple tokens is cheap because we can batch and thus reuse the read weights for multiple tokens.
The verification step is similar to a prefill with a small batch size. The difference is what we do with the generated logits.
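Some back-of-envelope math makes this concrete (a minimal sketch; all numbers below are illustrative assumptions, not measurements):

    # Rough roofline sketch for a memory-bandwidth-bound decode step.
    weights_gb = 26        # model weights at the chosen quantization, in GB
    bandwidth_gbs = 1000   # GPU memory bandwidth, GB/s

    # Each decode step must stream all weights once, no matter how many
    # tokens are being verified, so step latency is floored at:
    step_s = weights_gb / bandwidth_gbs   # ~26 ms

    # Verifying 1 token or 8 draft tokens costs roughly the same step,
    # since the extra FLOPs hide behind the weight traffic.
    for k in (1, 2, 4, 8):
        print(f"{k} accepted tokens/step -> {k / step_s:.0f} tok/s")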
Can you give an intuition as to why it's faster? I would have thought regardless how many you run in parallel, the successful check has to execute the full model to generate the full sequence so you will have exactly the same time needed? Or is it by process of elimination so it terminates early once it eliminates the non-viable choices? (in which case, how do you guarantee the correct output was speculatively generated at all to be the last survivor?)
Caveat: Gemini has been dumbed down a few times over the last year. Rate limits tightened up too. So it might not be this good in the future.
Some of the work in that direction like Cerebras or Taalas have been doing is an interesting glimpse of what might be possible. In the meantime it's a fun thought experiment to wonder about what might be possible if even current state of the art models were available at like, a million tokens per second at a very low cost.
The general narrative I would read on HN/others, was that Google would be able to outlast/outcompete OpenAI and Anthropic because Google had both more money and more compute. Playing the game of subsidizing their most capable models to capture market share longer than the VCs could.
But instead I feel like Google opted out of that much earlier, shifting their focus to efficiency and scaling. Flash and Gemma are where Google was actually ahead of the competition while everyone else was focused on bigger, more capable models.
In the last month the environment has changed, compute is constrained, costs for consumers are way higher than expected. Copilot pretty much imploded, and I'm guessing both Anthropic and OpenAI are starting to feel the squeeze.
My personal opinion is that this was necessary because integrating AI into products like AI Overviews and Search meant scaling to billions of users was a requirement right out of the gate. And there's not enough money/compute, no matter who you are, to use frontier models for that.
They're somehow connected to vision & block speculative decode...don't ask me how/why though
For gemma specifically I had more luck with speculative decoding via the llama-server route than LM Studio.
For gemma4 26B, same quantization, I get >200TPS.
Also note that qwen is extremely inefficient in reasoning; the reasoning chains are ~3x longer than gemma on average
Local models are the future, it's awesome.
Predicting "America" in "The United States of ..." Is a different task from predicting the whole sentence.
So the small model is laying the blocks, and the bigger model would be cementing them in place or kicking them down. The bigger model's course correction is what keeps the smaller model's predictions relatively on track.
I already played with Gemma4 on oMLX a while ago. When I have some time I'll check if it supports running MTP models and play a bit more
Edit: Ok, I understand now. You are saying that MTP has two aspects. 1) The training (for the mini-models to generate tokens), and 2) The actual speculative decoding implementation on the inference side (which uses those trained mini-models).
For someone who's been running local models for a long while, these are very very exciting times.
Why, when asking a model to change text in a minor way, are we not asking it to generate the operational transformations necessary to modify the text, and then just executing the OT on the existing text vs reproducing every token? Maybe tools are doing that more than I realize?
Google has now provided small models for each of the previous Gemma 4 models, e.g. "gemma-4-26B-A4B-it-assistant" for "gemma-4-26B-A4B-it".
The difference vs. Qwen is that here each small model is not some general-purpose smaller model, but a model that has been optimized specifically for this task, to predict the output of the bigger model with which it is paired.
This specialization and optimization of the Google "gemma-4-*-assistant" models ensures that they are much smaller and thus much faster than general-purpose small models.
Plus the harness improves over time - the coming version has a hotkey for screen capture, and the next release will have support for native Excel and docx export.
there is value in being offline by design
Most of the complexity in implementing a simple toy version comes from having to get the KV cache back into a good state for the next cycle (e.g. if only the first half of your draft tokens were correct).
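For anyone curious, here's roughly what that rollback looks like (a toy sketch; `kv_cache` is a hypothetical per-layer list of (K, V) tensors, whereas real engines use paged/block allocators rather than naive slicing):

    # Toy KV-cache rollback after a verify step accepted only some drafts.
    def rollback_kv(kv_cache, n_accepted, n_drafted):
        """Keep cache entries for accepted tokens, drop the rest."""
        n_discard = n_drafted - n_accepted
        if n_discard == 0:
            return kv_cache  # every draft token was accepted
        # k and v have shape [seq_len, ...]; trim the stale tail.
        return [(k[:-n_discard], v[:-n_discard]) for k, v in kv_cache]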
The reason it's designed this way is a bit subtle but it has the advantage during training that you can use a single block of 10 tokens to generate 9 training examples in parallel, so it's highly efficient. This efficiency is basically the main benefit of transformers - the algorithm parallelizes really well and that's what allowed the scale up to large language models as opposed to the previous reality of just language models.
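Concretely, that's the standard shift-by-one trick: one block of tokens yields one training example per position, all scored in a single forward pass. A minimal PyTorch sketch (the `model` here is a tiny stand-in, not a real causal LM):

    import torch
    import torch.nn.functional as F

    vocab = 32000
    model = torch.nn.Sequential(            # stand-in for a causal LM
        torch.nn.Embedding(vocab, 64),
        torch.nn.Linear(64, vocab),
    )

    tokens = torch.randint(0, vocab, (1, 10))          # block of 10 token ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]    # shift by one

    logits = model(inputs)                  # [1, 9, vocab]: 9 predictions at once
    loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))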
The blog post does discuss why MTP is faster but it's maybe a bit hard to understand if you haven't studied LLM internals. During inference the hardware has arithmetic units idling because they spend so much time waiting for the weight matrices to get moved closer to the processors. Because data movement and computation can be overlapped, if you can reuse the same loaded data for multiple calculations at once you're winning - it's free latency-wise because you're just exploiting previously idle resources (it's not free in terms of energy).
Speculative decoding and MTP exploit this to run the model in parallel on several tokens at once. Say your context window contains "The United". The KV cache has been populated by the main model for this set of tokens. The draft model is given "The United" and predicts " States of America" in one forward pass (this part where it can predict multiple tokens at once with a single pass is the MTP part). Then the main model is given the KV cache from last time along with " States of America". In its own forward pass it can then compute in parallel the completions of both "The United", "The United States", "The United States of" and "The United States of America" (the last one might be an eos token indicating it wants to stop talking.). That's the speculative decoding part.
Now you decode the main model at each position (look at the token probabilities and pick one according to some decoding strategy). It's possible the main model didn't pick " States" at all, or picked " States", but then its prediction diverged e.g. if it wants to say "The United States is a country". So you just select the tokens that match and toss all the tokens starting from the one that didn't. Repeat.
The parallelism comes almost for free because the same weight matrices can be reused multiple times before they're swapped out for the next.
So this is a case of trading off idle compute capacity that's waiting for the bottleneck (memory access).
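If it helps, here's the whole greedy loop in miniature. This is a sketch, not a real engine: `draft_next` and `target_preds` are hypothetical stand-ins for the two models, where `target_preds` returns the big model's argmax pick after each prefix of the batched verify pass.

    # Minimal greedy speculative decoding step.
    def speculative_step(ctx, draft_next, target_preds, k=4):
        drafted = draft_next(ctx, k)          # e.g. [" States", " of", " America"]
        preds = target_preds(ctx, drafted)    # preds[i]: target's pick given ctx + drafted[:i]
        out = list(ctx)
        for i, tok in enumerate(drafted):
            if preds[i] != tok:               # first divergence: toss the rest
                out.append(preds[i])          # take the big model's token instead
                return out
            out.append(tok)
        out.append(preds[len(drafted)])       # free bonus token from the verify pass
        return out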
It would be nice if this was a bit more obvious and clear too.
They built an entire-wafer ASIC. The whole thing is one huge active die; it takes a lot of clever engineering and cooling to make it work, and is very cool.
As a consumer, 24-32 GB of VRAM is affordable ($1-2k) and that's the frontier I'm most interested in. It's very "two papers down the line". Those models are also feasible to fine-tune, unlike the O(100+B) behemoths. The 4000 Pro Blackwell has a very good TDP compared to people insisting on using 300-600W gaming cards. If I was freelancing, I would definitely consider getting a 6000 for work.
And even alibaba's own qwen3.6-plus is $1.95, so it's kinda easy to conclude that neither alibaba nor anyone else is really interested in hosting that model.
And don't get me wrong, I fully agree with you, qwen3.6 27b is an amazing model. I run it on my own hardware and every day I'm constantly surprised with what it can zero shot.
The draft model essentially predicts the next token quickly, enabling you to start generating the subsequent token in parallel. If the guess is right, the second generated token is correct. If it is wrong, the second generated token is also potentially wrong, so it must be generated again using the correct prior token obtained through the big model.
A poor draft model will simply slow down the process without affecting the output.
The big target model calculates
P(d1)
P(d2|d1)
P(d3|d1 d2)
In parallel. If we were just greedy decoding it would be simple. Just stop when the draft model doesn't predict the most likely token as judged by the target model. At that point, append the correct token from the target model and kick off both models again in parallel.
In practice we aren't using greedy decoding. We are sampling and we need to match the target model's distribution. To do this, we accept tokens from the draft model probabilistically, which is possible because we have the logits of both the draft model and the target at that point. The ratio of their softmax probabilities is used for this.
You are right that actually accepting tokens has to happen sequentially, but that's a heck of a lot faster than a forward pass.
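That acceptance rule from the speculative sampling paper (https://arxiv.org/pdf/2302.01318) is short enough to sketch. Here `p_target` and `q_draft` are assumed to be full probability vectors at one position:

    import numpy as np

    def accept_or_resample(x, p_target, q_draft, rng=np.random.default_rng()):
        """Speculative sampling test for one draft token x."""
        if rng.random() < min(1.0, p_target[x] / q_draft[x]):
            return x, True                    # accepted
        # Rejected: resample from the normalized residual max(0, p - q),
        # which keeps the overall output distribution exactly p_target.
        residual = np.maximum(p_target - q_draft, 0.0)
        residual /= residual.sum()
        return rng.choice(len(p_target), p=residual), False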
> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.
Right, this is the same way batching works. It's "free" until we exhaust available compute resources, at which point decode throughput becomes compute bound. (This is a good place to be, because scaling out compute is a lot easier than adding fast VRAM.) This is why MTP is mostly useful when you have one or few users, which means compute is abundant. When you're running large batches you're better off using that compute to grow your batch size.
Of course, batch size is usually limited by things like bulky KV caches. So perhaps MTP has some residual use in that setting. But if you're sharing cached context in a subagent swarm, or running a model like the recent DeepSeek V4 with its tiny KV cache, you can go a lot further in processing a larger batch.
I thought "fine-tuning" meant training it on additional data to add additional facts / knowledge? I might be mistaking your use of the word "tune", though :)
Antigravity seems significantly better in comparison, but with lower usage limits. If I run out, I usually don't bother switching to Gemini CLI.
Are you having better results?
Codex is fast and decent, but I REALLY have to stay on top of it. The amount of times it makes executive design decisions on the fly to completely break everything is way too high.
But with this patch I saw 46t/s with qwen3.6 27B q8... this is insane, it's 250% faster than the original speed, there was no gpu I could upgrade to get that kind of boost, amazing!
All 4 gemma-4-*-it models, regardless of whether they are dense models or MoE models, have associated small models for MTP, whose names are obtained by adding the "-assistant" suffix.
https://huggingface.co/google/gemma-4-E2B-it-assistant
https://huggingface.co/google/gemma-4-E4B-it-assistant
I'm surprised the 26B-A4B is better? It should be faster too, interesting. I'm excited to try 31B with MTP, because MTP-2 is what makes 27B bearable on the GB10.
What are you using it for? Agent-based coding, or something else?
Not sure why (too amateur sorry).
Though I think qwen was natively trained on tool calling.
So any tests done with models that have not been updated in the last few days are no longer relevant; they must be repeated after updating the models and regenerating any derived file formats, like GGUF files.
Interesting, must try tomorrow.
May 05, 2026
By using Multi-Token Prediction (MTP) drafters, Gemma 4 models reduce latency bottlenecks and achieve improved responsiveness for developers.
Olivier Lacombe
Director, Product Management
Maarten Grootendorst
Developer Relations Engineer

Just a few weeks ago, we introduced Gemma 4, our most capable open models to date. With over 60 million downloads in just the first few weeks, Gemma 4 is delivering unprecedented intelligence-per-parameter to developer workstations, mobile devices and the cloud. Today, we are pushing efficiency even further.
We're releasing Multi-Token Prediction (MTP) drafters for the Gemma 4 family. By using a specialized speculative decoding architecture, these drafters deliver up to a 3x speedup without any degradation in output quality or reasoning logic.
Tokens-per-second speed increases, tested on hardware using LiteRT-LM, MLX, Hugging Face Transformers, and vLLM.

The technical reality is that standard LLM inference is memory-bandwidth bound, creating a significant latency bottleneck. The processor spends the majority of its time moving billions of parameters from VRAM to the compute units just to generate a single token. This leads to under-utilized compute and high latency, especially on consumer-grade hardware.
Speculative decoding decouples token generation from verification. By pairing a heavy target model (e.g., Gemma 4 31B) with a lightweight drafter (the MTP model), we can utilize idle compute to "predict" several future tokens at once with the drafter in less time than it takes for the target model to process just one token. The target model then verifies all of these suggested tokens in parallel.
Standard large language models generate text autoregressively, producing exactly one token at a time. While effective, this process dedicates the same amount of computation to predicting an obvious continuation (like predicting "words" after "Actions speak louder than...") as it does to solving a complex logic puzzle.
MTP mitigates this inefficiency through speculative decoding, a technique introduced by Google researchers in Fast Inference from Transformers via Speculative Decoding. If the target model agrees with the draft, it accepts the entire sequence in a single forward pass, and even generates an additional token of its own in the process. This means your application can output the full drafted sequence plus one token in the time it usually takes to generate a single one.
For developers, inference speed is often the primary bottleneck for production deployment. Whether you are building coding assistants, autonomous agents that require rapid multi-step planning, or responsive mobile applications running entirely on-device, every millisecond matters.
By pairing a Gemma 4 model with its corresponding drafter, developers can achieve:
Gemma 4 26B on a NVIDIA RTX PRO 6000. Standard Inference (left) vs. MTP Drafter (right) in tokens per second. Same output quality, half the wait time.
To make these MTP drafters exceptionally fast and accurate, we introduced several architectural enhancements under the hood. The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out. For our E2B and E4B edge models, where the final logit calculation becomes a big bottleneck, we even implemented an efficient clustering technique in the embedder to further accelerate generation.
We've also been closely analyzing hardware-specific optimizations. For example, while the 26B mixture-of-experts model presents unique routing challenges at a batch size of 1 on Apple Silicon, processing multiple requests simultaneously (e.g., batch sizes of 4 to 8) unlocks up to a ~2.2x speedup locally. We see similar gains with Nvidia A100 when increasing batch size.
Want to see the exact mechanics of how this works? We've published an in-depth technical explainer that unpacks the visual architecture, KV cache sharing and efficient embedders powering these drafters.
The MTP drafters for the Gemma 4 family are available today under the same open-source Apache 2.0 license as Gemma 4. Read the documentation to learn how to use MTP with Gemma 4. You can download the model weights right now on Hugging Face and Kaggle, and start experimenting with faster inference with transformers, MLX, vLLM, SGLang, and Ollama, or try them directly in Google AI Edge Gallery for Android or iOS.
We can't wait to see how this newfound speed accelerates what you build next in the Gemmaverse.
https://github.com/ollama/ollama/pull/15980
Edit: Seems they also have a pre-release version out with the functionality added: https://github.com/ollama/ollama/releases/tag/v0.23.1-rc0
However I find qwen unbeatable for tool calling. I think gemma wasn't trained on that at all.
The current implementation ignores that head, but the PR lets the tool recognize it, plus does proper integration (running the MTP path alongside the slower main path and then comparing the results, I believe).
Then a few weeks back, I gave it another try and I was pleasantly surprised.
It was insanely good!
A colleague and I have been on-and-off trying to build a C++ binary against specific Google libraries for months without success. Then Gemini CLI was able to build the binary after 2-3 days of iterating and refining prompts.
But last month I picked it up again and it has crushed everything I've thrown at it. As Codex limits tighten on the Plus plan it's been my main fallback and doesn't even feel like a downgrade when I switch over. Haven't hit a single loop so far using it nearly every day for several weeks so that problem seems solved finally, thank god.
I've been using it in the auto router mode and haven't felt the need to manually lock in the bigger model yet. It's incredibly snappy which I realized I really appreciate vs. waiting around endlessly for minutes each turn, but I've read other people's experiences needing to manually select the Pro model so YMMV.
I either vibe code a whole personal project, or strongly direct it to generate individual changes. It's fine for both.
The Pro model is the only good model for complex code and I think it's slower than Claude and Codex.
Likely there's a lot of dynamic tweaking of model quality. Rate limits are still fine for me at least.
This is the crux. What makes the guess "right"?
I think the acceptance criterion is not that the token is exactly the token the big model would have produced. It's accepted if the big model verifies that the probability of that token was high enough.
How close it is to the same output (or same distribution of outputs) you'd get from running the big model would be dependent on temperature, top-k, top-p settings, or other inference parameters.
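The two policies being contrasted, written as predicates over the big model's verify-pass logits (a loose sketch of the hypothetical; as noted elsewhere in the thread, popular implementations stick to the strict version):

    import numpy as np

    def accept_exact(draft_tok, logits):
        """Strict: accept only the big model's own greedy pick."""
        return draft_tok == int(np.argmax(logits))

    def accept_loose(draft_tok, logits, k=5, min_p=0.05):
        """Loose: accept any sufficiently likely token. Changes outputs!"""
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        top_k = set(np.argsort(probs)[-k:])
        return draft_tok in top_k and probs[draft_tok] >= min_p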
best is to use your own model router atm, depending on the task
Technically usable but with bad/broken code. I found 3 different bugs with 1 feature, found a duplicate feature (their vibe coding missed the fact that the feature was already implemented), and the docs were wrong. Other features were ridiculously badly implemented. Reported them all, submitted multiple changes. None were accepted. Their repo was a hellscape of AI-generated issues and AI-generated PRs; I think mine was the only one written by a human. This was a month and a half ago.
Google is one of the most valuable corporations in the world, yet even they shipped a turd of an app to real customers and can't even take a bug fix. I think AI coding might be cooked.
I do not use super broad prompts, though. None of this "build me a webapp" stuff. It's more like, "adjust this part of this class to do Y instead of X."
Even with Pro, I have caught it going off the rails a few times. The most frustrating was when I asked it to do translations, and it decided there were too many to do, so it wrote a python script that ran locally and used some terrible library to do literal translations, and some of them were downright offensive and sexual in nature. For translations, though, Gemini is the best, but you have to have it do a sentence or two at a time. If you provide the context around the text, it really knocks it out of the park.
It's gotten much better on token limits and uptime.
I recently reran a screenshot-heavy task that I had last run in January, and it was able to keep running overnight, peaking at maybe 40% quota at any time, vs last time I'd need to resume it maybe twice to get the task to completion.
There are definitely differences in the eagerness to tool-call that you'll need to manage. And for all local models I've ever used, I've had to micromanage the tools provided by servers to eliminate any possibility that they reach for something wonky or confusing.
Gemma4's chat template seems to have had multiple issues, at least with llama.cpp; not sure they're all fixed yet. It assumed simple types for parameters, for example.
[retain(8), delete(6), insert("very very"), retain(10)]
theturtle32@ai1:~$ ollama run gemma4:31b-coding-mtp-bf16
pulling manifest
Error: pull model manifest: 412: this model requires macOS
I just asked: Write the operational transformation sequence and command to turn "this is really beautiful" to "this is very very beautiful"
and in return got: You can map this out by moving a virtual cursor across the text and telling it what to keep, remove, or add. You start by retaining the first eight characters to keep "this is " untouched. Then you delete the next six characters to remove the word "really". In that exact spot, you insert the nine characters for "very very". You finish the operation by retaining the final ten characters, which preserves the space and the word "beautiful". You can code this specific command sequence as [retain(8), delete(6), insert("very very"), retain(10)].
In a large paragraph of text I would expect it to be way quicker and cheaper to generate "[retain(800), delete(6), insert("very very"), retain(10000)]" than repredict the entire remainder of the unedited text.
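And the interpreter for such an op list is trivial, which is part of the appeal. A minimal version:

    def apply_ot(text, ops):
        """Apply a [retain/delete/insert] op list to a string."""
        out, cursor = [], 0
        for op, arg in ops:
            if op == "retain":               # copy arg chars unchanged
                out.append(text[cursor:cursor + arg]); cursor += arg
            elif op == "delete":             # skip arg chars
                cursor += arg
            elif op == "insert":             # splice in new text
                out.append(arg)
        out.append(text[cursor:])            # keep any trailing text
        return "".join(out)

    ops = [("retain", 8), ("delete", 6), ("insert", "very very"), ("retain", 10)]
    print(apply_ot("this is really beautiful", ops))
    # -> this is very very beautiful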
If true, that would suggest gemini/gemma would be great in a RAG situation where a world model isn't needed, as it's being spoonfed all the relevant information, and less good at green field tasks.
That's interesting to me because I have been struggling to understand how gemma4 is so good in my local use and how NotebookLM does such a great job when I give it project docs, and yet gemini has always seemed behind claude when I use it cold for stuff.
It's like branch prediction - the CPU predicts what branch you'll take and starts executing it. Later you find out exactly what branch you took. If the prediction was correct, the speculative executed code is kept. If the prediction was wrong, it's thrown away, the pipeline is flushed, and the execution resumes from the branch point.
The same with this thing: 3 tokens, A-B-C, were "predicted", and you start computing all 3 of them at the same time, hoping that the prediction checks out. And because of the mathematical structure of the transformer, it costs you almost the same to compute 3 tokens at a time as just one - you are limited by bandwidth, not compute. But CRITICALLY, each token depends on all the previous ones, so if you predicted one of the tokens wrongly, you need to discard all tokens predicted after it (flush the pipeline). This is why a prediction is required and why you can't always compute 3 tokens simultaneously - the serial dependency between consecutive tokens. If you were to start computing 3 tokens simultaneously without a prediction, then for token C you would need to assume some exact values for tokens A and B, but those have not been computed yet! But if they were speculatively predicted, you can start and hope the prediction was correct.
One simple example is you can use @ to reference filenames - but the file list is cached and never updates. Ask Gemini to split a file into two files, then type @ and the new files will never appear. Those kind of extremely basic bugs.
But hey, the text has gradient colours...
Gemini 3.1 and 3 Flash are only good for simpler tasks and when the work is not the important part of the project.
Matching the token that would've been picked without speculative decoding. That seems to be more or less agreed upon.
e.g. vLLM docs list tests they run to ensure that output doesn't change if spec. decoding is used: https://github.com/vllm-project/vllm/blob/main/docs/features...
But introducing some threshold to accept other high-probability tokens is an interesting idea.
The draft model quickly generates draft-token 1.
The main model then starts working on two tokens in parallel. It calculates token 1 based on the context, and token 2 based on the context + draft-token 1.
Once the two tokens have been generated, you can check whether the draft-token 1 from the draft model matches token 1 from the main model.
If they match, you have just calculated two tokens in the time it takes to generate one, because the calculation was done in parallel. If they do not match, delete token 2 and generate it again. Since you have already generated the correct token 1 with the big model, you can use the context + token 1 (from the main model). This takes more time, but the result is always the same.
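One way to see where the speedup lands: if each draft token matches with probability a, a k-token draft yields on average (1 - a^(k+1)) / (1 - a) tokens per big-model pass, counting the bonus token from the verify pass. Quick sketch:

    # Expected tokens per big-model forward pass with k draft tokens,
    # each accepted independently with probability a.
    def expected_tokens(a, k):
        return (1 - a ** (k + 1)) / (1 - a)

    for a in (0.5, 0.7, 0.9):
        print(f"a={a}: {expected_tokens(a, 4):.2f} tokens/pass at k=4")
    # Hitting ~2.5x needs acceptance around 0.7+ (before draft overhead),
    # which is what a drafter trained to mimic the target model buys you.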
note that it will sometimes fall back to flash 2, which sucks
I am asking because I am very frustrated with the new quotas and I am hoping to get more mileage out of my subscription.
Heck, I'm still a fan of Gemma 2 9B.
They seem to be doing well. I checked recently and their API is closed to signups due to overwhelming demand.
The AI space moves so fast! I'll check it out again.
The paper they link to in that first paragraph says you compare logits to accept or reject.
I haven't yet compared either to Gemma 4. I tried that out the day after it came out with the patched llama.cpp that added support for it but I couldn't make tool calling work and so it was kind of useless. I should try again to see if things have changed but judging by what people say, qwen-3.6 seems stronger for coding anyway.
Inference parameters select a token using those.
You can just select the top token all the time or you can do it probabilistically.
How you do that in both the speculative decoding and the main inference changes how likely you get the exact same tokens. And then you can choose to accept only if the token matches exactly, or you can choose to accept if it was reasonably likely to be chosen.
Let's say the main model picked the 2nd most likely token and speculative picked the most likely. You can reject that - but you get less speed up. You can accept it, you get more speed up, but you do change the output. You risk the distribution of your outputs not being what you hope.
I am simplifying. I know in https://arxiv.org/pdf/2302.01318 they specify a probability that you reject a token.
Whether it succeeds now depends a lot on the rate of improvement of model architecture. They're betting on model design and capability improvements slowing down - and then wiping the floor with everyone else with their inference economics.
Pro is expensive, but good. However they've decreased the pitiful stipend they used to include in even the Ultra plan to the point where it's barely usable. I pivoted back to ChatGPT Pro after the recent downgrade they gave Ultra users. Google's Ultra plan costs 2.5x as much and delivers about half the usage.
As far as I know, this is not used in practice. Currently popular implementations always match the main model output, and the draft model only affects the speed.
Thanks for the laugh. :)
class Foo {
    // ....
    int calculation() {
        return 42;
    }
    // more stuff
}
where the main model emits something that is a sort of casual under-specified diff format and the merge model figures out how to interpret it as a patch.