Gemma:31b was more accurate but speed was horrendous.
You can try it out with Ollama 0.23.1 by running `ollama run gemma4:31b-coding-mtp-bf16`.
Credit for the MTP technique is due to https://arxiv.org/abs/2404.19737 from 2024:
Better & Faster Large Language Models via Multi-token Prediction. Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve.
https://github.com/vllm-project/vllm/pull/41745
("Add Gemma4 MTP speculative decoding support")
Focusing more on performance-per-compute efficiency than on pure performance. And maybe that's why Gemini is (seemingly) lagging behind?
Other providers are hitting capacity and hitting the limits of subsidising their inference.
Google's strategy seems to be about scaling and distributing these models to their existing billions of users.
google/gemma-4-31B-it-assistant
google/gemma-4-26B-A4B-it-assistant
google/gemma-4-E4B-it-assistant
google/gemma-4-E2B-it-assistant
I always encourage folks who are interested in LLM internals to read up on speculative decoding (both the basic version and the more advanced MTP), and if you have time, try and implement your own version of it (writing the core without a coding agent, to begin with!)
However, it is a little painful to try to fit the best possible version into 24GB of VRAM with vision plus this drafter. My build doesn't support any more GPUs, and I believe I would want another 4090 (overpriced) for best performance, or otherwise just replace it altogether.
It's not uncommon to see a gemma vs qwen comparison where qwen does a bit better but spent 22 minutes on the task, while gemma aligned the buttons wrong but only spent 4 minutes on the same prompt. So taken at face value, gemma is now underperforming leading open models by 5-10%, but doing it in 1/10th the time.
The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.
Gemma 4 26B-A4B is much quicker on my setup vs Qwen3.6-35B-A3B (by about 3x), so the thought of a 1.5x speedup is tantalizing.
Have tried draft models with limited success (a smaller 3B draft model on top of a dense 14B Ministral model already introduced too much overhead).
They serve gemma-4-26b-a4b-it.
You could pair a big and a small model, like qwen 32b with qwen 4b, and have that same dynamic of the small model generating tokens and the big one "certifying" them.
The blog says something about re-using the big model's data?
I tried first with Qwen but it was unstable and had ridiculously long thinking traces!
Any idea how much worse they will be? Or is the issue that their error will really diverge as you accept more of their tokens?
I'm not seeing any update to the app on my android phone... maybe later today?
>We've published an in-depth technical explainer
I was expecting a pdf link, but this goes to a brief article on twitter/X. lol, okay...
Is gemma-4-E4B-it-assistant a model I can use stand-alone or a model I need to use in combination with gemma-4-E4B-it?
Beta but useable
https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...
Yeah, part of that is installing a model in Chrome for millions of users without consent.
https://www.youtube.com/watch?v=sXgZhGzqPmU
As for why they offer it in the cloud - I think it's just an effort to promote the brand. The Gemmas are pretty small, so they can host them without it being a major drain on the company. They have the infra anyway.
If a GGUF file is generated for MTP, it must include both the big model and the small model. Another comment referenced a llama.cpp PR that also updates the Python script used for converting from safetensors files, which presumably can handle combining the two paired Gemma 4 models.
Might be easier to chuck it over the fence and let other providers handle it, as it'll run on almost any commercial-grade card?
Also speculating, but I wonder if it might also create a bit of a pricing problem relative to Gemini Flash-Lite, depending on serving cost and quality of outputs?
As a comparison, despite being SotA for their size, the smallest qwen models on openrouter (27b and 35b) are not at all worth using, as there are way bigger and better models for a lower price on a per-token basis.
If Gemma 4 is less lucrative than Claude to the Google Cloud kingdom, the Cloud kingdom will want you using Claude.
This is an oversimplification, but tl;dr you need both, yes.
E4B = 4B effective parameters (using per-layer embeddings)
E2B = 2B (like above)
it = instruction tuned (rlhf and all that jazz)
assistant = Multi-token drafters (the new 2x speed up)
Not sure about this implementation, but conceptually it only works well on very capable GPUs for very predictable output. Typical speedup is about 30%; not sure how Google is claiming 250%, which is ridiculous.
And if you don't have enough compute, then you get negative speedup from all the extra overhead.
GPT 5.5/5.4 are the smartest models, but at great token / code bloat cost. Qwen 3.6 Max strikes a good balance. But Gemma 4 26B writes some really efficient code, with great results considering the model size. Things do start falling apart under higher contexts.
Maybe after Google I/O, more people will catch on to how good it is.
That could explain the token usage difference, because larger models usually use fewer tokens per the same unit of intelligence.
Modem vs Claude according to Claude:
300 @ 2368 characters - 1m 19s
1200 @ 2368 characters - 19.7s
2400 @ 2368 characters - 9.9s
14.4K @ 2368 characters - 1.6s
33.6K @ 2368 characters - 705 ms
56K @ 2368 characters - 447 ms
Claude @ 2368 characters - 7.9s
I'm using Gemma 4 31B in my app with 5 agents, 1.5k requests per day, each.
I'm curious where my understanding is wrong, but I didn't think you necessarily got the exact same output with how I understand speculative decoding to be used. I thought that if the small model produces tokens that are "good enough", meaning within the top few tokens the larger model produces, they're accepted.
I thought it doesn't necessarily have to produce the exact same token the larger model would have produced to be accepted (and that requiring this would reduce the hit rate by a lot.) Just one the top model could have produced with whatever top-k and temperature settings.
But I think the key is that in the standard autoregressive case we get memory bandwidth bound, so there are tons of idle compute resources. And so checking multiple tokens is cheap because we can batch and thus reuse the read weights for multiple tokens.
The verification step is similar to a prefill with a small batch size. The difference is what we do with the generated logits.
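Some back-of-envelope math makes this concrete (a minimal sketch; all numbers below are illustrative assumptions, not measurements):

    # Rough roofline sketch for a memory-bandwidth-bound decode step.
    weights_gb = 26        # model weights at the chosen quantization, in GB
    bandwidth_gbs = 1000   # GPU memory bandwidth, GB/s

    # Each decode step must stream all weights once, no matter how many
    # tokens are being verified, so step latency is floored at:
    step_s = weights_gb / bandwidth_gbs   # ~26 ms

    # Verifying 1 token or 8 draft tokens costs roughly the same step,
    # since the extra FLOPs hide behind the weight traffic.
    for k in (1, 2, 4, 8):
        print(f"{k} accepted tokens/step -> {k / step_s:.0f} tok/s")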
Can you give an intuition as to why it's faster? I would have thought regardless how many you run in parallel, the successful check has to execute the full model to generate the full sequence so you will have exactly the same time needed? Or is it by process of elimination so it terminates early once it eliminates the non-viable choices? (in which case, how do you guarantee the correct output was speculatively generated at all to be the last survivor?)
Caveat: Gemini has been dumbed down a few times over the last year. Rate limits tightened up too. So it might not be this good in the future.
Some of the work in that direction like Cerebras or Taalas have been doing is an interesting glimpse of what might be possible. In the meantime it's a fun thought experiment to wonder about what might be possible if even current state of the art models were available at like, a million tokens per second at a very low cost.
The general narrative I would read on HN/others, was that Google would be able to outlast/outcompete OpenAI and Anthropic because Google had both more money and more compute. Playing the game of subsidizing their most capable models to capture market share longer than the VCs could.
But instead I feel like Google opted out of that much earlier, shifting their focus to efficiency and scaling. Flash and Gemma are where Google was actually ahead of the competition while everyone else was focused on bigger, more capable models.
In the last month the environment has changed, compute is constrained, costs for consumers are way higher than expected. Copilot pretty much imploded, and I'm guessing both Anthropic and OpenAI are starting to feel the squeeze.
My personal opinion is that this was necessary because integrating AI into products like AI Overviews and Search meant scaling to billions of users was a requirement right out of the gate. And there's not enough money/compute, no matter who you are, to use frontier models for that.
They're somehow connected to vision & block speculative decode...don't ask me how/why though
For gemma specifically I had more luck with speculative decoding via the llama-server route than LM Studio.
For gemma4 26B, same quantization, I get >200TPS.
Also note that qwen is extremely inefficient in reasoning; the reasoning chains are ~3x longer than gemma on average
Local models are the future, it's awesome.
Predicting "America" in "The United States of ..." Is a different task from predicting the whole sentence.
So the small model is laying the blocks, and the bigger model would be cementing them in place or kicking them down. The bigger model's course correction is what keeps the smaller model's predictions relatively on track.
I already played with Gemma4 on oMLX a while ago. When I have some time I'll check if it supports running MTP models and play a bit more
Edit: Ok, I understand now. You are saying that MTP has two aspects. 1) The training (for the mini-models to generate tokens), and 2) The actual speculative decoding implementation on the inference side (which uses those trained mini-models).
For someone who's been running local models for a long while, these are very very exciting times.
Why, when asking a model to change text in a minor way, are we not asking it to generate the operational transformations necessary to modify the text, and then just executing the OT on the existing text vs reproducing every token? Maybe tools are doing that more than I realize?
Google has now provided small models for each of the previous Gemma 4 models, e.g. "gemma-4-26B-A4B-it-assistant" for "gemma-4-26B-A4B-it".
The difference vs. Qwen is that here each small model is not some general-purpose smaller model, but a model that has been optimized specifically for this task, to predict the output of the bigger model with which it is paired.
This specialization and optimization of the Google "gemma-4-*-assistant" models ensures that they are much smaller and thus much faster than general-purpose small models.
Plus the harness improves over time - the coming version has a hotkey for screen capture, and the next release will have support for native Excel and docx export.
there is value in being offline by design
Most of the complexity in implementing a simple toy version comes from having to get the KV cache back into a good state for the next cycle (e.g. if only the first half of your draft tokens were correct).
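For anyone curious, here's roughly what that rollback looks like (a toy sketch; `kv_cache` is a hypothetical per-layer list of (K, V) tensors, whereas real engines use paged/block allocators rather than naive slicing):

    # Toy KV-cache rollback after a verify step accepted only some drafts.
    def rollback_kv(kv_cache, n_accepted, n_drafted):
        """Keep cache entries for accepted tokens, drop the rest."""
        n_discard = n_drafted - n_accepted
        if n_discard == 0:
            return kv_cache  # every draft token was accepted
        # k and v have shape [seq_len, ...]; trim the stale tail.
        return [(k[:-n_discard], v[:-n_discard]) for k, v in kv_cache]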
The reason it's designed this way is a bit subtle but it has the advantage during training that you can use a single block of 10 tokens to generate 9 training examples in parallel, so it's highly efficient. This efficiency is basically the main benefit of transformers - the algorithm parallelizes really well and that's what allowed the scale up to large language models as opposed to the previous reality of just language models.
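Concretely, that's the standard shift-by-one trick: one block of tokens yields one training example per position, all scored in a single forward pass. A minimal PyTorch sketch (the `model` here is a tiny stand-in, not a real causal LM):

    import torch
    import torch.nn.functional as F

    vocab = 32000
    model = torch.nn.Sequential(            # stand-in for a causal LM
        torch.nn.Embedding(vocab, 64),
        torch.nn.Linear(64, vocab),
    )

    tokens = torch.randint(0, vocab, (1, 10))          # block of 10 token ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]    # shift by one

    logits = model(inputs)                  # [1, 9, vocab]: 9 predictions at once
    loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))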
The blog post does discuss why MTP is faster but it's maybe a bit hard to understand if you haven't studied LLM internals. During inference the hardware has arithmetic units idling because they spend so much time waiting for the weight matrices to get moved closer to the processors. Because data movement and computation can be overlapped, if you can reuse the same loaded data for multiple calculations at once you're winning - it's free latency-wise because you're just exploiting previously idle resources (it's not free in terms of energy).
Speculative decoding and MTP exploit this to run the model in parallel on several tokens at once. Say your context window contains "The United". The KV cache has been populated by the main model for this set of tokens. The draft model is given "The United" and predicts " States of America" in one forward pass (this part where it can predict multiple tokens at once with a single pass is the MTP part). Then the main model is given the KV cache from last time along with " States of America". In its own forward pass it can then compute in parallel the completions of both "The United", "The United States", "The United States of" and "The United States of America" (the last one might be an eos token indicating it wants to stop talking.). That's the speculative decoding part.
Now you decode the main model at each position (look at the token probabilities and pick one according to some decoding strategy). It's possible the main model didn't pick " States" at all, or picked " States", but then its prediction diverged e.g. if it wants to say "The United States is a country". So you just select the tokens that match and toss all the tokens starting from the one that didn't. Repeat.
The parallelism comes almost for free because the same weight matrices can be reused multiple times before they're swapped out for the next.
So this is a case of trading off idle compute capacity that's waiting for the bottleneck (memory access).
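If it helps, here's the whole greedy loop in miniature. This is a sketch, not a real engine: `draft_next` and `target_preds` are hypothetical stand-ins for the two models, where `target_preds` returns the big model's argmax pick after each prefix of the batched verify pass.

    # Minimal greedy speculative decoding step.
    def speculative_step(ctx, draft_next, target_preds, k=4):
        drafted = draft_next(ctx, k)          # e.g. [" States", " of", " America"]
        preds = target_preds(ctx, drafted)    # preds[i]: target's pick given ctx + drafted[:i]
        out = list(ctx)
        for i, tok in enumerate(drafted):
            if preds[i] != tok:               # first divergence: toss the rest
                out.append(preds[i])          # take the big model's token instead
                return out
            out.append(tok)
        out.append(preds[len(drafted)])       # free bonus token from the verify pass
        return out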
It would be nice if this was a bit more obvious and clear too.
They built an entire-wafer ASIC. The whole thing is one huge active die; it takes a lot of clever engineering and cooling to make it work, and is very cool.
As a consumer, 24-32 GB of VRAM is affordable ($1-2k) and that's the frontier I'm most interested in. It's very "two papers down the line". Those models are also feasible to fine-tune, unlike the O(100+B) behemoths. The 4000 Pro Blackwell has a very good TDP compared to people insisting on using 300-600W gaming cards. If I was freelancing, I would definitely consider getting a 6000 for work.
And even alibaba's own qwen3.6-plus is $1.95, so it's kinda easy to conclude that neither alibaba nor anyone else is really interested in hosting that model.
And don't get me wrong, I fully agree with you, qwen3.6 27b is an amazing model. I run it on my own hardware and every day I'm constantly surprised with what it can zero shot.
The draft model essentially predicts the next token quickly, enabling you to start generating the subsequent token in parallel. If the guess is right, the second generated token is correct. If it is wrong, the second generated token is also potentially wrong, so it must be generated again using the correct prior token obtained through the big model.
A poor draft model will simply slow down the process without affecting the output.
The big target model calculates
P(d1)
P(d2|d1)
P(d3|d1 d2)
In parallel. If we were just greedy decoding it would be simple. Just stop when the draft model doesn't predict the most likely token as judged by the target model. At that point, append the correct token from the target model and kick off both models again in parallel.
In practice we aren't using greedy decoding. We are sampling and we need to match the target model's distribution. To do this, we accept tokens from the draft model probabilistically, which is possible because we have the logits of both the draft model and the target at that point. The ratio of their softmax probabilities is used for this.
You are right that actually accepting tokens has to happen sequentially, but that's a heck of a lot faster than a forward pass.
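That acceptance rule from the speculative sampling paper (https://arxiv.org/pdf/2302.01318) is short enough to sketch. Here `p_target` and `q_draft` are assumed to be full probability vectors at one position:

    import numpy as np

    def accept_or_resample(x, p_target, q_draft, rng=np.random.default_rng()):
        """Speculative sampling test for one draft token x."""
        if rng.random() < min(1.0, p_target[x] / q_draft[x]):
            return x, True                    # accepted
        # Rejected: resample from the normalized residual max(0, p - q),
        # which keeps the overall output distribution exactly p_target.
        residual = np.maximum(p_target - q_draft, 0.0)
        residual /= residual.sum()
        return rng.choice(len(p_target), p=residual), False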
> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.
Right, this is the same way batching works. It's "free" until we exhaust available compute resources, at which point decode throughput becomes compute bound. (This is a good place to be, because scaling out compute is a lot easier than adding fast VRAM.) This is why MTP is mostly useful when you have one or few users, which means compute is abundant. When you're running large batches you're better off using that compute to grow your batch size.
Of course, batch size is usually limited by things like bulky KV caches. So perhaps MTP has some residual use in that setting. But if you're sharing cached context in a subagent swarm, or running a model like the recent DeepSeek V4 with its tiny KV cache, you can go a lot further in processing a larger batch.
I thought "fine-tuning" meant training it on additional data to add additional facts / knowledge? I might be mistaking your use of the word "tune", though :)
Antigravity seems significantly better in comparison, but with lower usage limits. If I run out, I usually don't bother switching to Gemini CLI.
Are you having better results?
Codex is fast and decent, but I REALLY have to stay on top of it. The amount of times it makes executive design decisions on the fly to completely break everything is way too high.
But with this patch I saw 46t/s with qwen3.6 27B q8... this is insane, it's 250% faster than the original speed, there was no gpu I could upgrade to get that kind of boost, amazing!
All 4 gemma-4-*-it models, regardless of whether they are dense models or MoE models, have associated small models for MTP, whose names are obtained by adding the "-assistant" suffix.
https://huggingface.co/google/gemma-4-E2B-it-assistant
https://huggingface.co/google/gemma-4-E4B-it-assistant
I'm surprised the 26B-A4B is better? It should be faster too, interesting. I'm excited to try 31B with MTP, because MTP-2 is what makes 27B bearable on the GB10.
What are you using it for? Agent-based coding, or something else?
Not sure why (too amateur sorry).
Though I think qwen was natively trained on tool calling.
So any tests done with models that have not been updated in the last few days are no longer relevant; they must be repeated after updating the models and regenerating any derived file formats, like GGUF files.
Interesting, must try tomorrow.
May 05, 2026
By using Multi-Token Prediction (MTP) drafters, Gemma 4 models reduce latency bottlenecks and achieve improved responsiveness for developers.
Olivier Lacombe
Director, Product Management
Maarten Grootendorst
Developer Relations Engineer

Just a few weeks ago, we introduced Gemma 4, our most capable open models to date. With over 60 million downloads in just the first few weeks, Gemma 4 is delivering unprecedented intelligence-per-parameter to developer workstations, mobile devices and the cloud. Today, we are pushing efficiency even further.
We're releasing Multi-Token Prediction (MTP) drafters for the Gemma 4 family. By using a specialized speculative decoding architecture, these drafters deliver up to a 3x speedup without any degradation in output quality or reasoning logic.
Tokens-per-second speed increases, tested on hardware using LiteRT-LM, MLX, Hugging Face Transformers, and vLLM.

The technical reality is that standard LLM inference is memory-bandwidth bound, creating a significant latency bottleneck. The processor spends the majority of its time moving billions of parameters from VRAM to the compute units just to generate a single token. This leads to under-utilized compute and high latency, especially on consumer-grade hardware.
Speculative decoding decouples token generation from verification. By pairing a heavy target model (e.g., Gemma 4 31B) with a lightweight drafter (the MTP model), we can utilize idle compute to "predict" several future tokens at once with the drafter in less time than it takes for the target model to process just one token. The target model then verifies all of these suggested tokens in parallel.
Standard large language models generate text autoregressively, producing exactly one token at a time. While effective, this process dedicates the same amount of computation to predicting an obvious continuation (like predicting "words" after "Actions speak louder than...") as it does to solving a complex logic puzzle.
MTP mitigates this inefficiency through speculative decoding, a technique introduced by Google researchers in Fast Inference from Transformers via Speculative Decoding. If the target model agrees with the draft, it accepts the entire sequence in a single forward pass, and even generates an additional token of its own in the process. This means your application can output the full drafted sequence plus one token in the time it usually takes to generate a single one.
For developers, inference speed is often the primary bottleneck for production deployment. Whether you are building coding assistants, autonomous agents that require rapid multi-step planning, or responsive mobile applications running entirely on-device, every millisecond matters.
By pairing a Gemma 4 model with its corresponding drafter, developers can achieve:
Gemma 4 26B on a NVIDIA RTX PRO 6000. Standard Inference (left) vs. MTP Drafter (right) in tokens per second. Same output quality, half the wait time.
To make these MTP drafters exceptionally fast and accurate, we introduced several architectural enhancements under the hood. The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out. For our E2B and E4B edge models, where the final logit calculation becomes a big bottleneck, we even implemented an efficient clustering technique in the embedder to further accelerate generation.
We've also been closely analyzing hardware-specific optimizations. For example, while the 26B mixture-of-experts model presents unique routing challenges at a batch size of 1 on Apple Silicon, processing multiple requests simultaneously (e.g., batch sizes of 4 to 8) unlocks up to a ~2.2x speedup locally. We see similar gains with Nvidia A100 when increasing batch size.
Want to see the exact mechanics of how this works? We've published an in-depth technical explainer that unpacks the visual architecture, KV cache sharing and efficient embedders powering these drafters.
The MTP drafters for the Gemma 4 family are available today under the same open-source Apache 2.0 license as Gemma 4. Read the documentation to learn how to use MTP with Gemma 4. You can download the model weights right now on Hugging Face and Kaggle, and start experimenting with faster inference with transformers, MLX, vLLM, SGLang, and Ollama, or try them directly in Google AI Edge Gallery for Android or iOS.
We can't wait to see how this newfound speed accelerates what you build next in the Gemmaverse.
https://github.com/ollama/ollama/pull/15980
Edit: Seems they also have a pre-release version out with the functionality added: https://github.com/ollama/ollama/releases/tag/v0.23.1-rc0
However I find qwen unbeatable for tool calling. I think gemma wasn't trained on that at all.
The current implementation ignores that head, but the PR lets the tool recognize it, plus does proper integration (running the MTP path alongside the slower main path and then comparing the results, I believe).
Then a few weeks back, I gave it another try and I was pleasantly surprised.
It was insanely good!
A colleague and I have been on-and-off trying to build a C++ binary against specific Google libraries for months without success. Then Gemini CLI was able to build the binary after 2-3 days of iterating and refining prompts.
But last month I picked it up again and it has crushed everything I've thrown at it. As Codex limits tighten on the Plus plan it's been my main fallback and doesn't even feel like a downgrade when I switch over. Haven't hit a single loop so far using it nearly every day for several weeks so that problem seems solved finally, thank god.
I've been using it in the auto router mode and haven't felt the need to manually lock in the bigger model yet. It's incredibly snappy which I realized I really appreciate vs. waiting around endlessly for minutes each turn, but I've read other people's experiences needing to manually select the Pro model so YMMV.
I either vibe code a whole personal project, or strongly direct it to generate individual changes. It's fine for both.
The Pro model is the only good model for complex code and I think it's slower than Claude and Codex.
Likely there's a lot of dynamic tweaking of model quality. Rate limits are still fine for me at least.
This is the crux. What makes the guess "right"?
I think the acceptance criterion is not that the token is exactly the token the big model would have produced. It's accepted if the big model verifies that the probability of that token was high enough.
How close it is to the same output (or same distribution of outputs) you'd get from running the big model would be dependent on temperature, top-k, top-p settings, or other inference parameters.
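The two policies being contrasted, written as predicates over the big model's verify-pass logits (a loose sketch of the hypothetical; as noted elsewhere in the thread, popular implementations stick to the strict version):

    import numpy as np

    def accept_exact(draft_tok, logits):
        """Strict: accept only the big model's own greedy pick."""
        return draft_tok == int(np.argmax(logits))

    def accept_loose(draft_tok, logits, k=5, min_p=0.05):
        """Loose: accept any sufficiently likely token. Changes outputs!"""
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        top_k = set(np.argsort(probs)[-k:])
        return draft_tok in top_k and probs[draft_tok] >= min_p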
best is to use your own model router atm, depending on the task
Technically usable but with bad/broken code. I found 3 different bugs with 1 feature, found a duplicate feature (their vibe coding missed the fact that the feature was already implemented), and the docs were wrong. Other features were ridiculously badly implemented. Reported them all, submitted multiple changes. None were accepted. Their repo was a hellscape of AI-generated issues and AI-generated PRs; I think mine was the only one written by a human. This was a month and a half ago.
Google is one of the most valuable corporations in the world, yet even they shipped a turd of an app to real customers and can't even take a bug fix. I think AI coding might be cooked.
I do not use super broad prompts, though. None of this "build me a webapp" stuff. It's more like, "adjust this part of this class to do Y instead of X."
Even with Pro, I have caught it going off the rails a few times. The most frustrating was when I asked it to do translations, and it decided there were too many to do, so it wrote a python script that ran locally and used some terrible library to do literal translations, and some of them were downright offensive and sexual in nature. For translations, though, Gemini is the best, but you have to have it do a sentence or two at a time. If you provide the context around the text, it really knocks it out of the park.
It's gotten much better on token limits and uptime.
I recently reran a screenshot-heavy task that I had last run in January, and it was able to keep running overnight, peaking at maybe 40% quota at any time, vs last time I'd need to resume it maybe twice to get the task to completion.
There are definitely differences in the eagerness to tool-call that you'll need to manage. And for all local models I've ever used, I've had to micromanage the tools provided by servers to eliminate any possibility that they reach for something wonky or confusing.
Gemma4's chat template seems to have had multiple issues, at least with llama.cpp; not sure they're all fixed yet. It assumed simple types for parameters, for example.
[retain(8), delete(6), insert("very very"), retain(10)]
theturtle32@ai1:~$ ollama run gemma4:31b-coding-mtp-bf16
pulling manifest
Error: pull model manifest: 412: this model requires macOS
I just asked: Write the operational transformation sequence and command to turn "this is really beautiful" to "this is very very beautiful"
and in return got: You can map this out by moving a virtual cursor across the text and telling it what to keep, remove, or add. You start by retaining the first eight characters to keep "this is " untouched. Then you delete the next six characters to remove the word "really". In that exact spot, you insert the nine characters for "very very". You finish the operation by retaining the final ten characters, which preserves the space and the word "beautiful". You can code this specific command sequence as [retain(8), delete(6), insert("very very"), retain(10)].
In a large paragraph of text I would expect it to be way quicker and cheaper to generate "[retain(800), delete(6), insert("very very"), retain(10000)]" than repredict the entire remainder of the unedited text.
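And the interpreter for such an op list is trivial, which is part of the appeal. A minimal version:

    def apply_ot(text, ops):
        """Apply a [retain/delete/insert] op list to a string."""
        out, cursor = [], 0
        for op, arg in ops:
            if op == "retain":               # copy arg chars unchanged
                out.append(text[cursor:cursor + arg]); cursor += arg
            elif op == "delete":             # skip arg chars
                cursor += arg
            elif op == "insert":             # splice in new text
                out.append(arg)
        out.append(text[cursor:])            # keep any trailing text
        return "".join(out)

    ops = [("retain", 8), ("delete", 6), ("insert", "very very"), ("retain", 10)]
    print(apply_ot("this is really beautiful", ops))
    # -> this is very very beautiful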
If true, that would suggest gemini/gemma would be great in a RAG situation where a world model isn't needed, as it's being spoonfed all the relevant information, and less good at green field tasks.
That's interesting to me because I have been struggling to understand how gemma4 is so good in my local use and how NotebookLM does such a great job when I give it project docs, and yet gemini has always seemed behind claude when I use it cold for stuff.
It's like branch prediction - the CPU predicts what branch you'll take and starts executing it. Later you find out exactly what branch you took. If the prediction was correct, the speculative executed code is kept. If the prediction was wrong, it's thrown away, the pipeline is flushed, and the execution resumes from the branch point.
The same with this thing: 3 tokens, A-B-C, were "predicted", and you start computing all 3 of them at the same time, hoping that the prediction checks out. And because of the mathematical structure of the transformer, it costs you almost the same to compute 3 tokens at a time as just one - you are limited by bandwidth, not compute. But CRITICALLY, each token depends on all the previous ones, so if you predicted one of the tokens wrongly, you need to discard all tokens predicted after it (flush the pipeline). This is why a prediction is required and why you can't always compute 3 tokens simultaneously - the serial dependency between consecutive tokens. If you were to start computing 3 tokens simultaneously without a prediction, then for token C you would need to assume some exact values for tokens A and B, but those have not been computed yet! But if they were speculatively predicted, you can start and hope the prediction was correct.
One simple example is you can use @ to reference filenames - but the file list is cached and never updates. Ask Gemini to split a file into two files, then type @ and the new files will never appear. Those kind of extremely basic bugs.
But hey, the text has gradient colours...
Gemini 3.1 and 3 Flash are only good for simpler tasks and when the work is not the important part of the project.
Matching the token that would've been picked without speculative decoding. That seems to be more or less agreed upon.
e.g. vLLM docs list tests they run to ensure that output doesn't change if spec. decoding is used: https://github.com/vllm-project/vllm/blob/main/docs/features...
But introducing some threshold to accept other high-probability tokens is an interesting idea.
The draft model quickly generates draft-token 1.
The main model then starts working on two tokens in parallel. It calculates token 1 based on the context, and token 2 based on the context + draft-token 1.
Once the two tokens have been generated, you can check whether the draft-token 1 from the draft model matches token 1 from the main model.
If they match, you have just calculated two tokens in the time it takes to generate one, because the calculation was done in parallel. If they do not match, delete token 2 and generate it again. Since you have already generated the correct token 1 with the big model, you can use the context + token 1 (from the main model). This takes more time, but the result is always the same.
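One way to see where the speedup lands: if each draft token matches with probability a, a k-token draft yields on average (1 - a^(k+1)) / (1 - a) tokens per big-model pass, counting the bonus token from the verify pass. Quick sketch:

    # Expected tokens per big-model forward pass with k draft tokens,
    # each accepted independently with probability a.
    def expected_tokens(a, k):
        return (1 - a ** (k + 1)) / (1 - a)

    for a in (0.5, 0.7, 0.9):
        print(f"a={a}: {expected_tokens(a, 4):.2f} tokens/pass at k=4")
    # Hitting ~2.5x needs acceptance around 0.7+ (before draft overhead),
    # which is what a drafter trained to mimic the target model buys you.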
note that it will sometimes fall back to flash 2, which sucks
I am asking because I am very frustrated with the new quotas and I am hoping to get more mileage out of my subscription.
Heck, I'm still a fan of Gemma 2 9B.
They seem to be doing well. I checked recently and their API is closed to signups due to overwhelming demand.
The AI space moves so fast! I'll check it out again.
The paper they link to in that first paragraph says you compare logits to accept or reject.
I haven't yet compared either to Gemma 4. I tried that out the day after it came out with the patched llama.cpp that added support for it but I couldn't make tool calling work and so it was kind of useless. I should try again to see if things have changed but judging by what people say, qwen-3.6 seems stronger for coding anyway.
Inference parameters select a token using those.
You can just select the top token all the time or you can do it probabilistically.
How you do that in both the speculative decoding and the main inference changes how likely you get the exact same tokens. And then you can choose to accept only if the token matches exactly, or you can choose to accept if it was reasonably likely to be chosen.
Let's say the main model picked the 2nd most likely token and speculative picked the most likely. You can reject that - but you get less speed up. You can accept it, you get more speed up, but you do change the output. You risk the distribution of your outputs not being what you hope.
I am simplifying. I know in https://arxiv.org/pdf/2302.01318 they specify a probability that you reject a token.
Whether it succeeds now depends a lot on the rate of improvement of model architecture. They're betting on model design and capability improvements slowing down - and then wiping the floor with everyone else with their inference economics.
Pro is expensive, but good. However they've decreased the pitiful stipend they used to include in even the Ultra plan to the point where it's barely usable. I pivoted back to ChatGPT Pro after the recent downgrade they gave Ultra users. Google's Ultra plan costs 2.5x as much and delivers about half the usage.
As far as I know, this is not used in practice. Currently popular implementations always match the main model output, and the draft model only affects the speed.
Thanks for the laugh. :)
class Foo {
    // ....
    int calculation() {
        return 42;
    }
    // more stuff
}
where the main model emits something that is a sort of casual under-specified diff format and the merge model figures out how to interpret it as a patch.