Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

I just ran one of these locally on a Mac like this:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
    --backend=gpu \
    --prompt="Generate an SVG of a pelican riding a bicycle"

The first time you run that it downloads 3.2GB to ~/.cache/huggingface/hub/models--litert-community--gemma-4-E2B-it-litert-lm

It can handle audio and image input too, which is pretty cool for a 3.2GB model. For images:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
    --backend=gpu --vision-backend gpu \
    --attachment image.jpg --prompt describe

And for audio:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
    --backend=gpu --audio-backend cpu \
    --attachment audio.wav --prompt transcribe

(The pelican is rubbish, but it's only a 3.2GB file so the fact it even outputs valid SVG is impressive to me: https://gist.github.com/simonw/94b318afde4b1ce5ff67d4b5d0362... )

Unsloth's collection as well [0], with their results [1]. Looks like they can get very close to 100% accuracy compared to the BF16 model that is unquantized, and Unsloth's quants are better than the original Google's QAT as posted in the article.

Personal I'm using the 2B model for web search and structured JSON output back via Unsloth Studio and its API, works very well for that even with the model embedded on phones.

[0] https://huggingface.co/collections/unsloth/gemma-4-qat

[1] https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis

It's a bit awkward to release Gemma 4 12B (https://news.ycombinator.com/item?id=48385906), and then a canonical Q4_0 Gemma 4 12B a couple days later.

It's good that this post lists the expected VRAM usage for the models with Q4_0 Gemma 4 12B being 6.7GB, which will indeed fit Google's claims of fitting within 16GB comfortably, altough it confirms that only the quantized version will do so.

Relatedly, in Google's newly released Edge Gallery for macOS, Gemma 4 12B is explicitly listed as unsupported due to not enough RAM even on a 16GB machine, but given the expected VRAM usage here the Q4_0 variant definitely should fit and Google should fix that.

How can the smaller Unsloth GGUF quant can beat the original google quant? (ref: unsloth/gemma-4-31B-it-qat-GGUF)

Once someone generates a MTP layer for 26B A4B 4 QAT I'll be singing from the hills with my 5 year old GPU.

had a good run with Gemma 4 E2B Unsloth 4Q: https://youtube.com/shorts/XLsAnz5aAAI

The E4B model doesn’t fit on my phone TPU, so it swaps to RAM, the QAT version means more accuracy, good!

Could these quantized models make MTP (Multi-Token Prediction) significantly faster when used as drafters for larger regular Gemma 4 models?

For a moment I got excited thinking QAT is Intel Quick Assist Technology...

@google.com'ers, there are no GGUFs (blog says there is)

I was just testing Gemma E2B and E4B yesterday, and they are just too dumb to be useful outside of niche use cases.

Besides, there's no good agent on Android. Having a model that can't run web searches and browse websites is limited in use, particularly small models that really need to be grounded on search results to be factual, because they can't memorize enough.

Edit: I'd like to know what kind of usage the people that seem to disagree and downvoted this are having.

Personal I'm using the 2B model for web search and structured JSON output back via Unsloth Studio and its API, works very well for that even with the model embedded on phones.

[0] https://huggingface.co/collections/unsloth/gemma-4-qat

[1] https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis

you misunderstand what that chart shows - it shows BF16 QAT Q4_0, not BF16 regular.

meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.

Like storing small 8 bit numbers in full 32 bit integers.

So it's not close to 100% of unquantized BF16.

I'm curious if anybody can explain why Google released 4 bit QAT Q4_0 is not exactly 100% of BF16 QAT Q4_0? seems like it should be just bit twiddling, no further quantization to convert between these two packings. Unsloth talks about "lattice alignment" being an issue.

That being said I hate it that smol model makers, like Google, Qwen, ... only show the BF16 benchmarks when they release a new models, knowing that what people really run are 4-8 bit quantizations, so it's really hard to understand how much you lose when you run 4 bit vs 6 bit...

I'm confused, the unsloth model is ~600mb and the one from google is 7gb?

I just ran one of these locally on a Mac like this:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
    --backend=gpu \
    --prompt="Generate an SVG of a pelican riding a bicycle"

The first time you run that it downloads 3.2GB to ~/.cache/huggingface/hub/models--litert-community--gemma-4-E2B-it-litert-lm

It can handle audio and image input too, which is pretty cool for a 3.2GB model. For images:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
    --backend=gpu --vision-backend gpu \
    --attachment image.jpg --prompt describe

And for audio:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
    --backend=gpu --audio-backend cpu \
    --attachment audio.wav --prompt transcribe

(The pelican is rubbish, but it's only a 3.2GB file so the fact it even outputs valid SVG is impressive to me: https://gist.github.com/simonw/94b318afde4b1ce5ff67d4b5d0362... )

How can the smaller Unsloth GGUF quant can beat the original google quant? (ref: unsloth/gemma-4-31B-it-qat-GGUF)

had a good run with Gemma 4 E2B Unsloth 4Q: https://youtube.com/shorts/XLsAnz5aAAI

The E4B model doesn’t fit on my phone TPU, so it swaps to RAM, the QAT version means more accuracy, good!

It's a bit awkward to release Gemma 4 12B (https://news.ycombinator.com/item?id=48385906), and then a canonical Q4_0 Gemma 4 12B a couple days later.

I'm not sure why you think it's awkward to have multiple releases. It's better to release models and variations as they're ready, not withhold them all until everything is ready to release all at once.

The Q4_0 is a quantization aware training checkpoint. It's not a simple quantization of the original Gemma 4 12B.

not sure if I understand you, but 4Q and QAT 4Q are different

Once someone generates a MTP layer for 26B A4B 4 QAT I'll be singing from the hills with my 5 year old GPU.

Could these quantized models make MTP (Multi-Token Prediction) significantly faster when used as drafters for larger regular Gemma 4 models?

Google already released specialized drafters for Gemma 4.

For a moment I got excited thinking QAT is Intel Quick Assist Technology...

@google.com'ers, there are no GGUFs (blog says there is)

Same I had to do a double take. Would be pretty humourous if they somehow took advantage of crypto offloading to accelerate ai inference

you misunderstand what that chart shows - it shows BF16 QAT Q4_0, not BF16 regular.

meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.

Like storing small 8 bit numbers in full 32 bit integers.

So it's not close to 100% of unquantized BF16.

> meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.

You also misunderstand what is happening. Google did not do that. Google further trained the original model with an objective of minimizing error when quantized to 4-bit. The BF16 QAT is not an upscaled 4-bit model. When quantized to 4-bit, it should lose less accuracy than a typical 16-bit model loses when quantized to 4-bit, but the loss is not zero, because it is not based on a 4-bit model.

https://developers.googleblog.com/en/gemma-3-quantized-aware...

"Instead of just quantizing the model after it's fully trained, QAT incorporates the quantization process during training. QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy. Diving deeper, we applied QAT on ~5,000 steps using probabilities from the non-quantized checkpoint as targets. We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0."

The BF16 is just trained to be more resistant to simulated quantization, which helps when it is actually quantized. Google is not doing post-training on the 4-bit model directly.

Ah I see, thanks for the clarification.

I'm confused, the unsloth model is ~600mb and the one from google is 7gb?

One is quantized, the other one is Quantization-ready.

I was just testing Gemma E2B and E4B yesterday, and they are just too dumb to be useful outside of niche use cases.

Edit: I'd like to know what kind of usage the people that seem to disagree and downvoted this are having.

Google already released specialized drafters for Gemma 4.

Same I had to do a double take. Would be pretty humourous if they somehow took advantage of crypto offloading to accelerate ai inference

> meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.

https://developers.googleblog.com/en/gemma-3-quantized-aware...

The BF16 is just trained to be more resistant to simulated quantization, which helps when it is actually quantized. Google is not doing post-training on the 4-bit model directly.

Ah I see, thanks for the clarification.

The Q4_0 is a quantization aware training checkpoint. It's not a simple quantization of the original Gemma 4 12B.

This is safetensors. Is there any way to run these on a Mac paired with the MLX QAT?

(Pardon my ignorance; this stuff moves so fast)

not sure if I understand you, but 4Q and QAT 4Q are different

It's super annoying when you have products that utilize these because there's...4? releases in 3 weeks?

- Gemma 4 2B/4B/27BE3B/31B

- Gemma 4 2B/4B/27BE3B/31B x "assistant" / MTP drafter models (i.e. multitoken prediction)

- Gemma 4 12B (2 days ago? 1?)

- Gemma 4 QAT 2B/4B/12B/27BE3B/31B x "assistant" models (i.e. multitoken prediction)

It probably sounds silly and really whiny in the abstract. It just causes a ton of work / confusion downstream that feels unnecessary.

Extremely glad for the output, not glad to have to chase it.

ex. llama.cpp currently supports the originals but not the MTP predictors but there is a patch for the MTP predictors but not for the small MoE models and I think it supports the 12B but maybe not media for it yet and now we have these too and the blog says there's GGUFs (llama.cpp models) but there isn't in any of the 12? repos I clicked through. and ~every consumer-facing local LLM app is built on llama.cpp or a fork of it.

Also if anyone at Google is taking feedback over to b/ or product, pleaseeee stop the "E"2B "E"4B thing, unless it's actually taking up less RAM on Android during CPU inference. I can't tell if I need to treat the 4B like an 8B (i.e. beyond most consumer hardware without a GPU) or a 4B (i.e. will run on most consumer hardware since 2021)

EDIT: And, yes, the QAT 12B x mmproj does not work with llama.cpp. I'm glad there's people who have the luxury of not having to, well, actually use these and treat me as whining :) I'll need to schedule another 4-8 hours of work for the 4th time, no fun!

One is quantized, the other one is Quantization-ready.

Ah, nice, ty! My excuse is those repos were added to the collection after my comment, but perhaps not :3

This is safetensors. Is there any way to run these on a Mac paired with the MLX QAT?

(Pardon my ignorance; this stuff moves so fast)

It's super annoying when you have products that utilize these because there's...4? releases in 3 weeks?

- Gemma 4 2B/4B/27BE3B/31B

- Gemma 4 2B/4B/27BE3B/31B x "assistant" / MTP drafter models (i.e. multitoken prediction)

- Gemma 4 12B (2 days ago? 1?)

- Gemma 4 QAT 2B/4B/12B/27BE3B/31B x "assistant" models (i.e. multitoken prediction)

It probably sounds silly and really whiny in the abstract. It just causes a ton of work / confusion downstream that feels unnecessary.

Extremely glad for the output, not glad to have to chase it.

These models aren't products? They are open source ish (open weight I guess), research outputs. While the naming scheme may be confusing, it is relevant and important. I believe it's on you to understand it.

Just use Unsloth Studio it supports them all.

Ah, nice, ty! My excuse is those repos were added to the collection after my comment, but perhaps not :3

Just use Unsloth Studio it supports them all.

I understand it. :)

And you're absolutely right to point out they aren't products - I hoped that was clear - when you're building a product with them, you end up having to do the same build loop 4 times, in this instance :)

I understand it. :)

You can stop after the first one. Choosing to repeat the process is on you, and probably because you see some benefit in using the variant(s) you build on top of.

Your browser does not support the audio element.

Listen to article

This content is generated by Google AI. Generative AI is experimental

[[duration]] minutes

Since releasing Gemma 4 two months ago, we've been continuously working to expand its capabilities. First, we introduced Multi-Token Prediction (MTP) to accelerate inference, and just a couple of days ago, we released a 12B model to bridge the gap between our E4B and 26B MOE models.

Today, we are releasing new checkpoints optimized with Quantization-Aware Training (QAT) to make Gemma 4 even more efficient, so you can run models locally on everyday edge devices and consumer GPUs.

By simulating quantization during training, QAT minimizes quality loss when the model is compressed. This release includes QAT checkpoints for the popular Q4_0 quantization format as well as a novel quantization format specialized for mobile use cases. Using this mobile format, we’ve reduced the memory footprint of Gemma 4 E2B to 1GB. Together, these dramatically reduce memory requirements while preserving the capabilities and quality you expect from Gemma 4.

Keeping model quality while making them smaller

Quantization is a key technology to run models on consumer hardware by reducing their memory footprint while also accelerating decode speed. However, standard Post-Training Quantization (PTQ) often leads to performance degradation. Instead of simply quantizing the model after training, QAT integrates the quantization process directly into training. While PTQ is already effective at preserving quality, our QAT results yield even higher overall quality compared to standard PTQ baselines.

We applied this QAT recipe to the popular Q4_0 format to maximize performance for all the models. For the edge models (E2B and E4B), we rethought how we approach quantization with a special mobile-specialized quantization schema.

Saving on VRAM and Storage

Below are the approximate memory requirements indicating how much VRAM is required to load the models:

Approximate memory requirements indicating how much VRAM is required to load the models.

Optimizing for mobile devices under the hood

Standard compression formats are often hard for mobile processors to run efficiently. To ensure Gemma 4 performs smoothly on mobile, we engineered a custom mobile-quantization schema designed for edge hardware:

Static activations: Normally, models waste processing power calculating how to scale data on the fly. We pre-calculate these settings during training, which reduces workload on mobile chips and makes responses faster.
Channel-wise quantization: We structured the compressed data to fit the design of mobile accelerators. This allows the phone to run calculations natively without needing slow workarounds.
Targeted 2-bit quantization: We heavily compressed (to 2-bit) the specific parts of the model that generate tokens, while keeping the core reasoning layers at higher precision. This saves storage without making the model less smart.
Embedding and KV cache optimization: We focused compression on the model’s vocabulary list and its short-term memory. This drastically reduces the active memory footprint, letting you have long chats without running out of space.

Because our audio and vision encoders are not needed in many use cases, you can optimize your memory footprint even further by deploying only the modalities you need. For example, the Gemma 4 E2B text-only model (without Per-Layer Embeddings) requires less than 1 GB of memory.

Get started today

To make those models easily usable with your preferred workflow, we’ve partnered with popular developer tools across the ecosystem to seamlessly support the Gemma 4 QAT checkpoints starting today:

Download the weights: Access the Q4_0 and mobile model weights right now on Hugging Face. We've tailored the formats to fit your workflow: GGUF formats are ready for use with llama.cpp, and compressed tensors are provided for vLLM. For everything else, we share unquantized checkpoints that can be converted and quantized into formats supporting Q4_0.
Integrate & learn: Explore our documentation to learn how to best deploy the QAT checkpoints.
Try on your desktop: Easily download, manage, and run Gemma 4 QAT models locally on your desktop using user-friendly interfaces like llama.cpp, Ollama and LM Studio.
Deploy on-device: Use Google's lightweight LiteRT-LM runtime for optimized edge deployment or run the models directly on the web with Transformers.js
Use your favorite development tools: Serve larger models efficiently with SGLang and vLLM, optimize for Apple Silicon with MLX. Use the MTP QAT checkpoints to preserve the speedup of MTP while quantizing the models. Fine-tune weights directly using Hugging Face Transformers and Unsloth.

We can't wait to see what you build with Gemma 4 running locally!

Hacker Times