A model that comfortably fits in 16GB of VRAM (allowing room for context) is a welcome upgrade.
Is it simply goodwill and/or marketing? Or am I missing something strategic?
> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.
That's technically encoding, just without using a dedicated model for it like SigLIP? The Developer's Guide elaborates, it's still a 35M layer which I am curious is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...
> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.
I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.
[0] https://ollama.com/library/gemma4/tags
Edit: MLX being Mac-only is independent of the model being MLX (and therefore Mac) only. The latter is what I am asking about.
I'm curious how they pre-trained it... I feel like it must have had audio/image output that they chopped off.
I wonder how hard it would be to add it back on.
I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog)
Wait, *Excluding Chinese language.
This is ... curious.
P.S. Where is gemma 4 124b?
This is often a separate module grafted onto the main model, and further pre-trained (e.g. OpenAI's CLIP, SigLIP used in the Gemma 3 and PaliGemma series).
The image encoder approach has a few problems.
One problem is that many like Gemma 3's encoder have fixed image resolution constraints and inputs must be resized with all the attendant distortions that causes with spatial understanding. However, the Gemma 4 series image encoders overcame this and can handle variable-dimension inputs.
Two, these image encoders are somewhat large (ranging from 300-500M parameters) requiring extra memory and FLOPs to run.
Three, say we need to fine-tune a vision language model, updates to its weights, may affect its understanding of the representations generated by the image encoder if we don't fine-tune both together.
The new Gemma-4-12B replaces the encoder (with its many attention layers and large parameter count) with a simple linear projection to generate the embeddings for images. That reduces the computational requirements and simplifies the input pipelines for image processing.
I don't have any expertise on the topic though and might very well be wrong on some details.
They rise with the tide of AI adoption. But they gain ground if people opt into Google solutions. And any token sent to a Google model (free or paid) actively punishes their competitors that are then required to spend vast sums to remain bleeding edge.
They remind me a bit of HuggingFace, create something great then make money … maybe.
https://developers.google.com/edge/gallery
Anyone with a 16GB Mac — that is quite a lot of journalists, surely — can download that, install a model into it, and play.
Surely journalists have to start asking questions at least about OpenAI's consumer revenue projections now.
I really like the 26B Gemma model — I've been using it to teach myself some fundamental things and learn opencode without developing a cloud dependency. It writes fairly good code. But if this 12B model is even half as close as they say it is, this casts some doubt on the consumer end of things.
(Not clear if this app is using the MTP drafters; I've still not got them working with Gemma myself, though the Qwen 3.6 built-in MTP support is super in LM Studio)
That's my experience right now... my company is all in on a plethora of platform products. Also, Microsoft just yesterday said their goal was "Unmetered intelligence". There's a lot of things that can be enabled by small local models, and those things are part of stacks that can generate revenue in other layers.
If that inference becomes popular and valuable enough that those companies make billions of dollars in profit, those companies could use that profit to fund the building of alternative products and platforms that dis-intermediate google's relationship with the customer.
Google already has an 80% gross margin business, the biggest one in the world. Everybody wants a slice of it.
By offering frontier inference closer to cost and open-sourcing everything that's sub-frontier, they're commoditizing frontier labs' models, which inhibits their ability to durably make high gross margins on inference.
It's a strategic play.
So it's easier to just release those models as open source and make it official, since someone would inevitably hack the weights out anyway.
The question is: do you want to release your models, or use them purely for R&D?
Since everyone else is already releasing models of similar qualities, it's hard to say you're shooting yourself in the foot if you join the chorus.
The added cannibalization of releasing them is effectively zero, so the reputational benefits are likely to be worth it.
Ideally companies would share the fucking datasets and training code already, but no, no one wants to talk about the source of those or even share the ones they have as then who knows what comes out of Pandora's box...
Of course it is...
This is Windows-Licensing-Level Money Opportunity 2.0.
They could lock them down legally which would prevent commercial use, but they choose not to, and they boast about how many tens of millions of times Gemma models have been downloaded by developers.
So there must be more to the rationale than just local model weights getting hacked out of devices.
Nobody would be looking at Qwen if their ~30b class models weren't fantastically good, it's great advertising and builds significant goodwill with developers, who are going to be your biggest advocates.
The other thing is, all these models are already disposable grade, and in a year they'll all be outclassed by The Next Big Thing. "Open" models are less than 18 months behind SOTA right now and I can't imagine that will slow down much over the next two years, they may even begin to close the gap. Nobody even talks about llama 4 anymore despite only being a year old.
FAIR did this 2 years ago now: https://arxiv.org/abs/2405.09818
I've been waiting for something like this to be released since then.
The annoying thing is that chameleon was multi-modal out based on the same principles, but this model is just inputs... (I'm curious how they did pre-training without having multi-modal outputs as well. I wonder if they just chopped them off rather than support image output).
12b means 12G @ 8 bits/param (basically lossless) and 6G at 4 b/p (generally accepted 'pretty close' level). Not too bad?
But TBD how well the base model performs before thinking too much about quantization
- Transcribing scanned documents into formatted text
- Captioning/describing images and classifying them for audience suitability (includes anti-spam)
- Matching documents with relevant Wikipedia pages for tagging
I don't use them like frontier models. I break the work down into micro-tasks with one clear goal for each prompt. I write a lot of glue software to make the complete flow work. I was working on all of these tasks before LLMs appeared on the scene. The LLMs have allowed me to replace a lot of complicated code with less code plus a model, while achieving better results.
I use local models for reasons of cost and control. I already had the workstation and GPU. The only running cost is electricity. I have used proprietary models from OpenAI and Google for some of these tasks, but I also encountered churn when the models I built my tools around were retired. I don't worry about that when I have the weights saved locally.
But between same (V)RAM requirement 4 bit 26B-A3B and 8 bit 12B it's unclear which one will win, especially given one is MoE and the other dense.
All the launch benchmarks are at 16 bit.
If you have a very specific idea for local model use you can find a way to make it work very well, you don't even need to have a graphics card or NPU chip. You just have to be extremely constrained in how it's used. I think as a generic chatbot they're not great, I'd use a hosted SOTA model and I'm a big fan of local LLMs myself.
I am not overly impressed with the smaller gemma models. And gemma 3 was a bit of a mixed bag, great at some things, bad at most others
In practice I haven't got around to building something around multimodality since I'm primarily using their text generation capabilities.
I saw a little app the other day, I think someone posted on here, that looks at your screenshot and renames the file based off the contents of the file.
There's tons of little examples like that. For a lot of use cases, you really don't need the frontier models.
Companies don't commonly give away executable binaries "just because", why'd they start now for these binary blobs that are the models?
Not that I'm unhappy about it! Yay for open data any day, I'm just not understanding why, at least beyond PR in nerd circles
So perhaps another part is just Google showing that they can indeed play at the big boys table.
I was sure "MLX" stood for "Metal-something-something" but can't find any reference to that somehow, anywho, "Metal" is hardware-accelerated graphics on Apple platforms FWIW.
Edit: about the actual release on Ollama, if you're on non-Apple hardware you probably want the NVFP4 variant ("gemma4:12b-nvfp4") which was uploaded 45 minutes ago, especially if you're with a recent nvidia GPU.
> Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.
Consumers were complaining about the standard 8GB with the early 2020 refresh of MacBook Pros, many OSes ago. Sure, it might be workable for many tasks (as evidenced by the recent sales of the MacBook Neo), but users with a mere 8GB shouldn't have expectations of LLM performance. Even 16GB feels like a stretch.
> By offering frontier inference closer to cost *and* open-sourcing everything that's sub-frontier
It's two prongs! One prong is that their frontier inference pricing is significantly cheaper/closer-to-at-cost as Anthropic's.
The subject of this thread is the other prong: offering compelling models that are sub-frontier and self-hostable.
Self-hosting models and at-cost frontier models are the high-end and low-end disruptions, respectively, to Ant/OAI/etc.'s business models.
They need one more than ever now.
This is ridiculously anti-competitive.
After that a s1/s2 system: fast generation, slow wave correction / observation operating over the fast generation seems like the next leap forward.
Tokens create and hide too many problems to be the 'optimal' solution.
Jun 03, 2026
Gemma 4 12B is designed to bring high-performance multimodal intelligence directly to your laptop, combining mobile-first efficiency with advanced reasoning.
Olivier Lacombe
Director of Product Management, Google Deepmind
Gus Martins
Product Manager, Google DeepMind

Your browser does not support the audio element.
Listen to article
This content is generated by Google AI. Generative AI is experimental
[[duration]] minutes
Today, we are introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our edge-friendly E4B and our more advanced 26B Mixture of Experts (MoE), Gemma 4 12B packages powerful capabilities inside a reduced memory footprint. It is also our first mid-sized model to feature native audio inputs.
Thanks to the developer community, Gemma 4 models have now crossed 150 million downloads. You’ve built everything from wearable robotic arms for physical assistance to enterprise-grade AI security. We're excited to see what you build with this latest addition.
Here’s an overview of what makes Gemma 4 12B unique:
Together, these features bring advanced multimodal capabilities to everyday hardware without sacrificing speed or reasoning. Let's now take a closer look at how Gemma 4 12B achieves this.
Gemma 4 12B delivers performance nearing our larger 26B MoE model on standard benchmarks, but at less than half the total memory footprint. Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.

What makes Gemma 4 12B stand out is its streamlined approach to processing visual and audio inputs. Traditional multimodal models typically rely on separate encoders to translate images and audio before passing those representations to the language model. Because these split encoders add latency and increase memory usage, we trained Gemma 4 12B with an encoder-free architecture to integrate audio and vision input directly.
Here is how Gemma 4 12B processes multimodal inputs natively:
For developers who want a breakdown, head over to our companion Gemma 4 12B Developer Guide.
Related stories
.