In our little-known, difficult-to-game benchmarks, it scored about as well as GPT 5.2 and Gemini 3 Pro Preview on one-shot coding problems. It had me re-reviewing our entire benchmarking methodology.
But it struggled in the other two sections of our benchmark: agentic coding and non-coding decision making. Tool use, iterative refinement, managing large contexts, and reasoning outside of coding brought the scores back down to reality. It actually performed worse when it had to use tools and a custom harness to write code for an eval vs getting the chance to one-shot it. No doubt it's been overfit on common harnesses and agentic benchmarks. But the main problem is likely scaling context on small models.
Still, incredible model, and incredible speed on an M-series Macbook. Benchmarks at https://gertlabs.com
I tried running the same on an M3 Max with less memory, but couldn't increase the context size enough to be useful with Opencode.
It's also easy to integrate it with Zed via ACP. For now it's mostly simple code review tasks and generating small front-end related code snippets.
I'm really surprised how that was not obvious.
Also, instead of limiting context size to something like 32k, you can offload the MoE layers to the CPU with --cpu-moe, at the cost of roughly halving token generation speed.
“Write a single file web page that implements a 1 dimensional bin fitting calculator using the best fit decreasing algorithm. Allow the user to input bin size, item size, and item quantity.”
Qwen3.5, Nemotron, Step 3.5, and gpt-oss all passed on the first go.
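For anyone curious what the models were being asked to implement: the core of best-fit decreasing is only a few lines. A minimal Python sketch of just the algorithm (the actual prompt asks for a single-file web page around it):

```python
def best_fit_decreasing(bin_size, items):
    """Pack item sizes into bins of capacity bin_size using
    best-fit decreasing. Returns a list of bins (lists of sizes)."""
    bins = []        # contents of each opened bin
    remaining = []   # free space left in each bin
    for item in sorted(items, reverse=True):  # the "decreasing" step
        # Best fit: among bins the item fits in, pick the tightest one.
        best = None
        for i, free in enumerate(remaining):
            if item <= free and (best is None or free < remaining[best]):
                best = i
        if best is None:                 # no existing bin fits: open a new one
            bins.append([item])
            remaining.append(bin_size - item)
        else:
            bins[best].append(item)
            remaining[best] -= item
    return bins
```

For example, `best_fit_decreasing(10, [7, 5, 3, 3, 2])` packs everything into two bins: `[[7, 3], [5, 3, 2]]`.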
1) Pin to an earlier version of codex (sorry) - 0.55 is the best experience IME, but YMMV (see https://github.com/openai/codex/issues/11940, https://github.com/openai/codex/issues/8272).
2) Use the older completions endpoint (llama.cpp's responses support is incomplete - https://github.com/ggml-org/llama.cpp/issues/19138)
If you're just chatting or doing less precise things it's 1000% worth it going down to Q8 or sometimes even Q4
What I would like is for it to be able to detect when these things happen and to "Phone a Friend", asking a smarter model for advice.
I'm definitely moving into agent orchestration territory, where I'll have a number of agents constantly running and working on things so that I'm not the bottleneck. I'll have a mix of on-prem and AI providers.
My role now is less coder and more designer / manager / architect, as agents readily go off on tangents and into messes that they're not smart enough to get out of.
In fact, I started using it as a coding partner while learning how to use the Godot game engine (and some custom 'skills' I pulled together from the official docs). I purposely avoided Claude and friends entirely, and just used Gemma4 locally this week... and it's really helped me figure out not just coding issues I was encountering, but also helped me sift through the documentation quite readily. I never felt like I needed to give in and use Claude.
Very, very pleased.
Rubbish, we have been calling tools locally for 2 years, and it's very false that gemma3 scored under 7% in tool calling. Hell, I was getting at least 75% tool calling with llama3.3
Something like:
* Human + Claude Opus sets up project direction and identifies research experiments that can be performed by a local model
* Gemma 4 on local hardware autonomously performs smaller research experiments / POCs, including autonomous testing and validation steps that burn a lot of tokens but can convincingly prove that the POC works. This is automatically scheduled to fully utilize the local hardware. There might even be a prioritization system to make these POC experiments only run when there's no more urgent request on the local hardware. The local model has an option to call Opus if it's truly stuck on a task.
* Once an approach is proven through the experimentation, human works with Opus to implement into main project from scratch
If you can get a complex harness to work on models of this weight-class paired with the right local hardware (maybe your old gaming GPU plus 32gb of RAM), you can churn through millions of output tokens a day (and probably like ~100 million input tokens though the vast majority are cached). The main cost advantage compared to cloud models is actually that you have total control over prompt caching locally which makes it basically free, whereas most API providers for small LLM models ask for full price for input tokens even if the prompt is exactly repeated across every request.
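To put rough numbers on that claim (all prices below are made-up placeholders for illustration, not any provider's actual rates):

```python
# Back-of-envelope comparison: the parent's daily volume priced through an
# API that charges full rate for repeated (uncacheable) input tokens,
# versus a local setup where repeated prefixes hit the prompt cache.
# All rates are illustrative assumptions.
input_tokens_per_day = 100_000_000   # ~100M, mostly repeated prompt prefixes
output_tokens_per_day = 2_000_000    # "millions of output tokens"
price_per_m_input = 0.10             # assumed $/1M input tokens
price_per_m_output = 0.40            # assumed $/1M output tokens

api_cost = (input_tokens_per_day / 1e6) * price_per_m_input \
         + (output_tokens_per_day / 1e6) * price_per_m_output
print(f"API cost/day: ${api_cost:.2f}")  # input tokens dominate

# Locally, only the uncached tail of each prompt is actually processed.
uncached_fraction = 0.05             # assumed cache miss rate
local_equivalent_inputs = input_tokens_per_day * uncached_fraction
print(f"Input tokens actually processed locally: {local_equivalent_inputs:,.0f}")
```

Even at cheap assumed rates, the repeated input tokens are the bulk of the bill, which is exactly the part local prompt caching makes close to free.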
Gonna run some more tests later today.
My early takeaway is that Gemma 26B-A4B is the best tuned out of the bunch, but being small and with few active params, it's severely constrained by context (large inputs and tasks with large required outputs tank Gemma 26B's performance). We're working on a clean visualization for this; the data is there.
It's not uncommon for a sub-release of a model to show improvements across the board on its model card, but actually have mixed real performance compared to its predecessor (sometimes even being worse on average).
The rate limiting step is the LLM going down stupid rabbit holes or overthinking hard and getting decision paralysis.
The only time raw speed really matters is if you are trying to add many many lines of new code. But if you are doing that at token limiting rates you are going to be approaching the singularity of AI slop codebase in no time.
I am using a 24GB GPU so it might be different in your case, but I doubt it.
The system prompt and tools have very little overhead (<2k tokens), making the prefill latency feel noticeably snappier compared to Opencode.
[0] https://www.npmjs.com/package/@mariozechner/pi-coding-agent#...
Now both codex and opencode seem to work.
So if you have not updated your model, you should do it.
It's interesting - imo we'll soon have draft models specifically post-trained for denser, more complicated models. Wouldn't be surprised if diffusion models made a comeback for this - they can draft many tokens at once, and learning curves seem to top out at 90+% match for auto-regressive ones so quite interesting..
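The draft-then-verify loop behind speculative decoding is easy to sketch. A toy greedy version with stand-in "models" that emit single characters (real implementations accept or reject draft tokens probabilistically against the target's distribution, not by exact match):

```python
def speculative_step(draft_model, target_model, prefix, k=4):
    """One round of speculative decoding (greedy, toy version).
    The draft model proposes k tokens; the target model checks them in
    order, keeps the longest agreeing prefix, then adds one token of
    its own. The better the draft matches, the more tokens per pass."""
    proposed = []
    ctx = prefix
    for _ in range(k):
        tok = draft_model(ctx)
        proposed.append(tok)
        ctx = ctx + tok
    accepted = []
    ctx = prefix
    for tok in proposed:
        if target_model(ctx) == tok:   # target agrees with the draft
            accepted.append(tok)
            ctx = ctx + tok
        else:
            break
    accepted.append(target_model(ctx))  # target's own next token, always kept
    return prefix + "".join(accepted)

# Toy models: both continue the alphabet, but the draft goes wrong after 'c'.
target = lambda ctx: chr(ord(ctx[-1]) + 1)
draft = lambda ctx: chr(ord(ctx[-1]) + 1) if ctx[-1] < "c" else "x"
```

Here `speculative_step(draft, target, "a", 4)` returns `"abcd"`: two draft tokens are accepted, the third is rejected, and the target supplies the fourth itself. The 90+% match rates mentioned above mean most rounds keep nearly all k drafted tokens.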
Banking, scientific data analysis, sales, etc. Everything uses and manipulates CSVs.
So it was firmly in the bottom quartile of difficulty - and there LLMs actually do quite well.
I (a hobbyist running a small side project for a dollar or two a month in normal usage, so my account is marked as "individual") got hit with a ~$17,000 bill from Google Cloud because either a key got leaked or my homelab got compromised, and the attacker consumed tens of thousands of dollars in Gemini usage in only a few hours. It wasn't even the same Google project as my main one; it was another that hadn't seen activity in over a year.
Google refuses to apply any adjustments; their billing specialist even mixed up my account with someone else's, refuses to provide further information on why adjustments are being rejected, refuses any escalation, etc. I already filed a complaint with the FTC and the NYS Attorney General, but the rep couldn't care less.
My gripe is not that the key was potentially leaked or compromised and I have to pay for a very expensive "you messed up" mistake. It's that they let an API key rack up tens of thousands of dollars in maybe 4 hours with usage patterns (model selection, generating text vs. images, volume of calls, likely a different IP and user agent, and so on) that looked nothing like the account's history. That's just predatory behavior on an account marked as individual/consumer (not a business).
Technically, I use OpenWebUI with Ollama, so I used the weights below, but it should be the same.
https://ollama.com/kwangsuklee/Qwen3.5-27B-Claude-4.6-Opus-R...
FYI the latest iteration of that finetune is here: https://huggingface.co/Jackrong/Qwopus3.5-27B-v3
Or maybe the author has been running heavily quantized small models all that time — Gemma 4 gguf he's using is Q4 and only 16 GB. In my experience quants like this tend to perform much worse.
I also recommend anyone with a GB10 device to go try out the spark-vllm-docker setup, and check the Nvidia GB10 forums for the recently released optimised Qwen 3.5 122B A10B setup: 50tk/s is quite impressive for a decent local model!
I also find that you can coerce a wide spectrum of otherwise declined queries by editing the model's initial rejection into the start of an answer. For example, changing the "I'm sorry I can't answer that..." response to "Here's how..." and then resubmitting the inference, allowing it to continue from there. It's not perfect, sometimes it takes multiple attempts, but it does work. At least in my experience. (This isn't a Gemma-specific tip, either. Nearly every model I've tried it with tends to bend quite a bit under this.)
I tend to use Huihuiai versions.
So yes, do purchase that new MacBook Pro.
As you have so much RAM I would suggest running Q8_0 directly. It's not slower (perhaps except for the initial model load), and might even be faster, while being almost identical in quality to the original model.
And just to be sure: you are running the MLX version, right? The mlx-community quantization seemed to be broken when I tried it last week (it spat out garbage), so I downloaded the unsloth version instead. That too was broken in mlx-lm (it crashed), but has since been fixed on the main branch of https://github.com/ml-explore/mlx-lm.
I unfortunately only have 16 GiB of RAM on my MacBook M1, but I just tried running the Q8_0 GGUF version on a 2023 AMD Framework 13 with 64 GiB of RAM using just the CPU, and it works surprisingly well, with tokens/s much faster than I can read the output. The prompt cache is also very useful for quickly inserting a large system prompt or a file to datamine, although there are probably better ways to do that than manually through a script.
Anyway, is it possible that this may be what lies behind Gemma 4's "censoring"? As in, Google took a deliberate choice to focus its training on certain domains, and incorporated the censor to prevent it answering about topics it hasn't been trained on?
Or maybe they're just being sensibly cautious: asking even the top models for critical health advice is risky; asking a 32B model probably orders of magnitude moreso.
That's the idea behind distillation. They are finetuning it on traces produced by opus. This is poor man's distillation (and the least efficient) and it still works unreasonably well for what it costs.
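Mechanically, "finetuning on traces" usually just means collecting prompt/response pairs from the bigger model into a supervised fine-tuning dataset. A hedged sketch of that pipeline step (the chat-message schema here is the common OpenAI-style one; adjust field names to whatever your trainer actually expects):

```python
import json

def traces_to_sft(traces):
    """Convert (prompt, teacher_response) pairs into chat-format JSONL
    lines for supervised fine-tuning on a smaller student model."""
    lines = []
    for prompt, response in traces:
        record = {"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

# Hypothetical trace collected from the teacher model:
traces = [("Explain KV caches.", "A KV cache stores attention keys and values...")]
print(traces_to_sft(traces))
```

The student never sees the teacher's weights or logits, only its outputs, which is why this is the "poor man's" version compared to logit-level distillation.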
That link doesn't have much affiliation with Qwen or anyone who produces/trained the Qwen models. That doesn't mean it's not good or safe, but it seems quite a stretch to suggest it's the latest or greatest Qwen iteration.
I can see huggingface turning into the same poisoned watering-hole as NPM if people fall into the same habits of dropping links and context like that.
Thanks to the settings suggestions in the article, I was able to squeeze in the 31b model. Still testing, but it's real tight in 24gb of vram. A bit slower, too, but usable. Not sure I'm seeing much of a quality boost yet, but I'm still testing.
Unfortunately, I've had zero success running Gemma with the mlx-lm main branch. Can you point me to the right way to do it? I have zero experience with mlx-lm.
On the 48GB Mac, absolutely. The 24GB one cannot run Q8, hence the comparison.
> And just to be sure: you are running the MLX version, right?
Nah, not yet. I have only tested in LM Studio and they don't have MLX versions recommended yet.
> but has since been fixed on the main branch
That's good to know, I will play around with it.
That is a bad premise and a false dichotomy, because most medical questions are simple, with well-known standard answers. ChatGPT and Gemini answer such questions correctly, also finding glaring omissions by doctors, even without having to look up information.
As for the medical questions that are not simple, the ones that require looking up information, the model should in principle be able to respond that it does not know the answer when this is truthfully the case, implying that the answer, or a simple extrapolation thereof, was not in its training data.
Your explanation would make sense if various other rare domains were also censored, but they aren't, so it doesn't.
> asking even the top models for critical health advice is risky
Not asking, and living in ignorance, is riskier. For high-stakes questions, of course I'd want references that only an online model like ChatGPT or Gemini, etc. would be able to find. If I am asking a local model for health advice, odds are that it is because I am traveling and am temporarily offline, or am preparing off-grid infrastructure. In both cases I definitely require a best-effort answer. I also require the model to be able to tell when it doesn't know the answer.
If you would, ignore health advice for a moment, and switch to electrical advice. Imagine I am putting together electrical infrastructure, and the model gives me bad advice, risking electrocution and/or a serious fire. Why is electrical advice not censored, and what makes it not be high-stakes!? The logic is the same.
For the record, various open-source Asian models do not have any such problem, so I would rather use them.
Connecting Ollama to OpenCode and OpenWebUI is relatively trivial. In OpenWebUI there's a nice GUI. In OpenCode, you just edit ~/.config/opencode/opencode.json to look something like this. The model names have to match the ones you see in OpenWebUI, but the friendly "name" key can be whatever you need to be able to recognize it.
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "qwen3.5:122b": {
          "name": "Qwen 3.5 122b"
        },
        "qwen3-coder:30b": {
          "name": "Qwen 3 Coder"
        },
        "gemma4:26b": {
          "name": "Gemma 4"
        }
      }
    }
  }
}

I'm saying it's the latest iteration of the finetuned model mentioned in the parent comment.
I'm also not suggesting that it's "the latest and greatest" anything. In fact, I think it's rather clear that I'm suggesting the opposite? As in - how can a small fine tune produce better results than a frontier lab's work?
The best image is the largest: it looks the best, but it takes up the most memory when loading and uses up much of your system's resources.
On the other end of the spectrum there is a smaller much more compressed version of that same image. It loads quickly, uses less resources, but is lacking detail and clarity of the original image.
AI models are similar in that fashion, and the parent poster is suggesting you use the largest version of the AI model your system can support, even if it runs a little slower than you like.
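The analogy maps directly onto file sizes. A rough estimate of what a 26B-parameter model weighs on disk at different quantisations (the bits-per-weight figures are approximations; real GGUF files add metadata and keep some layers at higher precision):

```python
def approx_model_gb(params_b, bits_per_weight):
    """Approximate model file size in GB for a given quantisation:
    parameter count times bits per weight, converted to bytes."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight for common GGUF quant types.
for name, bpw in [("F16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    print(f"{name:7s} ~{approx_model_gb(26, bpw):5.1f} GB")
```

This lines up with the ~16 GB Q4 file mentioned elsewhere in the thread: full precision would be ~52 GB, Q8 roughly halves it, and Q4 roughly halves it again.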
If I was prepping, I’d want e.g. Wikipedia available offline and default to human-assisted decision-making, and definitely not rely on a 31B parameter model.
To be reductive, the ‘brain’ of any of these models is essentially a compression blob in an incomprehensible format. The bigger the delta between the input and the output model size, the lossier the compression must be.
It therefore follows (for me at least) that there’s a correlation between the risk of the question and the size of model I’d trust to answer it. And health questions are arguably some of the most sensitive - lots of input data required for a full understanding, vs. big downsides of inaccurate advice.
> If you would, ignore health advice for a moment, and switch to electrical advice. Imagine I am putting together electrical infrastructure, and the model gives me bad advice, risking electrocution and/or a serious fire. Why is electrical advice not censored, and what makes it not be high-stakes!? The logic is the same.
You’re correct that it’s possible to find other risky areas that might not be currently censored. Maybe this is deliberate (maybe the input data needed for expertise in electrical engineering is smaller?) or maybe this is just an evolving area and human health questions are an obvious first area to address?
Either way, I’m not trusting a small model with detailed health questions, detailed electrical questions, or the best way to fold a parachute for base jumping. :)
(Although, if in the future there’s a Gemma-5-Health 32B and a Gemma-5-Electricity 32B, and so on, then maybe this will change.)
That's a weird demand from models. What next, "Imagine I'm doing brain surgery and the model gives me bad advice", "Imagine I'm a judge delivering a sentencing and the model gives me bad advice", ...
The sentiment still applies to the parent comment of yours, though.
Secondly, the primary point was about censorship, not accuracy, so let's not get distracted.
Except with electrical stuff the unit test itself can put your life and others in danger.
I assumed it was more about risk management/liability than censorship.
[Image: Sketchnote comparing Gemma 4 local inference on a 24 GB MacBook Pro and Dell GB10, showing that model quality beats raw token speed for agentic coding.]
I wanted to know whether Gemma 4 could replace a cloud model for my day-to-day agentic coding. Not in theory, in practice. I use Codex CLI every day, running GPT-5.4 as my default model. It works well, but every token costs money and every prompt sends my code to someone else’s server. I also have friends thinking seriously about spending real money on local setups, and so far I had not been convinced that would be useful for this kind of work. I was open to being wrong. Gemma 4 promised local tool calling that works. I spent a day finding out whether that held up once Codex CLI started reading files, writing patches and running tests.
I set up two machines. A 24 GB M4 Pro MacBook Pro, the laptop I carry everywhere, running the 26B MoE variant via llama.cpp in Q4_K_M because that was the highest practical fit in memory. And a Dell Pro Max GB10, 128 GB of unified memory on an NVIDIA Blackwell chip, running the 31B Dense variant via Ollama v0.20.5. Both configured as custom model providers in Codex CLI's config.toml with wire_api = "responses". Then I ran the same code generation task on both, and on the cloud model as a baseline.
By the end of the day I had both local setups completing the task, but only after a lot of time spent staring at stalled requests, broken tool calls and one Mac configuration that was much faster than its final result justified.
Three things pushed me towards local models. First, cost. I run Codex CLI heavily, multiple sessions a day, sometimes in parallel. The API bills add up. Second, privacy. Some of the codebases I work with should not leave my machine. Third, resilience. Cloud APIs throttle, go down and change pricing. A local model runs.
The reason I had not done this before is that local models could not call tools. Codex CLI’s entire value comes from the model reading files, writing code, running tests and applying patches. If the model cannot reliably emit {"tool": "Read", "args": {"file": "package.json"}}, it is useless as an agent. Previous Gemma generations scored 6.6 per cent on the tau2-bench function-calling benchmark. That is 93 failures out of 100. Not a foundation for anything.
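The mechanics are worth spelling out: an agent harness is, at its core, a loop that parses structured tool calls out of model output, executes them, and feeds results back. A stripped-down sketch (the Read tool and JSON shape mirror the example above; everything else here is an illustrative assumption, not Codex CLI's actual implementation):

```python
import json

# Minimal tool registry. Real harnesses add Write, Bash, ApplyPatch, etc.
TOOLS = {
    "Read": lambda args: open(args["file"]).read(),
}

def run_agent_step(model_output):
    """Handle one model turn. If the output is a well-formed tool call,
    execute it and return the observation to feed back to the model;
    otherwise treat it as a final answer. A model that can't reliably
    emit this JSON breaks the whole loop, which is why tool-calling
    benchmark scores matter so much for agentic use."""
    try:
        call = json.loads(model_output)
        tool = TOOLS[call["tool"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return ("final", model_output)   # not a tool call
    return ("observation", tool(call["args"]))
```

A 6.6 per cent success rate on emitting that JSON means the loop above dead-ends on almost every turn; 86.4 per cent makes it usable.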
Gemma 4 31B scores 86.4 per cent on the same benchmark. That is what made this test worth running.
Neither machine worked on the first attempt.
The Mac. I started with Ollama, because it is the simplest path. On my M4 Pro Apple Silicon setup, two bugs killed it immediately. A streaming bug in v0.20.3 routes Gemma 4’s tool-call responses to the wrong field, landing them in the reasoning output instead of the tool_calls array. Separately, a Flash Attention freeze hangs Ollama on any prompt longer than about 500 tokens with Gemma 4 on Apple Silicon. Codex CLI’s system prompt alone is roughly 27,000 tokens. In practice that meant the request would arrive, the prompt would start ingesting, and then nothing useful would happen.
I switched to llama.cpp, installed via Homebrew. The working server command has six load-bearing flags:
llama-server \
-m /path/to/gemma-4-26B-A4B-it-Q4_K_M.gguf \
--port 1234 -ngl 99 -c 32768 -np 1 --jinja \
-ctk q8_0 -ctv q8_0
Every flag matters on 24 GB. I am no expert here, but I did spend quite a bit of time trying different options out. The -np 1 limits to a single slot, because multiple slots multiply KV cache memory. The -ctk q8_0 -ctv q8_0 quantises the KV cache, reducing it from 940 MB to 499 MB. The --jinja flag is required for Gemma 4's tool-calling template. And -m with a direct path avoids the -hf flag, which silently downloads a 1.1 GB vision projector that causes an out-of-memory crash.
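The KV-cache numbers check out if you do the arithmetic: cache size scales linearly with context length and with bits per stored value. A generic estimator (the hyperparameters below are placeholders chosen to reproduce the reported 940 MB and 499 MB figures, not Gemma 4's published architecture):

```python
def kv_cache_mb(n_layers, n_kv_heads, head_dim, ctx_len, bits_per_value):
    """Approximate KV cache size: two tensors (K and V) per layer,
    each storing n_kv_heads * head_dim values per token of context."""
    values = 2 * n_layers * n_kv_heads * head_dim * ctx_len
    return values * bits_per_value / 8 / 1e6

# Illustrative placeholder config, NOT Gemma 4's real dimensions.
# q8_0 stores ~8.5 bits per value (8-bit weights plus per-block scales).
f16 = kv_cache_mb(n_layers=28, n_kv_heads=2, head_dim=128,
                  ctx_len=32768, bits_per_value=16)
q8 = kv_cache_mb(n_layers=28, n_kv_heads=2, head_dim=128,
                 ctx_len=32768, bits_per_value=8.5)
print(f"F16: {f16:.0f} MB, q8_0: {q8:.0f} MB")
```

Whatever the real architecture, the ratio is fixed by the formats: q8_0 cuts the KV cache to about 53 per cent of F16, which is exactly the 940 MB to 499 MB drop observed.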
The Codex CLI config also needs web_search = "disabled", because Codex CLI sends a web_search_preview tool type that llama.cpp rejects. I got to that point by reading error messages, checking GitHub issues and rerunning the same request with one flag changed at a time.
The GB10. I expected vLLM to work, as the plan I was following recommended it. It did not. vLLM 0.19.0’s compiled extensions are built against PyTorch 2.10.0, but the only CUDA-enabled PyTorch for aarch64 Blackwell (compute capability sm_121) is 2.11.0+cu128. Different ABI. ImportError at startup. I built llama.cpp from source with CUDA, and it compiled and benchmarked fine, but Codex CLI's wire_api = "responses" sends non-function tool types that llama.cpp rejects.
What worked was Ollama v0.20.5. On my GB10, the streaming bug that broke Apple Silicon did not reproduce on NVIDIA. ollama pull gemma4:31b, SSH tunnel to forward port 11434 to my Mac (because Codex CLI's --oss mode checks only localhost), and codex --oss -m gemma4:31b. Text generation and tool calling both worked on the first attempt.
The Mac setup took most of an afternoon. The GB10 took about an hour, most of it waiting for model downloads.
I gave all three configurations the same task through codex exec --full-auto: write a parse_csv_summary Python function with error handling, write tests and run them. This was a single practical spot check, not a statistically robust benchmark, but it was enough to compare failure modes inside the same Codex CLI workflow.
GPT-5.4 produced type-hinted code with proper exception chaining, boolean type detection and a clean helper function. Five tests passed first time in 65 seconds, and I did not have to clean anything up afterwards.
The GB10’s 31B Dense produced functional code without type hints or boolean detection, but with solid error handling and no dead code left behind. Five tests passed on the first attempt after three tool calls. Total time: seven minutes.
The Mac’s 26B MoE left dead code in the implementation, including a type inference loop written, abandoned in place, then rewritten below it with the comment ‘Actually, let’s simplify’ still in the source. The test file took five attempts to write. Each time the model introduced a different heredoc failure: filerypt instead of file_path, encoding=' 'utf-8' with a rogue space, fileint(file_path). Ten tool calls to accomplish what the GB10 did in three. That result should be read as a 24 GB, Q4_K_M, Codex CLI harness result, not as a universal verdict on Gemma 4 on Apple Silicon.
I ran llama-bench on both machines with the same context lengths.
The Mac generates tokens 5.1 times faster than the GB10. I did not expect that, because both machines have 273 GB/s LPDDR5X memory bandwidth.
The explanation is the Mixture of Experts architecture. Token generation is memory-bandwidth limited: every token requires reading the model’s active parameters from memory. The 31B Dense reads all 31.2 billion parameters for every token. The 26B MoE activates only 3.8 billion per token, roughly 1.9 GB at Q4 quantisation. The Mac pushes 1.9 GB per token through its 273 GB/s bandwidth and gets 52 tok/s. The GB10 pushes 17.4 GB per token through the same bandwidth and gets 10 tok/s. Same pipe, vastly different payload.
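You can sanity-check this with the bandwidth-bound ceiling: decoding speed cannot exceed memory bandwidth divided by the bytes that must be read per token.

```python
def max_tok_per_s(bandwidth_gb_s, active_params_gb):
    """Theoretical ceiling for memory-bandwidth-limited decoding:
    every generated token streams the active weights from memory once."""
    return bandwidth_gb_s / active_params_gb

mac = max_tok_per_s(273, 1.9)    # 26B MoE: ~3.8B active params at ~Q4
gb10 = max_tok_per_s(273, 17.4)  # 31B dense: all params read every token
print(f"Mac ceiling:  {mac:.0f} tok/s (measured 52)")
print(f"GB10 ceiling: {gb10:.0f} tok/s (measured 10)")
```

Both measured numbers land well under their ceilings, as expected, since KV-cache reads, attention compute, and scheduling overhead all eat into the budget, but the roughly 9x gap in ceilings explains the roughly 5x gap in measured speed.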
Prompt processing was the other result I had wrong in my head before I ran it. I expected the GB10’s Blackwell GPU to dominate, but the Mac held its own: 531 tok/s versus 548 tok/s at 8K context. The MoE’s sparse activation appears to help prompt processing too, not only generation.
I went into this assuming token speed would dominate the experience. On this task, it did not.
The Mac generated tokens 5.1 times faster. It still finished only 30 per cent sooner (4m 42s versus 6m 59s). The time went into retries: ten tool calls instead of three, five failed test writes and dead code the model did not clean up. The GB10’s slower model got it right first time.
The cloud model made the same point more sharply. It was fastest, used the fewest tokens and needed no repair pass. Five out of five in 65 seconds. For this workflow, first-pass reliability mattered more than raw generation speed.
But local is viable. Both machines produced working code with passing tests. The quality gap between Gemma 3 (6.6 per cent tool calling) and Gemma 4 (86.4 per cent) is the gap that matters. Going from ‘broken’ to ‘works’ is the step that makes local agentic coding practical. For the Mac result in particular, the caveat is quantisation: this was the highest-memory-fit Q4_K_M setup on a 24 GB machine, not a claim that every Gemma 4 deployment behaves this way. I have not rerun the same task yet at a higher quant on a roomier Apple Silicon machine, and I would expect that to matter.
I can see how a hybrid approach might be useful. codex --profile local for iteration and privacy-sensitive work. Default cloud for anything complex. Codex CLI's profile system makes switching a single flag.
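As a sketch, the hybrid setup could look something like this in Codex CLI's config.toml (the keys below are assumptions extrapolated from the setup described in this post; verify them against your Codex CLI version's documentation before relying on them):

```toml
# ~/.codex/config.toml -- hypothetical sketch, not verified against any
# specific Codex CLI release.
[model_providers.local-gemma]
name = "Gemma 4 via llama.cpp"
base_url = "http://localhost:1234/v1"
wire_api = "responses"          # as used in the setup above

[profiles.local]
model = "gemma-4-26B-A4B-it"
model_provider = "local-gemma"
```

Then `codex --profile local` for private or iterative work, and the default profile for anything that needs the cloud model.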
A few specifics from the setup that will save you time.
On Apple Silicon, for the workload I tested, Ollama was not usable with Gemma 4. I would use llama.cpp with --jinja. Set web_search = "disabled" in your Codex CLI profile. Use -m with a direct GGUF path, not -hf. Set context to 32,768 (Codex CLI's system prompt needs at least 27,000 tokens) and quantise the KV cache with -ctk q8_0 -ctv q8_0.
On my NVIDIA GB10, Ollama v0.20.5 was the first path that worked reliably. Use codex --oss -m gemma4:31b. If the machine is remote, tunnel port 11434 via SSH.
Set stream_idle_timeout_ms to at least 1,800,000 in your provider config. A single tool-call cycle took one minute 39 seconds on the Mac. The default timeout will kill your session before the model finishes thinking.
And pin your llama.cpp version. A reported 3.3 times speed regression between builds means your benchmarks can change overnight.
Benchmarks were run on 12 April 2026 using Codex CLI v0.120.0. Mac: llama.cpp ggml 0.9.11 (build 8680) on a 24 GB M4 Pro MacBook Pro, model gemma-4-26B-A4B-it Q4_K_M. GB10: Ollama v0.20.5 on a Dell Pro Max GB10 (128 GB, NVIDIA Blackwell), model gemma-4-31B-it Q4_K_M. Cloud baseline: GPT-5.4 with high reasoning effort. All three ran the same prompt through codex exec --full-auto. Raw speed benchmarks used llama-bench.