I run something very similar except for directly using pi as the agentic harness I use little-coder that wraps pi with reasonable defaults for running local models. Even though my local setup is a bit slow, it is a thrill to do real work completely locally.
aivo pi hf:unsloth/gemma-4-26B-A4B-it-GGUF
It feels like a GPT-4 class model in terms of "stored knowledge" but is better at long-horizon tool calling than any of the GPT-4 class models.
Running on a 128GB MBP M4 Max, I'm getting ~24 t/s on generation and ~200 t/s on prefill. I was expecting it to feel slow, and it certainly does when e.g. generating code, but it's surprisingly useful as a "machine orchestrator" for simple tasks.
For non-agentic usecases, it's a decent enough model to converse with, and has the benefit of being entirely self-contained/private.
One way or another local AI is the future. I actually find weaker models more interesting because it keeps me sharp (at the cost of velocity of course).
harbor up omlx opencode
I am not convinced that the MTP setup for the QAT model adds very much in terms of speed on my M1 Max, but it is definitely worth experimenting with.
Fiddling about with local models has done so much for my conceptual understanding of what is going on.
FWIW and YMMV but I also found the Gemma 4 MTP head was occasionally breaking markup in Opencode, causing the thinking to display untidily and ultimately in some cases missing the stop token. So I've stopped using MTP there for now.
Recent Qwen 3.6 models have developer role support so it will occasionally surprise you with a structured multiple choice questionnaire.
LLAMA_CACHE="models" ./llama-server \
-hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL \
...You would not need to follow a blog post with omlx IMHO
Thats the rub. I have an M4 with 48G. I wonder if it is worth testing this out.
My past attempts (with Ollama and various LLMs) were too slow to use.
> Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
> Each benchmark generated about 128 tokens.
Generating 128 tokens is probably not enough for good benchmark results. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short testing can give false positive speedups.
llama.cpp includes a tool specifically for benchmarking that will sweep the arguments for you so you don't have to restart the server and send it prompts:
https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
EDIT: Also the section about downloading the models should have mentioned that llama.cpp has a "-hf" argument that will download the models for you. I appreciate the author for sharing their experience, but for beginners this might not be the best guide to use.
Alas, this video appears not have been linked to the text that describes it. Perhaps I should ask an AI to generate an artistic rendering of the author's description.
is this article old? It's not. I'm not sure why he went through all the bother of llama.cpp
The Gemma 4 MLX builds I have found so far have been slower at the same quantisation and much slower with MTP.
The built-in web UI for llama.cpp is really quite good once you have chosen your model. Otherwise I quite like LM Studio for tinkering.
One thing I would say is that both Gemma-4 and Qwen 3.6 simply do not need a large chunk of the typical opencode system prompt. Better off without it.
I'm not Googling much of anything anymore. 9/10 times the information is awful, it's hard to parse out of whatever other spam it's surrounded by. Meanwhile, Claude will just do the thing one-shot or with a tiny bit of refinement.
The gateway to knowledge and getting stuff done is the LLM.
Google Search is a dinosaur.
It feels like we're living a century into the future. Not even smartphones were this cool.
But there is an incentive not to use it if you want to write an article that uses only open-source tools, because it isn't.
But if you just want to play around rather than code, you really might find the Gemma 4 12B model worth mucking about with just so you've gone through the steps. Especially if you want to muck about with image analysis or audio transcription.
If you're writing PHP I think you could even find it good enough. I've been modestly surprised. You can do that basic fiddling with the Edge AI Gallery app, which can enable thinking and has a customisable system prompt and some agent support.
You could also try the 14B Deepseek R1.
Honestly even if it is not good enough, if you are anything like me, I think you'll find that going through this process is really quite educational — it has made a lot of things more concrete for me in a way that I have found reassuring and valuable.
llama.cpp includes tools for that, what you are looking at is to have a prefill before token generation to measure it properly. Increasingly also, measuring token generation speed at longer context (32k or 64k) is important too.
So there is no value in testing quality of answers, but there is value in testing token speed.
You just have to have correct expectations.
oMLX does the caching I need to fit models that are near gross memory, and it handles most of the work in finding usable models. After cobbling together various solutions over months, I now just use oMLX, often from Xcode. I can tell the difference between Gemma-4 (local/free) and Claude (paid) only on the largest tasks.
I do enjoy their different personalities when they are tackling "explain this" type puzzles, though.
Gemma writes so well — like a concise code blogger. It makes you understand that the thing we hate about AI slop writing is specifically the cheesy, marketingese sycophantic ChatGPT tone. It's a choice to sound that way.
Qwen writes more tersely by default, like much english language documentation in Chinese open source projects. A couple of lines, code example, fact, code example, line of blurb.
I use this prompt every now and then with a new model. It's obviously a classic SQL puzzle but I've asked new web developers this in the past (prompted by discovering that a client's subcontractor didn't understand it and was therefore unable to migrate some code from relying on dodgy pre-MySQL 5.x behaviours)
—
I have a MySQL 5 table like this: [id, label, category, score]. It contains a list of items in different categories (text names like cat1, cat2, cat3) with a numerical score. Is there a way I can write a SQL query to find the item in each category that has the highest score, without using a subquery? No two entries in any category share a score.
—I enjoy seeing what it deduces from the subtext.
Without "thinking" mode on, they always initially fail and you need to prompt them to find the answer. With thinking mode, they both produce really nice explanations.
For me, as an old freelancer who is pretty cynical about vibe coding or "agentic engineering", what I really want is an AI tool that can help me start to solve problems and help me find the right terminology or generate some boilerplate I can tinker with. Both of these models do fine at the kind of "starter" writing that I want when I am trying to untangle an idea.
sbx policy set-default open
just so the single pi sandbox can talk to localhost? ... this gives me some grave doubts about the rest of it being set up well.Not knocking huggingface-cli, just find it's much easier for people to try out this stuff when they can just
mise use --global github:ggml-org/llama.cpp
LLAMA_CACHE="models" llama-server \
-hf unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL \
--host 0.0.0.0 \
--port 11434 \
...If you're seeking the kind of hands-off claude experience, obviously not. They are slow.
If you want to learn how these things work, train them locally, tinker, play with the code, grasp the fundamentals, or just out of sheer bloody-mindedness and principle refuse to tether the functioning of your application to a cloud API...
—no-mmproj
is also pretty useful if you're doing this just to try agentic coding and you're not processing images/voice. Stops it downloading the multimodal projector.https://newsletter.pessimistsarchive.org/p/when-educators-mo...
New decade, same old argument.
It's not
> "Claude, think for me"
It's
> "Claude, be my subordinate and get this done for me"
Instead of complaining on the sidelines, I'm getting a shit ton of work done.
> NoSuchKeyThe specified key does not exist…
It’s weird when people are proud of doing ton of work. Im the opposite, Im proud that Im doing minimal stuff without llms.
Yeah, good ol' present for me too then, thanks.
Nah, you are just producing a bunch of slop and hope that nobody notices.
An argument can be as old as the search engine and hold real value. There are ways in which unreflective search engine use has misled and mistrained people.
There’s always been argument to be had about how we manage and offload attention, what we gain and what we lose when resistance is reduced. It’s part of reflection that’s been necessary in order to make progress solid ground, and is more necessary with non-deterministic tech.
The phrase “Tactical tornados” may be older than web search and describes people who also got a lot done.
Models can be incredibly helpful boosters and situationally effective subordinates… and also patchy as a real engineering IC or org.
I'd had my internet fail a few times recently leaving me stranded without a coding agent, and so when I saw the "Gemma 4 now runs 2x faster with MTP" Multi-Token Prediction update for Gemma 4 I decided to have a go at getting it running.
I wanted a local coding agent setup that:
And I did! This video is realtime. And shows the agent responding at a perfectly usable speed.
After a bit of testing the final setup I ended up with is:
This was tested on an Apple M1 Max with 64 GB unified memory, running macOS 15.7.7.
The main model is: gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf.
Link on Huggingface: models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
That file is about 16 GB. With the MTP draft head and multimodal projector the model folder is about 17 GB.
The benchmark prompt was:
Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
Each benchmark generated about 128 tokens.
First I ran the main model directly through llama.cpp with Metal acceleration:
repos/llama.cpp/build/bin/llama-cli \ -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \ -ngl 999 \ -fa on \ -c 4096 \ -n 128
Result:
| Setup | Prompt tok/s | Generation tok/s |
|---|---|---|
| Gemma 4 26B-A4B Q4, llama.cpp Metal | 298.0 | 58.2 |
58 tokens/second is not fast, but is usable, but for coding-agent work you want it to be as fast as possible, especially when the agent is making many tool calls.
Gemma 4 now has the MTP draft model available:
MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf
This can be loaded by llama.cpp as a speculative draft model:
repos/llama.cpp/build/bin/llama-cli \ -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \ --model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \ --spec-type draft-mtp \ --spec-draft-n-max 3 \ -ngl 999 \ -fa on \ -c 4096 \ -n 128
The first run with MTP came in at 69.2 tokens/second using 4 draft tokens. However, Unsloth's guide on How to Run MTP Models includes this note:
"We found --spec-draft-n-max 2 is the best starting point however, do not assume 2 is optimal, as performance is hardware-dependent. Try any value from 1 through 6 and use whichever is fastest for your system."
After sweeping --spec-draft-n-max, the best result was 72.2 tokens/second with 3 draft tokens.
| Setup | Prompt tok/s | Generation tok/s | Speedup |
|---|---|---|---|
| Main model only | 298.0 | 58.2 | 1.00x |
| Main model + Q8 MTP draft | 295.6 | 72.2 | 1.24x |
The useful part is that prompt processing stayed basically the same, while generation improved by about 24%.
I tested --spec-draft-n-max values from 1 to 6.
--spec-draft-n-max |
Prompt tok/s | Generation tok/s |
|---|---|---|
| 1 | 295.5 | 68.4 |
| 2 | 299.1 | 72.0 |
| 3 | 295.6 | 72.2 |
| 4 | 297.3 | 70.7 |
| 5 | 297.9 | 63.7 |
| 6 | 296.3 | 61.2 |
On my M1 Max machine, 3 was the fastest, with 2 close enough that either would be fine. Values above that got slower.
I also tested MLX models through mlx-lm, to find out which is the faster way to run the model on a Mac, llama.cpp or mlx.
| Runtime | Model | Generation tok/s |
|---|---|---|
| llama.cpp Metal + MTP | Unsloth GGUF Q4 + Q8 MTP | 72.2 |
| llama.cpp Metal | Unsloth GGUF Q4 | 58.2 |
| MLX-LM | Unsloth UD MLX 4-bit | 45.8 |
| MLX-LM | mlx-community 4-bit | 43.9 |
| MLX-LM | mlx-community OptiQ 4-bit | 38.1 |
I thought MLX (being optimised for the Mac) would be fastest.
However, for this specific setup, llama.cpp was faster than MLX, and llama.cpp with MTP was clearly the best option.
I guess all the effort and tweaking which has gone into llama.cpp over time means it quite well optimised fr macOS despite being cross platform.
I also tried Gemma 4 MTP through gemma-4-swift-mlx, but the tested 26B 4-bit MLX checkpoints did not match the loader's expected weight keys, and I already had the previous MLX tests, so moved on rather than redownload new models and try to tweak things to match.
For Pi, I also wanted to be able to attach screenshots. The local model entry I setup for it originally declared the model as text-only:
"input": ["text"]
That meant Pi did not send image tool output through to the model properly.
The llama.cpp server also needs the Gemma 4 multimodal projector in order for the multi-modal part to work (only the 12B is natively multi-modal):
mmproj-BF16.gguf
When loaded with --mmproj, llama.cpp advertises multimodal support, and Pi can send images.
I re-ran the text benchmark with the projector loaded, just to check it didn't change the speed:
| Setup | Projector | Prompt tok/s | Generation tok/s |
|---|---|---|---|
| llama.cpp Metal + MTP | none | 120.3 | 71.4 |
| llama.cpp Metal + MTP | mmproj-BF16.gguf |
297.4 | 72.2 |
The final run with the projector did not show a text-generation slowdown.
Now for setup instructions:
Install dependencies:
brew install cmake git tmux python@3.11
Clone and build llama.cpp:
mkdir -p ~/Developer/ML-Models/Gemma4/repos cd ~/Developer/ML-Models/Gemma4 git clone https://github.com/ggml-org/llama.cpp repos/llama.cpp cd repos/llama.cpp cmake -B build \ -DCMAKE_BUILD_TYPE=Release \ -DGGML_METAL=ON \ -DGGML_ACCELERATE=ON cmake --build build --config Release -j
The build I tested had:
GGML_METAL=ON GGML_ACCELERATE=ON GGML_BLAS=ON GGML_BLAS_VENDOR=Apple
Create a Python environment:
cd ~/Developer/ML-Models/Gemma4 python3.11 -m venv .venv source .venv/bin/activate pip install -U huggingface_hub hf_xet
Download the files:
mkdir -p models/unsloth-gemma-4-26B-A4B-it-GGUF huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \ gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \ mmproj-BF16.gguf \ MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \ --local-dir models/unsloth-gemma-4-26B-A4B-it-GGUF
You should end up with:
models/unsloth-gemma-4-26B-A4B-it-GGUF/ gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf mmproj-BF16.gguf MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf
This is the final server command:
repos/llama.cpp/build/bin/llama-server \ -m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \ --model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \ --mmproj models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf \ --spec-type draft-mtp \ --spec-draft-n-max 3 \ -ngl 999 \ -fa on \ -c 65536 \ --parallel 1 \ --host 127.0.0.1 \ --port 8080
The OpenAI-compatible endpoint is:
http://127.0.0.1:8080/v1
I used a small start_server.sh wrapper so it runs inside tmux:
#!/usr/bin/env bash set -euo pipefail ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SESSION_NAME="${SESSION_NAME:-gemma4-server}" HOST="${HOST:-127.0.0.1}" PORT="${PORT:-8080}" CTX_SIZE="${CTX_SIZE:-65536}" PARALLEL="${PARALLEL:-1}" LLAMA_SERVER="$ROOT_DIR/repos/llama.cpp/build/bin/llama-server" MODEL="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf" DRAFT_MODEL="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf" MMPROJ="$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf" LOG_FILE="$ROOT_DIR/logs/llama-server-mtp.log" mkdir -p "$ROOT_DIR/logs" tmux new-session -d -s "$SESSION_NAME" -c "$ROOT_DIR" \ "$LLAMA_SERVER \ -m '$MODEL' \ --model-draft '$DRAFT_MODEL' \ --mmproj '$MMPROJ' \ --spec-type draft-mtp \ --spec-draft-n-max 3 \ -ngl 999 \ -fa on \ -c '$CTX_SIZE' \ --parallel '$PARALLEL' \ --host '$HOST' \ --port '$PORT' \ 2>&1 | tee -a '$LOG_FILE'"
Start it:
chmod +x start_server.sh ./start_server.sh
Check that the server is running:
curl http://127.0.0.1:8080/v1/models
Pi reads model providers from:
~/.pi/agent/models.json
Add a local provider:
{ "providers": { "gemma4-local": { "name": "Gemma 4 Local", "baseUrl": "http://127.0.0.1:8080/v1", "api": "openai-completions", "apiKey": "local", "authHeader": false, "compat": { "supportsDeveloperRole": false, "supportsReasoningEffort": false }, "models": [ { "id": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf", "name": "Gemma 4 26B-A4B Q4 + MTP", "reasoning": false, "input": ["text", "image"], "contextWindow": 65536, "maxTokens": 8192, "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 } } ] } } }
The important pieces are:
baseUrl points to the llama.cpp OpenAI-compatible server.api is openai-completions.authHeader is false, because this is a local server.input includes both text and image, otherwise Pi treats it as text-only.Optionally make it the default in:
~/.pi/agent/settings.json
{ "defaultProvider": "gemma4-local", "defaultModel": "gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf", "defaultThinkingLevel": "minimal" }
Then check Pi can see it:
pi --offline --list-models gemma
Expected:
provider model context max-out thinking images gemma4-local gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf 65.5K 8.2K no yes
Run Pi using the local model:
pi --provider gemma4-local --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
Or use non-interactive mode:
pi -p --provider gemma4-local --model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \ "Explain what this repository does"
For screenshots:
pi -p @"/path/to/screenshot.png" "Describe this image and point out anything relevant to the UI"
The final local coding-agent stack was:
| Layer | Choice |
|---|---|
| Inference runtime | llama.cpp |
| macOS acceleration | Metal + Accelerate |
| Main model | gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf |
| Draft model | gemma-4-26B-A4B-it-Q8_0-MTP.gguf |
| MTP setting | --spec-draft-n-max 3 |
| Multimodal projector | mmproj-BF16.gguf |
| Server | llama-server on 127.0.0.1:8080 |
| API | OpenAI-compatible /v1 |
| Coding agent | Pi |
| Pi model input | ["text", "image"] |
The main conclusion was that the MTP draft model is worth using. On this machine it took Gemma 4 from 58.2 tokens/second to 72.2 tokens/second, while keeping the setup simple enough to run as a local OpenAI-compatible server.
P.S: Some suggested using Qwen3.6 35B-A3B instead of Gemma 4 26B-A4B. According to the benchmarks I can find, Qwen is a much better coding agent than Gemma 4.
However, it is also slower. Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf + unsloth-Qwen3.6-35B-A3B-MTP-GGUF + mmproj-BF16.gguf results in 55 tk/s, instead of 72 tk/s. Which is quite significant when you are sitting waiting for it.
Download the models:
mkdir -p models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF \ Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \ mmproj-BF16.gguf \ --local-dir models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF
Start the server:
LLAMA_SERVER=/Users/kylehowells/Developer/ML-Models/Gemma4/repos/llama.cpp/build/bin/llama-server $LLAMA_SERVER \ -m models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \ --mmproj models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/mmproj-BF16.gguf \ --spec-type draft-mtp \ --spec-draft-n-max 3 \ -ngl 999 \ -fa on \ -c 65536 \ --parallel 1 \ --host 127.0.0.1 \ --port 8081
Pi Config:
{ "providers": { "qwen36-local": { "name": "Qwen3.6 Local", "baseUrl": "http://127.0.0.1:8081/v1", "api": "openai-completions", "apiKey": "local", "authHeader": false, "compat": { "supportsDeveloperRole": false, "supportsReasoningEffort": false }, "models": [ { "id": "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf", "name": "Qwen3.6 35B-A3B Q4 + MTP", "reasoning": true, "input": ["text", "image"], "contextWindow": 65536, "maxTokens": 8192, "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 } } ] } } }