The thing I've found with MLX vs llama.cpp is that the memory efficiency story is much better on unified memory machines. With llama.cpp you're fighting the CPU/GPU split; with MLX it just uses the whole pool. Made a meaningful difference for us running 4B models alongside an Electron app.
Curious whether the Ollama MLX backend exposes any controls for cache management or whether it's abstracted away entirely. That's been one of the trickier parts of tuning for our use case.
> The use of NVFP4 results in a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while maintaining model accuracy with less than 1% degradation on key language modeling tasks for some models.
Lemonade or even llama.cpp are much better optimised and arguably just as easy to use.
Yeah, I can still save money by buying a cheaper device with less RAM and just paying my PPQ.AI or OpenRouter.com fees.
For example, right now a lot of work is being done on improving tool calling and agentic workflows; tool calling first started popping up for local LLMs around the end of 2023.
This is putting aside the standard benchmarks which get "benchmaxxed" by local LLMs and show impressive numbers, but when used with OpenCode rarely meet expectations. In theory Qwen3.5-397B-A17B should be nearly a Sonnet 4.6 model but it is not.
The 64GB model is 2240€ base and the 128GB is 3069€ base + all the stuff you need to add to make it an actual computer.
As a comparison the 64GB Mac Mini is 2499€ here and a 128GB Mac Studio is 4274€.
The lack of proper support for SSD offload (via mmap or otherwise) is really the worst part about this. There's no underlying reason why a 3B-active model shouldn't be able to run, however slowly, on a cheap 8GB MacBook Neo with active weights being streamed in from SSD and cached. (This seems to be in the works for GGML/GGUF as part of upgrading to newer upstream versions; no idea whether MLX inference can also support this easily.)
any plans for providing it through brew for easy installation?
Looks like they have pivoted completely over to Gemini, thank god.
There are some surprisingly decent models that happily fit even into a mere 16 gigs of RAM. The recent Qwen 3.5 9B model is pretty good, though it did trip all over itself to avoid telling me what happened on Tiananmen Square in 1989. (But then I tried something called "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive", which veers so hard the other way that it will happily write up a detailed plan for your upcoming invasion of Belgium, so I guess it all balances out?)
The devex is great and familiar to folks who have used Docker. Reading through the Lemonade documentation, it seems like a natural migration, but we're talking about two steps for getting started versus just one. So I'd need a reason to make that much of a change when I'm happy enough with Ollama.
That's not a surprising result given the pretty ambiguous query, hence all the thinking. Asking "write a simple hello world program in python3" results in a much faster response for me (m4 base w/ 24gb, using qwen3.6:9b).
ollama run $model "calculate fibonacci numbers in a one-line bash script" --verbose
Model                    Prompt eval (tok/s)   Eval (tok/s)
-----------------------------------------------------------
qwen3.5:35b-a3b-q4_K_M          6.6               30.0
qwen3.5:35b-a3b-nvfp4          13.2               66.5
qwen3.5:35b-a3b-int4           59.4               84.4
I can't comment on the quality differences (if any) between these three.

What I'm waiting for next is MLX-supported speech recognition directly from Ollama. I don’t understand why it should be a separate thing entirely.
Local LLMs don't make sense for most people compared to "cloud" services, even more so for coding.
cmd(){ local x c r a; while [[ $1 == -* ]]; do case $1 in -x)x=1;shift;; -c)c=1;shift;; *)break;; esac; done; r=$(apfel -q -s 'Output only a shell command.' "$*" | sed '/^```/d;/^#/d;s/^[[:space:]]*//;/^$/d' | head -1); [[ $r ]] || { echo "no command generated"; return 1; }; printf '\e[32m$\e[0m %s\n' "$r"; [[ $c ]] && printf %s "$r" | pbcopy && echo "(copied)"; [[ $x ]] && { printf 'Run? [y/N] '; read -r a; [[ $a == y ]] && eval "$r"; }; return 0; }
cmd find all swift files larger than 1MB
cmd -c show disk usage sorted by size
cmd -x what process is using port 3000
cmd list all git branches merged into main
cmd count lines of code by language
without calling home or downloading extra local models
and well, maybe one day their local models will get more powerful, "less afraid", and gain a much bigger context window.
Recently I built a graphRAG app with Qwen 3.5 4b for small tasks like classifying what type of question I am asking or the entity extraction process itself, as graphRAG depends on extracted triplets (entity1, relationship_to, entity2). I used Qwen 3.5 27b for actually answering my questions.
It works pretty well. I have to be a bit patient but that’s it. So in that particular use case, I would agree.
I used MLX and my M1 64GB device. I found that MLX definitely works faster when it comes to extracting entities and triplets in batches.
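As a sketch of the kind of prompt that drives the triplet-extraction step described above (the prompt wording, JSON shape, and model tag here are my own illustrative assumptions, not from any particular graphRAG framework):

```shell
# Build an entity/triplet extraction prompt for a small local model.
# The wording and the one-JSON-object-per-line output format are
# illustrative assumptions; tune both for your own notes.
build_triplet_prompt() {
  local text="$1"
  cat <<EOF
Extract (entity1, relationship_to, entity2) triplets from the text below.
Respond with one JSON object per line: {"e1": "...", "rel": "...", "e2": "..."}.

Text:
$text
EOF
}

# Usage (model tag is an example; any small instruct model works):
#   build_triplet_prompt "$(cat note.md)" | ollama run qwen3.5:4b
```

Keeping the prompt-building pure like this makes it easy to batch: you can generate prompts for every note up front and stream them through the small model.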
For example, last week I built a real-time voice AI running locally on iPhone 15.
One use case is for people learning speaking english. The STT is quite good and the small LLM is enough for basic conversation.
I have run LM Studio for a while, but I don't really use local models that much other than to mess about.
Qwen thinking likes to second-guess itself a LOT when faced with simple/vague prompts like that. (I'll answer it this way. Generating output. Wait, I'll answer it that way. Generating output. Wait, I'll answer it this way... lather, rinse, repeat.) I suppose this is their version of "super smart fancy thinking mode". Try something more complex instead.
Second, for the best performance on a Mac you want to use an MLX model.
(regarding mlx, there were toolkits built on mlx that supported qlora fine tuning and inference, but also produced a bunch of heat)
But a local model + good harness with a robust toolset will work for people more often than not.
The model itself doesn't need to know who was the president of Zambia in 1968, because it has a tool it can use to check it from Wikipedia.
ChatGPT free falls back to GPT-5.2 Mini after a few interactions.
Most users are fixing grammar/spelling, summarising/converting/rewriting text, creating funny icons, and looking up simple facts; all of this is far from requiring frontier-model performance.
I've a feeling that if/when Apple release their onboard LLM/Siri improvements that can call out if needed, the vast majority of people will be happy with what they get for free that's running on their phone.
I think what "need" you speak of is a bit of a colored statement.
I am using the model they recommended in the blog post - which I assumed was using MLX?
When you're not making network calls, you stop thinking in "loading states" and start thinking in "local state machines." The UX design space opens up completely. Interactions that felt too fast to justify a server round-trip are suddenly viable.
The backporting issue is painful though. I've been shipping features wrapped in #available(iOS 26, *) and the fallback UX is basically a different product. It forces you to essentially maintain two app experiences.
Still think this is the right direction — especially for junior devs just learning to ship. Fewer moving parts, less infrastructure to debug.

Today, we’re previewing the fastest way to run Ollama on Apple silicon, powered by MLX, Apple’s machine learning framework.
This unlocks new performance to accelerate your most demanding work on macOS:
Personal assistants like OpenClaw
Coding agents like Claude Code, OpenCode, or Codex
Ollama on Apple silicon is now built on top of Apple’s machine learning framework, MLX, to take advantage of its unified memory architecture.
This results in a large speedup of Ollama on all Apple Silicon devices. On Apple’s M5, M5 Pro and M5 Max chips, Ollama leverages the new GPU Neural Accelerators to accelerate both time to first token (TTFT) and generation speed (tokens per second).
Prefill performance: 1810 tokens/s (Ollama 0.19) vs 1154 tokens/s (Ollama 0.18)
Decode performance: 112 tokens/s (Ollama 0.19) vs 58 tokens/s (Ollama 0.18)
Testing was conducted on March 29, 2026, using Alibaba’s Qwen3.5-35B-A3B model quantized to `NVFP4` and Ollama’s previous implementation quantized to `Q4_K_M` using Ollama 0.18. Ollama 0.19 will see even higher performance (1851 token/s prefill and 134 token/s decode when running with `int4`).
Ollama now leverages NVIDIA’s NVFP4 format to maintain model accuracy while reducing memory bandwidth and storage requirements for inference workloads.
As more inference providers scale up using the NVFP4 format, Ollama users see the same results they would in a production environment.
It also enables Ollama to run models optimized with NVIDIA’s Model Optimizer. Other precisions will be made available based on the design and usage intent of Ollama’s research and hardware partners.
Ollama’s cache has been upgraded to make coding and agentic tasks more efficient.
Lower memory utilization: Ollama now reuses its cache across conversations, meaning lower memory use and more cache hits when branching conversations that share a system prompt, as with tools like Claude Code.
Intelligent checkpoints: Ollama will now store snapshots of its cache at intelligent locations in the prompt, resulting in less prompt processing and faster responses.
Smarter eviction: shared prefixes survive longer even when older branches are dropped.
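Some cache behavior can also be tuned from the outside, via the server's environment rather than per request. The variable names below are from Ollama's FAQ; whether the new MLX backend honors all of them is an assumption on my part:

```shell
# How long a model (and its cache) stays resident after the last request
export OLLAMA_KEEP_ALIVE=30m
# How many models may be loaded at once, and parallel requests per model
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=4
# Restart the server so the settings take effect:
#   ollama serve
```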
This preview release of Ollama accelerates the new Qwen3.5-35B-A3B model, with sampling parameters tuned for coding tasks.
Please make sure you have a Mac with more than 32GB of unified memory.
Claude Code:
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
OpenClaw:
ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4
Chat with the model:
ollama run qwen3.5:35b-a3b-coding-nvfp4
We are actively working to support future models. For users with custom models fine-tuned on supported architectures, we will introduce an easier way to import models into Ollama. In the meantime, we will expand the list of supported architectures.
Thank you to:
In part of it, one group tries to take control of a huge ship from another group, partly by trying to bypass all the cybersecurity. But in those far-future days, you don't interface with all the aeons of layers of command protocols anymore; you just query an AI who does it for you. So this group has a few tech guys who attempt the bypass by using the old command protocols directly (much like the iOS exploit that used a vulnerability in a PostScript font library from the 90s).
Imagine being used to LLM prompting + responses, and suddenly you have to deal with something like
sed '/^```/d;/^#/d;s/^[[:space:]]\*//;/^$/d' | head -1); [[ $r ]]
and generally obtuse terminal output and man pages. :)
(offtopic: name your variables, don't do local x c r a;. Readability is king, and a few hundred thousand years from now some poor Qeng Ho fellow might thank his lucky stars you did).
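In that spirit, here is a more readable equivalent of the helper's cleanup pipeline, with named pieces. The behavior is intended to match the one-liner above, and `apfel` is the same hypothetical CLI:

```shell
# Strip model chatter down to a single runnable command line:
# drop code fences, comment lines, leading whitespace, and blank
# lines, then keep only the first remaining line.
clean_command() {
  sed -e '/^```/d' -e '/^#/d' -e 's/^[[:space:]]*//' -e '/^$/d' | head -n 1
}

# Usage with the same hypothetical CLI as above:
#   raw="$(apfel -q -s 'Output only a shell command.' "$*")"
#   command_line="$(printf '%s\n' "$raw" | clean_command)"
```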
----
Full reaction:
Yes but perhaps not in a way you might expect. Qwen's reasoning ability isn't exactly groundbreaking. But it's good enough to weave a story, provided it has some solid facts or notes. GraphRAG is definitely a good way to get some good facts, provided your notes are valuable to you and/or contain some good facts.
So the added value is that you now have a super charged information retrieval system on your notes with an LLM that can stitch loose facts reasonably well together, like a librarian would. It's also very easy to see hallucinations, if you recognize your own writing well, which I do.
The second thing is that I have a hard time rereading all my notes. I write a lot of notes, and don't have the time to reread any of them. So oftentimes I forget my own advice. Now that I have a super charged information retrieval system on my notes, whenever I ask a question: the graphRAG + LLM search for the most relevant notes related to my question. I've found that 20% of what I wrote is incredibly useful and is stuff that I forgot.
And there are nuggets of wisdom in there that are quite nuanced. For me specifically, I've seen insights in how I relate to work that I should do more with. I'll probably forget most things again but I can reuse my system and at some point I'll remember what I actually need to remember. For example, one thing I read was that work doesn't feel like work for me if I get to dive in, zoom out, dive in, zoom out. Because in the way I work as a person: that means I'm always resting and always have energy for the task that I'm doing. Another thing that it got me to do was to reboot a small meditation practice by using implementation intentions (e.g. "if I wake up then I meditate for at least a brief amount of time").
What also helps is to have a bit of a back and forth with your notes and then copy/paste the whole conversation in Claude to see if Claude has anything in its training data that might give some extra insight. It could also be that it just helps with firing off 10 search queries and finds a blog post that is useful to the conversation that you've had with your local LLM.
could also be considered a triage layer
Maybe in the distant future when device compute capacity has increased by multiples and efficiency improvements have made smaller LLMs better.
The current data center buildouts are using GPU clusters and hybrid compute servers that are so much more powerful than anything you can run at home that they’re not in the same league. Even among the open models that you can run at home if you’re willing to spend $40K on hardware, the prefill and token generation speeds are so slow compared to SOTA served models that you really have to be dedicated to avoiding the cloud to run these.
We won’t be in a data center crunch forever. I would not be surprised if we have a period of data center oversupply after this rush to build out capacity.
However at the current rate of progress I don’t see local compute catching up to hosted models in quality and usability (speed) before data center capacity catches up to demand. This is coming from someone who spends more than is reasonable on local compute hardware.
But it's still the easiest and cleanest way to get decent local AI speeds on a non-Mac.
This might be the cost of privacy, and it might be worth paying, unless cloud models reach an inflection point that make local models archaic.
No. It runs on macOS but uses Metal instead of MLX.
They've usually been intended for ereader/off-grid/post-zombie-apocalypse situations but I'd guess someone is working on an llm friendly way to install it already.
It'd be interesting to know the tradeoffs. The Tiananmen Square example suggests why you'd maybe want the knowledge facts to come from a separate source.
> solves the problem of too much demand for inference
False, it creates consumer demand for inference chips, which will be badly utilised.
> also would use less electricity
What makes you think that? (MAYBE you can save power on cooling. But not if the data center is close to a natural heat sink)
> It's just a matter of getting the performance good enough.
The performance limitations are inherent to the limited compute and memory.
> Most users don't need frontier model performance.
What makes you think that?
I was looking for details about cars and it started interjecting how the safety would affect my children by name in a conversation where I never mention my children. I was asking details about Thunderbolt and modern Ryzen processors and a fresh Gemini chat brought in details about a completely unrelated project I work on. I’ve always thought local LLMs would be important, but whatever Google did in the past few weeks has made that even more clear.
I am wondering how people rave so much about local "small devices" LLM vs what codex or Claude code are capable of.
Sadly there is too much hype around local LLMs; they look great for 5-minute tests and that's it.
These models are dumber and slower than API SoTA models and always will be.
My time and sanity is much more expensive than insurance against any risk of sending my garbage code to companies worth hundreds of billions of dollars.
For most, using local models is a downgrade on multiple fronts: total cost of ownership, software maintenance, electricity bills, lost performance on the machine doing the inference, more hallucinations/bugs/lower-quality code, and slower iteration speed.
I would expect consumer inference ASIC chips will emerge when model developments start plateauing, and "baking" a highly capable and dense model to a chip makes economic sense.
Users really don’t matter at all. The revenue for AI companies will be B2B where the user is not the customer - including coding agents. Most people don’t even use computers as their primary “computing device” and most people are buying crappy low end Android phones - no I’m not saying all Android phones are crappy. But that’s what most people are buying with the average selling price of an Android phone being $300.
Worth adding that I had reasoning on for the Tiananmen question, so I could see the prep for the answer, and it had a pretty strong current of "This is a sensitive question to PRC authorities and I must not answer, or even hint at an answer". I'm not sure if a research tool would be sufficient to overcome that censorship, though I guess I'll find out!
Actual consumers not only don't care, they will not even be aware of the difference.
I have an M4 MacBook Air with 24 GB RAM and it doesn’t feel sufficient to run a substantial coding model (in addition to all my desktop apps). I’m thinking about upgrading to an M5 MacBook Pro with much more RAM, but I think the capabilities of cloud-hosted models will always run ahead of local models and it might never be that useful to do local inference. In the cloud you can run multiple models in parallel (e.g. to work on different problems in parallel) but locally you only have a fixed amount of memory bandwidth so running multiple model instances in parallel is slower.
[1] https://9to5mac.com/2026/03/03/apple-macbook-price-increase-...
It's not 100% offline, but there is a dramatic drop in token usage, as long as you can put up with the speed.
Mac Studio, Mac Mini, MacBook Pro, you can find even some used ones with enough RAM that will run models like Qwen reasonably well.
I'm using a M1 Max MacBook Pro and it runs Qwen 3.5 on Ollama (without MLX) at a decent speed.
You can use models like qwen3.5 running on local hardware in ollama and redirect Claude to use the local ollama API endpoint instead of Anthropic’s servers.
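A minimal sketch of that redirect, under some assumptions: `ANTHROPIC_BASE_URL` and `ANTHROPIC_MODEL` are Claude Code's own documented overrides, but whether your Ollama version can serve the Anthropic-style API directly, or needs a translation proxy in front, depends on the version you're running.

```shell
# Point Claude Code at a local Ollama endpoint instead of
# Anthropic's servers. The port is Ollama's default; the model tag
# is the one from the announcement above.
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_MODEL="qwen3.5:35b-a3b-coding-nvfp4"
# Then start Claude Code in this same shell:
#   claude
```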
'Dense' models of yesteryear (e.g. llama:70b, gemma2/3:27b) tend to be significantly slower by comparison; therefore, your hardware spends a lot more time 'maxed out' for a given prompt.
brew install Arthur-Ficial/tap/apfel

Why waste time with subpar AI?
What are you doing with these local models that run at x tokens/sec.
Do you have the equivalent of ChatGPT running entirely locally? What do you do with it? Why? I honestly don’t understand the point or use case.
The instruction to the AI was to create _a_ shell command. So it's a random shell command generator (maybe).
It needs to be just smart enough to use the tools and distill the responses into something usable. And one of the tools can be "ask claude/codex/gemini" so the local model itself doesn't actually need to do much.
Have you tried the free version of ChatGPT? It is positively appalling. It’s like GPT 3.5 but prompted to write three times as much as necessary to seem useful. I wonder how many people have embarrassed themselves, lost their jobs, and been critically misinformed. All easy with state-of-the-art models but seemingly a guarantee with the bottom sub-slop tier.
Is the average person just talking to it about their day or something?
I think the opposite is true. Local inference doesn't have to go over the wire and through a bunch of firewalls and what have you. The performance from just regular consumer hardware with local, smaller models is already decent. You're utilizing the hardware you already have.
> The performance limitations are inherent to the limited compute and memory.
When you plug in a local LLM and inference engine into an agent that is built around the assumption of using a cloud/frontier model then that's true.
But agents can be built around local assumptions and more specific workflows and problems. That also includes the model orchestration and model choice per task (or even tool).
The Jevons Paradox comes into play with using cloud models. But when you have less resources you are forced to move into more deterministic workflows. That includes tighter control over what the agent can do at any point in time, but also per project/session workflows where you generate intermediate programs/scripts instead of letting the agent just do what ever it wants.
I give you an example:
When you ask a cloud based agent to do something and it wants more information, it will often do a series of tool calls to gather what it thinks it needs before proceeding. Very often you can front load that part, by first writing a testable program that gathers most of the necessary information up front and only then moving into an agentic workflow.
This approach can produce a bunch of .json, .md files or it can move things into a structured database or you can use embeddings or what have you.
This can save you a lot of inference, make things more reusable and you don't need a model that is as capable if its context is already available and tailored to a specific task.
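A minimal sketch of that front-loading step for a code repo. The file names, the exact set of commands, and the choice of what to gather are all arbitrary illustrations, not a prescribed workflow:

```shell
# Gather repo facts once, up front, so the agent starts with
# tailored context instead of burning inference on discovery.
gather_context() {
  local out="${1:-context.md}"
  {
    echo "# Repo context"
    echo "## Directory layout"
    find . -maxdepth 2 -type d -not -path '*/.git*' | sort
    echo "## Largest Python files (lines)"
    find . -type f -name '*.py' -not -path '*/.git*' \
      -exec wc -l {} + 2>/dev/null | sort -rn | head -n 10
  } > "$out"
}
```

The resulting file is testable on its own, and can be handed to a much smaller model than would be needed if the agent had to discover all of this itself.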
There are so many underutilized CPUs, GPUs, RAM sticks and SSDs. I have some in my closet doing 5% load at peak times. Why would inference chips be special once they become commodity hardware?
The fact that today's and yesterday's models are quite capable of handling mundane tasks, and even companies behind frontier models are investing heavily in strategies to manage context instead of blindly plowing through problems with brute-force generalist models.
But let's flip this around: what on earth even suggests to you that most users need frontier models?
You can try asking it the same question as Claude and compare the answers. I can guarantee you that the ChatGPT answer won't fit on a single screen on a 32" 4k monitor.
Claude's will.
That doesn't fix the "you don't know what you don't know" problem which is huge with smaller models. A bigger model with more world knowledge really is a lot smarter in practice, though at a huge cost in efficiency.
Sure but you're paying per-token costs on the SoTA models that are roughly an order of magnitude higher than third-party inference on the locally available models. So when you account for per-token cost, the math skews the other way.
2. They aren't harvesting your data for government files or training purposes
3. They won't be altered overnight to push advertising or a political agenda
4. They won't have their pricing raised at will
5. They won't disappear as soon as their host wants you to switch
This is all on top of the (to me) insufferable tone of the non-thinking models, but that might well be how most users prefer to be talked to, and whether that's how these models should accordingly talk is a much more nuanced question.
Regardless of that, everybody deserves correct answers, even users on the free tier. If this makes the free tier uneconomical to serve for hours on end per user per day, then I'd much rather they limit the number of turns than dial down the quality like that.
"when hostapd initializes 80211 iface over nl80211, what attributes correspond to selected standard version like ax or be?"
It works fine, and avoids falling into the trap set by the misleading question. It probably works even better for more popular technologies. Yeah, it has higher failure rates, but that's not a dealbreaker for non-autonomous use cases.

I'll just say, if not a joke, the bit is appreciated either way!
"AI change to the home directory. Make it snappy!"
Picking a model that's juuust smart enough to know it doesn't know is the key.
None of them are as good as the big hosted models, but you might be surprised at how capable they are. I like running things locally when I can, and I also like not worrying about accidentally burning through tokens.
I think the future is multiple locally run models that call out to hosted models when necessary. I can imagine every device coming with a base model and using LoRAs to learn about the user's needs, with companies and maybe even households having their own shared models that do heavier lifting, while companies like OpenAI and Anthropic continue to host the most powerful and expensive options.
MLX is faster because it has better integration with Apple hardware. On the other hand GGUF is a far more popular format so there will be more programs and model variety.
So it's kinda like having a very specific diet that you swear is better for you, but you can only order food from a few restaurants.
Having access to a model that is drawing from good sources and takes time to think instead of hallucinating a response is important in many domains of life.
I could be wrong because I'm not following this too closely, but the open weights future of both Llama and Qwen looks tenuous to me. Yes, there are others, but I don't understand the business model.
Cost is a pretty big reason.
Getting the local weather using a free API like met.no is a good first tool to use.
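A sketch of the request that such a tool would make. The endpoint path is met.no's documented compact location forecast; note that met.no requires an identifying User-Agent header, so set one when you actually call it:

```shell
# Build the met.no locationforecast URL for a given lat/lon.
weather_url() {
  printf 'https://api.met.no/weatherapi/locationforecast/2.0/compact?lat=%s&lon=%s' "$1" "$2"
}

# Usage (network call; the JSON result is what the model summarizes):
#   curl -s -H 'User-Agent: my-local-agent/0.1' "$(weather_url 59.91 10.75)"
```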
I worked for a research focused AI startup that had a strict "no external LLM" policy for code touching our core research.
You're right that the average consumer doesn't care about privacy, but there are many, many users who do. The average consumer also doesn't have a desktop with a GPU or a high-end Mac Studio, but that doesn't mean there aren't many people working with AI who do have these things.
If we continue to see improvements in running local models, and RAM prices continue to fall as they have in the last month, then suddenly you don't have to worry about token counts any more and can be much more trusting of your agents since they are fully under your control.
No? What? Oh, you can't?
Neither can consumers. Most consumers are very aware of the lack of privacy, the manipulation, and have very cynical feelings about Facebook and similar companies. But it's where their friends and family are.
For most people the web is a mine field maze where basic things they want are compromised everywhere. And they are routinely creeped out by ads that reveal they know them far too personally.
You are mistaking network capture for preference.
Another telling example. Lots of privacy valuing technical people, who would never have a Facebook account, send unencrypted text emails.
It is network capture, not preference.
Oh, and wait till ad companies start selling your healthcare data; then you will see how fast things turn 'given a choice'.
Is there already some research or experimentation done into this area?
The world is not moving back to on prem.
They are choosing to give Facebook info.
Every company has dozens of SaaS products that store their business critical information. Amazon installs Office on each computer, Slack (they were moving away from Chime when I left), and the sales department uses SalesForce - SA’s and Professional Services (former employee).
The addressable market of even companies that care about privacy is not large. How long will it be before computers become cheap enough that they can run even GPT-4-level LLMs, and companies will give them to all of their developers?
Yeah but if they can rake in 100x as much by making products for people who don't care about privacy, then why spend time developing stuff for people who care?
There is still a small market left, of course, but that market will not have the billions of R&D behind it.
Meaning: these 5000 tokens consume tiny amounts of energy being moved around from the data center to your PC, but enormous amounts of energy to generate in the first place. An equivalent webpage with the same amount of text as these tokens would be perceived as instant in any network configuration: just some kilobytes of text, much smaller than most background graphics. The two things can't be compared at all.
However, just last week there have been huge improvements on the hardware required to run some particular models, thanks to some very clever quantisation. This lowers the memory required 6x in our home hardware, which is great.
In the end, we spent more energy playing videogames during the last two decades, than all this AI craze, and it was never a problem. We surely can run models locally, and heat our homes in winter.
As someone who has hardware in that price range and plays with local LLMs: The gap between Opus or GPT and the local models is still very large for work beyond simple queries.
Self-hosted also starts making my office hot due to all of the power consumption when I use it for anything more than short queries. If you haven't heard your Mac's fans spin up much yet, running local LLMs will get you acquainted with the sound of their cooling systems at full blast.
Lol, you should tell my customers (that are moving back on prem) that!
You should also tell Microsoft, who just yesterday said they are going back to focusing on local apps.
Yes, they do. That is exactly the phenomenon my comment addressed.
But the way you wrote that implies an improbable motivation or choice framing.
Perhaps their real motive/choice is to share with other people on the site.
It is called a network effect.
If (1) Facebook had been the surveillance/manipulation capital of the world from inception, (2) an equally inviting privacy protecting site took off at the same time, and (3) everyone chose Facebook over E2EE anyway, then sure, we could throw up our hands! Those silly users!
The term I have for when people discuss choices involving many-dimensional criteria, as if the choice involved just one or two selected dimensions, is "dimension blindness". It happens in a lot of heated discussions about phone choices too.
People have said this since Pytorch was published and it's not any more true now than it was 10 years ago.
They are. The majority aren't doing inference on a Mac Mini, but instead using it as a local host for cloud-based inference. You could have the same general experience on a $200 Chromebook or $300 Windows box.
They aren’t buying high end $2000+ Mac Minis.
This is clearly true. There is an implied point here but I am not sure what.
They share in their profile what they want other people to see. And often choose to not fill out everything. Nobody signs up to share with Meta, Inc.
Most people would love a "[ ] Do not share with Facebook".
People choosing an imperfect option, from imperfect options, are not demonstrating evidence they don't care about the imperfections.
What, would be helpful to make clearer?
--
Stated one way:
Precisely because Facebook does not offer users any choice to separate sharing with peers from sharing with Facebook Inc., a user's desire to share with peers on Facebook is not evidence that they are happy for Facebook Inc. to use their information.
Stated another way:
Person U(ser) doesn't like person F(Facebook), but wants to be around person P(eer) and all their mutual friends.
When F has a party, P is always there with their friends. U showing up isn't evidence they wanted to interact with F. Even though they are choosing to go somewhere they have to.