Qwen 3.6 27B is the sweet spot for local development

I love my MacBook Pro M5 128GB RAM and I love qwen3.6.

BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise.

Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents.

If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk. Connect to it over LAN or Tailscale. The MacMini will also cost you almost 1/3 of the MacBook Pro.

Thank me later.

The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]

Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.

[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...

None of the examples reflect 'real work', at least not what I'd consider real work. Being able to nail a zero-shot greenfield project is relatively easy even for a small model. There's not much context to build up and it can fall back to similar examples in the training data easily. So long as you're not asking it to invent something wholly new it'll probably manage.

The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.

I feel like I'm going insane seeing people buy these 128gb MBP for thousands of dollars to run models that are objectively much worse than SOTA and spending so much more. The amount spent on a 128gb M5 MAX can buy you a damned new car here. What the hell am I missing? Are developers in other countries living in such different worlds?

(I'm aware the price is, in absolute terms, more expensive where I live compared to the USA. That reinforces what I think, because anyone sane that would've bought one of those in another country would sell them as soon as they landed here and save that money.)

I see a lot of people writing about how expensive the hardware to run these local models is - but see no mentions of the Intel Arc Pro B50/B60/B70 which seem like decent value if you're not interested in Apple kit (as much as anything can be decent value in the current status quo).

I just got a B70 with 32GB RAM for the equivalent of $1200 (incl. sales tax and import duties to my non-US location, so presumably it could be cheaper elsewhere). The memory bandwidth is 608 GB/s. For M5 Max (32-core GPU) it's 460 GB/s and for M5 Max (40-core GPU) it's 614 GB/s. A 3090 is still faster at ~900 GB/s but you're getting 32GB VRAM for a lot less than equivalent Nvidia cards. It's about 1/3 the bandwidth of a 5090 for 1/3 the cost, but with the same 32GB VRAM. If you're interested in being able to run bigger quants with some context and stay on a lower budget then it's an appealing trade off.

I'm still exploring using these local models so don't want to spend the equivalent of $5 000 - $10 000 just to test it out. I don't mind slightly slower perf to do some experimentation more affordably.

I actually got an B50 16GB (with meager 70w TDP!) first to test an Intel card with my stack - it worked easily with Ubuntu & Vulkan. I'd read a lot about hassles and people writing them off as unusable but it seems like these are often with SYCL which doesn't even seem to outperform vulkan and so why bother? (The B50 was just $370 inclusive tax and duties). Literally `apt install` the vulkan libraries and it worked with default xe driver in 26.04 and the vulkan build of llama.cpp. The SR-IOV PF/VF also just works with qemu/kvm, no tricks required. Since I got it fwupdmgr has updated the firmware twice so Intel is presumably actually trying to support these products.

Which one is actually better between Qwen and DeepSeek, and which one costs less?

It's fun to run a model locally, but I don't think the economics make sense for anyone just trying to use models atm. It's absurdly cheap to use the same model via openrouter in comparison.

Seriously, just put $10 into openrouter and play with models that are cheap but bigger than what you'd reasonably be able to run locally like deepseek v4 flash (unquantized). You'll be surprised by how far that $10 goes for a model better than what you'd be able to run. Even further on the model you would be able to run locally. Then think of how many long it would take to match the cost of spend + power on doing it locally...

Before you run and go purchase a unified memory computer (e.g., DGX Spark, Mac, Ryzen AI Max 395 / Strix Halo), be aware dense models generally run slow on these machines. Dedicated GPUs run dense models significantly better. Look for benchmarks for your prospective machine. If you really want one of these, you'll be better off running Qwen 3.6 35B or another sparse MoE model.

I'm having a decently good time time with `qwen3.6-35b-a3b-mtp` (unsloth's multi-token prediction version) and and `qwen-agentworld-35b-a3b`.

On a 2021 M1 Pro (32GB RAM) I can get either of them as `IQ4_NL` quantized models (the first with reduced context, around 160k; the second can do the whole 264k with RAM left over), running something like 30tokens/s.

On a Framework 13 AMD AI HX370 it can use the same, but both on Q8_0 quantization, full context window, parallelism. Speed is just ~15tokens/s so slower, but definitely smarter than the lower quantized siblings.

Both of them are good developer partners for an engineer who wants more of a second pair of eyes and a rubber duck, rather than a model to just do everything for them. Pretty good for my brain dumping, some commit reviews, sanity checks, just always assume that every claim has to be checked and re-checked.

The only problem is really the context loading, that's pretty slow (starts off around 300token/s on empty context, by the time we get to something like 70-80k which is just a bit of repo discovery, it can run around 80 prompt token/s or less, so there's always a lot more waiting around. Local tools need to bump all of their timeouts, and have to be mindful that there's unlikely to be really meaningful parallelism on these machines with local models.

I'm still figuring out how to approach these things, though. Definitely better than glorified autocomplete or search tool (and too slow for the former, pretty decent for the latter). Their limited skill and performance make it more in line with other tools like my IDE or editors, that they are still in the "tools" compartment of my thinking, rather than "independent, cognitively active entities". Which feels like a good thing.

FWIW I'm running gemma4 31b on my 5090 and it's pretty great as well.

QAT, MTP, 128k context.

I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.

> ... on my Macbook Max M5 128 GB

Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?

Considering the cloud version, all three models compared in the article (Qwen 3.6 35BA3b, 3.6 27B and DeepSeek V4 Flash), have very similar performance[0], BUT on cloud, for some reason DeepSeek V4 Flash is 10-20x cheaper than the Qwen models.

If Qwen models are so much easier to run, why are the providers charging more than V4 Flash?

[0]: https://aibenchy.com/compare/qwen-qwen3-6-35b-a3b-medium/qwe... <-- compare how the three models draw hamsters svgs, lol

I have been running qwen 3.6 35b a3b with opencode on my macbook pro 16" with m3 max and 64gb ram, and it's been great for local planning and coding. To be honest I have been on and off wishing I had future proofed with the 128gb after seeing how powerful 64gb is. On the other hand, I also haven't run up against a wall with a model that is just slightly larger than qwen.

We have have had the same experience (qwen3.6 rocks) when we are evaluating local models for our developers in the Norwegian Government https://github.com/navikt/mlx-workspace

I think the sweet spot right now is 2x 3090s and a pcie 4 motherboard with 64-128 gb of ddr4 ram, you can build this right now for $3k and it runs qwen 27b/35b stupid fast at int4.

My partner has been trying various models on our server but we haven't gotten anything to run at a usable speed. Q30H engineering sample (Xeon 8570) with two cpus, 56 cores per CPU, 768GB DDR5 RAM running at 5600MHz, two old 3090s in it at the moment with an NVLink and we could put our third in there. We built this server before the prices skyrocketed because we happened across some Tyan boards on Woot that were absurdly cheap for what they are (the motherboards should be $1000+ but we got them for a few hundred).

This thing sounds like it should be a monster but we keep running into issues of the old GPU architecture, lack of support for AMX or AMX not being as big of a help as you'd hope when it does work, etc. Apparently we only got 5 tokens per second trying to set up Qwen 3.6 27B, and a similarly bad result trying to run GLM 5.2 which fits in memory but the custom kernels we had to try to contrive were too slow. I feel like this system should have tons of potential, especially if something was designed to let the AMX and huge system memory shine.

Does anyone have any suggestions? This thing was fun to set up and it's really cool but it's been a bit disappointing not getting any big tangible results so far.

We have a similar system on a single-cpu Tyan board with 256GB RAM that I'm hoping we might be able to use in conjunction with the first one if EXO ever gets good Linux support for GPU/RDMA over InfiniBand.

Dual AMD Radeon AI Pro 9700s (600 watts total 64GB of vram) runs Qwen 3.6 27B at FP8 with mtp on vLLM at 50ish TPS for decode. Cards cost $1300 a piece. Enough KV cache to fully max out two concurrent sessions.

It was super rough going to get started with them back in January, but right now the cards purrrr and I haven't even tried tuning yet. You need to use a patched vLLM image with aiter but besides that things are finally working on the ROCm front.

I have been having pretty good success with Qwen 3.5 9B for "nontrivial but not challenging work all things considered" -- it runs great on my 24gb unified memory m4 pro MacBook Pro. What do the baseline specs look like Mac-wise for getting this model to run? Am I looking at a 96gb? 128? 256?

Running 27B dense model on M5 128GB is ok, but one can do better.

On M5 128GB one can make use of the ram and use sparse MoE. For example, DeepSeek-V4-Flash will fit, served by DwarfStar (https://github.com/antirez/ds4). One will probably improve 2x the token/sec speed, given DS4F 13B activated params in the MoE are ~1/2 of the ~27B of the dense Qwen.

27B Of the Qwen fit even on a cheaper 24GB card, e.g. amd 7900xtx (<$1K?) or slightly dearer nvidia 3090 (with cuda). With ~900 GB/s bandwidth they will likely be ~50% faster than the M5 with 600 GB/s.

I don't understand the talk about how expensive the hardware is. These models can run on very old or old and low end. I've been running Qwen3.6-35B Q4 on an old 1080 GPU(8GB vram) with 32GB sys RAM. I have a i7-12700.

It does about 30 tok/s which is enough for me. It's about half what the online models do, but it's enough.

I've heard their 9B models are also good, but they aren't much faster if you have the ram and a nice cpu.

These qwen3.6 models are the first ones I find can do much. GPT OSS was good, and Gemma4 is better. Gemma knows more facts, but qwen3.6 is smarter.

My personal experience below:

I ran into some small problems with codex during setup and, for a few reasons, did not want to set up a cli shell with them at the time. Since I was not doing anything really serious, but just exploring a half-baked idea for an android app, I ran qwen in lms and connected it to android studio.

None of the mini projects that I have attempted ( more granular call control, silly html scrolling game, music play app ) were one shots despite carefully preparing the prompt ahead of time. Admittedly, some of it may have something to do with android studio, but I did not try it with google account yet. All took between an hour to four to generate ( prep, initial run, testing, iteration and so on ).

If it helps, miniforum AI MAX 395. I am not saying it is bad. Quite the opposite, but you want to be aware of the limitations though and plan around those.

Call me back when you can run these models on 16GB of RAM and any recent i5/i7. Until then, there’s no point on using these toy models.

> What it does:

> --jinja for tool calling support

Pretty sure this flag hasn't done anything for a while. It's enabled by default since ~November of last year

And AI companies will continue to buy up all the silicon to make this prohibitively expensive to run at home.

I've been working with local models for the past year. There's so many possibilities, but I don't think coding is one. Coding requires so many layers beyond inference; I spent so much time trying to replicate what Claude Code does end to end locally. Understanding all the layers and keeping up with the advancements feels like a slog. Even this article messes up and misunderstands what some of the settings are doing. Qwen in particular seems to work at first, then often gets stuck in thought loops when used for actual work.

However, text-to-speech, speech-to-text, and non-code LLM use cases are so useful to have local, and don't require big hardware.

Having a universal reliable inference engine interface, I think, is the big unlock that needs to happen before app devs can ship these features.

Personal concrete use case: meeting recording app. This uses Parakeet + Qwen to create local transcriptions and post-cleanup, respectively.

Right now this app has to download and manage all these models, then bundle an inference engine to run them. It's a lot of code that probably should belong to the OS, or at least a standard interface.

While apps can offload some of this to llama.cpp or a similar process over http, that's another set of setup for the user to do before they can have a useful app.

Anyway, if you're getting started on a Mac, I'd suggest trying out oMLX (https://github.com/jundot/omlx) before messing with llama.cpp. In particular they have community benchmarks so you can see what kind of performance you're likely to get: https://omlx.ai/benchmarks. I wished each one had more configuration details though.

Since no one else posted it... I have open-webui pointed at a linux box with 128 gig of ram and an RTX Pro 6000, and after a couple of runs on trivia, had it do one of Open WebUI's conversation starters: "Show me a code snippet of a website's sticky header in CSS and JavaScript."

72.06 t/s. That's the full Qwen 3.6 27B model BF16, using MTP, running on Ollama. Yes I know I should bite the bullet and get vllm running on that box.

That was, also, at a 570 watt limit: I normally run a little less, but when I first tried this I actually forgot I had set the limit to 300 (it's a hot day, I figured why fight the A/C?), and at 300 watts the same question came back at 69.38 t/s. (The extra power matters more for compute bound things, the difference in generating LTX2.3 videos is considerably higher... but still not linear.)

I have a fairly beefy M4/48G but I haven't been able to get any local model to behave anywhere near satisfactorily.

I'd also look at the qwopus distil if you're using qwen 3.6 27b. It's a nice refinement of the current 27b with slightly better stats.

Jackrong has a few different ones available depending on what you're trying to do: https://huggingface.co/Jackrong

I've worked extensively with the slightly less able cousin, the 35B A3B model and tuned my own harness around making it work well with local or non-sota models. The results are quite promising [0], if one sticks to a plan-execute approach. After a bit of fiddling with llama.cpp I was able to get it to work through a small change on a real codebase from work on a 32GB M5 (typical python FastAPI backend, so nothing out of the ordinary). While that's somewhat encouraging, the whole local experience was still far from pleasant with all the noise and heat.

[0] https://deepclause.substack.com/p/how-to-make-small-models-p...

i have been trying several open source models for the last few years. running qwen 3.6 27b on my 4090 is the first local llm i have used that made me start to second question if anthropic and openai are actually worth the (already) insane valuations.

don't get me wrong, the frontier models are leaps and bounds ahead of what qwen/kimikgemma are doing - but i don't need to drive a ferrari to the grocery store everytime either.

Running LLMs locally for development doesn’t make sense to me. The hardware gets outdated in just a few years. Even hyperscalers replace their GPUs faster than they can buy them, plus the cost of running it locally, isn’t cheap. the cost saving just ain't there.

Something I find really confusing from this post is the MLX versions of the model running much slower. As I understand it, these model versions are meant to take advantage of Apple Silicon and MacOS APIs, and should produce better/faster results. Any insight into what’s happening here?

On dual rtx3090 it runs at 140tok/s with a short prompt... Not bad.

Qwen 3.6 dense runs at 40tok/s

A lot of replies here are about Mac devices and their support for these 27B models. I own a MacBook but use a Lenovo Thinkstation PGX to run my models. It has a gb10 Blackwell gpu and 128gb unified memory. You can connect multiple ones.

Is there any hope for people that cant even run 27B parameters, Qwen3.6 or otherwise? Are there any quantized models that do well with tool calling at smaller parameter sizes?

I do not have a crazy rig, a modest gaming one at that, but in trying to understand more about agents and their capabilities, I am SOL with my 16 GB of RAM and 8GB of VRAM. I can get most small, non tool calling models to perform well, but I've had major issues with anything over 9B doing anything more than reasoning (egregiously slow at higher parameter counts).

And so far, I cant get even Pi to extend itself or do any meaningful work with any of the models I currently can get to run.

What model fits on 36GB RAM mac?

We need machines designed around wide memory + sustained inference thermals, not gaming/creator chassis we're borrowing. Until then "local dev" means clamshell + external fans.

I think things are moving fast, tested that new vibethink-3B, works on many small tasks/fast, and playing with ornith-35B with a draft vibethinker-3b as a draft gave me some good speed/results.

Was just trying to see how small I could go and get acceptable results, but yeah, larger Qwen 3.6 with MTP is going to be better. Cant wait to see how AI model (unsloth/local-llm/heretic/reaper/etc communities) are tweaking/engineering quality down into smaller models. Lots of new things coming out.

When is Amazon Bedrock going to get these newer models?

Offloading compute to them is much easier, except its still a limited set of open models. Most companies are already running in AWS, so it's an easy add, models run in a trusted location, just another line item on the Amazon bill. You don't have to talk anyone into signing up with a new vendor. Plus you don't have to worry about local hardware at all.

This is probably the first small model I got through some simple web game tests without having to reset the context. It tends to opt to overwrite an entire file instead of doing edits... which editing is where most of these small models fall apart along with getting stuck in repeating loops. Only 24k tokens in so far, it did some decent newbie work.

Has anyone considered a home server? Assuming mobility is not important if we pick components to match a similar hardware would it be more value for money?

I can come close to agreeing because queen-3.6-27b is my second favorite for local coding. I am using gemma4:26b-a4b-it-qat-48k (the "-48k" is from my modifying a model run with Ollama to always use a 48K context size). On a 32G Mac I use gemma4:26b-a4b-it-qat-48k and OpenCode and on my 16G MacBook Air I use gemma4:12b-it-qat-16k ("-16k" is my resizing context size) and little-coder. I break up projects into small libraries because local coding works better for me using small code bases.

I find that for local coding, I need to spend a lot of time building concise SKILLs for specific things I work on and try to only enable one or two skills per coding session.

To the author of the linked article nice job, and if you feel like adding to it, please add details on your setup.

What do folks use to keep on top of new model releases that are appropriate to their system? i.e. the models that will actually work on the MacBook Pro with 48GB of RAM or whatever their specs are.

I've seen sites here and there but they feel like quick little toys that don't get updated, so they always suggest old models.

I've been using it with a couple of tools (like context7) as a documentation/helper, without giving it direct access to writing code, in marimo. it works great, albeit a little slow on my server (m1 max 64gb ram), at 8bit with omlx

Qwen3.6 was the first model I ran locally that seemed smart, but qwen3-coder:30b is way, way more responsive and more accurate for writing code according to my tests, including human-eval. If you can run one than you can almost certainly run the other. If you haven't tried qwen3-coder I would definitely recommend it. It feels positively snappy compared to every other local model I've tried. All you need is 32G VRAM and some heat dissipation.

Been running it on a 9950x3D with 96GB and a 4090. Speedwise it is fine. Bit while not completely useless, for software development it is unsurprisingly a dramatic downgrade from the Opus I use as my daily driver.

why does everyone imply you need a $10k laptop which then starts burning when you run Qwen 3.6? Get any other system with enough VRAM for a third of the price. Framework Desktop (Strix Halo 128GB) still costs under 4k nowadays, is nearly silent even on 100% GPU + CPU. (also it gets only slightly 'warm', but with a desktop you don't care anyway, I guess).

I was interested to see that Qwen3.5-122B-A10B narrowly beat Qwen3.6-27B on Donato Capitella's SWEBench-verified-mini run with a similar 128GB UMA architecture.

https://pi-local-coding-bench.dev

The open source models have gotten heavily conflated with local development. While that is cool and I'm excited about the future of local LLMs, it is not necessary to play around with these models. Without shilling for companies I don't have a relationship with, there are a number of companies who will give you an API just like Anthropic/OpenAI and you pay per token, albeit much cheaper than the frontier labs.

I've been using the full GLM 5.2 model this way (through opencode) at work for the past week. It's quite impressive.

27-30B in general seems to be the level where you actually start having decent models. I just wish consumer hardware hadn't stagnated so much that we can't easily go higher than that, and that even running those requires a $5k machine now.

Which one is actually better between Qwen and DeepSeek, and which one costs less?

We have have had the same experience (qwen3.6 rocks) when we are evaluating local models for our developers in the Norwegian Government https://github.com/navikt/mlx-workspace

My personal experience below:

If it helps, miniforum AI MAX 395. I am not saying it is bad. Quite the opposite, but you want to be aware of the limitations though and plan around those.

> What it does:

> --jinja for tool calling support

Pretty sure this flag hasn't done anything for a while. It's enabled by default since ~November of last year

72.06 t/s. That's the full Qwen 3.6 27B model BF16, using MTP, running on Ollama. Yes I know I should bite the bullet and get vllm running on that box.

I have a fairly beefy M4/48G but I haven't been able to get any local model to behave anywhere near satisfactorily.

I'd also look at the qwopus distil if you're using qwen 3.6 27b. It's a nice refinement of the current 27b with slightly better stats.

Jackrong has a few different ones available depending on what you're trying to do: https://huggingface.co/Jackrong

don't get me wrong, the frontier models are leaps and bounds ahead of what qwen/kimikgemma are doing - but i don't need to drive a ferrari to the grocery store everytime either.

On dual rtx3090 it runs at 140tok/s with a short prompt... Not bad.

Qwen 3.6 dense runs at 40tok/s

What model fits on 36GB RAM mac?

We need machines designed around wide memory + sustained inference thermals, not gaming/creator chassis we're borrowing. Until then "local dev" means clamshell + external fans.

I think things are moving fast, tested that new vibethink-3B, works on many small tasks/fast, and playing with ornith-35B with a draft vibethinker-3b as a draft gave me some good speed/results.

When is Amazon Bedrock going to get these newer models?

What do folks use to keep on top of new model releases that are appropriate to their system? i.e. the models that will actually work on the MacBook Pro with 48GB of RAM or whatever their specs are.

I've seen sites here and there but they feel like quick little toys that don't get updated, so they always suggest old models.

I love my MacBook Pro M5 128GB RAM and I love qwen3.6.

BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise.

Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents.

Thank me later.

The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]

Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.

[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...

> Being able to nail a zero-shot greenfield project is relatively easy even for a small model

Not really germane to your comment but I hope I don’t sound old when I say I remember a time when spinning up a PoC was a week of work, and a statement like yours was pure science fiction.

> and it can fall back to similar examples in the training data easily.

This is an underrated consideration when evaluating the small models: The further you deviate from standard example code, the more their weaknesses show.

My experience is that Qwen3.6 produced some amazing results for a small model when I tried it with simple apps that are widely reproduced everywhere. If you want a React TODO app or to set up a little boilerplate app with shadcn and other popular tools, it will produce something that looks not too bad.

Then when I started straying outside of common tasks and into some of my more niche work, it would spin for hours and go in circles before finally producing some groan-inducing output that wasn't usable.

If you're looking for a model to help with simple refactoring or small tasks where you provide very explicit instructions for exactly what you want, but you don't want to do all of the typing yourself, they can do a lot of good work, though. But you're right that once you get into long context sessions involving topics off the beaten path, the weaknesses are very apparent.

The quantizations that are popular for making these models fit on smaller hardware make the problems worse. When you read it about online there is almost a consensus that 4-bit quants are lossless and that you can use q8_0/q8_0 kv cache quantization without any real loss, but in my experience with real projects there's a substantial degradation in long context performance with any of these quants.

I have been using pi (and previously the codex cli) with Qwen 3.6 27b with 100k context for my development at work, and I have been very blown away by how well it works. It's not perfect, but it's enough to accelerate my normal development flow. I mostly use it for writing Go and C#.

In my experience, even with basic project concepts the small models struggle to spin up greenfield stuff. There's just too many decisions to be made and they're not good at that.

Modifying existing code is way easier if you don't expect it to be smart about it. Don't say "add X feature" and let it explore the codebase and build its own understanding. Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines". Now you've done the hardest part of making the decisions and it just has to follow instructions while coloring within the lines.

There are several general types of tasks that a Gemma 4 12B class model works for me, including: 1) design a large project composed of small libraries that can be coded and tested in isolation. 2) clean up old coding projects: add README files, comment code, show an example of using a new API and have it update API use, etc.

All small-scale stuff. For large integrated projects I am finding DeepSeek v4 Pro commercial API to be very inexpensive and helps me produce good results.

Exactly. If the repo has all of the knowledge living inside of it that window fills up fast, even when using something like codegraph.

I don't use local models but have you tried augmenting the model with code intelligence MCPs like https://github.com/DeusData/codebase-memory-mcp ?

> In my limited experiments Qwen 3.5 (maybe 3.6 is loads better)

1. Maybe you should tell us what those limited experiments are.

2. Maybe you should actually try 3.6 because it's huge difference in most cases. Don't forget to tell us quants and don't forget to tell us scope.

3. Maybe actually show us data compared to frontier models instead of this... vibe comment. Pretty tired of this kind of comments on HN that doesn't require logic or evidence. Just vibes. Like the pelican riding a bicycle crap that everyone has taken for granted but has no objective way of assessing goodness.

Well, I can tell you how my thinking goes: 1) I don't buy my computer just to run LLMs and there are many scenarios where I benefit from both a decent GPU and from a large amount of RAM, 2) I run a solo-founder business which owns exactly one computer in the entire company so it might as well be a good one, and 3) I don't need a new car, so comparing pricing this way is irrelevant.

In other words, yes, buying this kind of machine only to run an LLM locally doesn't make sense, because local LLMs generally still suck for serious programming work (they work great for spam filtering though!). But more generally this machine makes sense for a lot of people.

I also don't understand why people in this price bracket are buying Mac laptops instead of desktop computers with GPUs? Just to flex that it's portable?

I think it's silly to go for a laptop form factor. Last fall I put together a workstation with two second-hand 3090s in it (paid $850CDN each, now the best I can find is $1200). With 48GB VRAM it's reasonable - and I've been using Qwen 3.6 27B for various tasks around building KGs from text corpora / reasoning about them.

I've ran comparisons against everything that's available on OpenRouter (well, as of few weeks ago), and for $0/tok, the local 27B Qwen can't be beat. Sure, it's slower, and yeah, the office is a few degrees warmer than it ought to be -- but nobody can pull the plug, nobody is watching over my shoulder, and the results are on par with SOTA.

Can't wait for a similarly sized Qwen 3.7 - from what I've seen so far, it's a leap ahead of the previous version.

If your workflow benefits from the speed it quickly pays for itself when factoring in developer salaries here in the US. I recently switched companies and they bought me an M5 Max 128GB as my dev machine.

Builds and local test runs are 3 times faster than the Windows laptop option. The machine will pay for itself just based on that within 3 months. I can spin up a local kubernetes cluster and do full integration tests while I am working on other things as well.

It isn’t a strictly Mac vs Windows thing though. It looks like the culprit is the MDM software on the Windows machines is just crazy slow and constantly getting in the way.

If I was paid less it would definitely make less sense for the company to pay for this machine.

It’s an asset on my balance sheet that’s already appreciating nicely and will likely be resale-able for what I paid for it for the next 7-10 years. I am on an Apple monthly installment plan so $5k is $416/month for 1 year, no interest. I’m able to run DS4 scale models and other open models without quantization, often multiple at once.

Imagine its value if war broke out over Taiwan / Greater China, or really any of the dark scenarios with global connectivity or the truthiness of commercially available models. It is a very, very difficult piece of equipment to make at any other moment in history. I wish I could have purchased more. I saw the signs and price trends and out of stocks as they unfolded. No doubt others with the means are stockpiling.

> Are developers in other countries living in such different worlds?

Yes. Your people earn an order of magnitude less income than Americans.

Yes they are, 6k is peanuts to a lot of people.

It's not always about the price or being the cheapest. For me, it's about freedom, both to play and from the govt/corp censorship.

> Are developers in other countries living in such different worlds?

Yes. Back in the my days at $faang in europe it was not uncommon to hear people getting 120-160 k€/year in compensation and we were “poor” compared to us engineers at the same faang (4-500 k$/year total compensation) with a bit of seniority…

A 3090 never came with 32gb of vram

I try to always mention that AMD ROCm has come a long way. Like the B70 the Radeon AI Pro 9700 has 32GB of DDR6 640GB/s. Also $1300 a card. Very capable cards now in mid 2026. Great for dense models in the 30B range. I'd go strix halo or DGX spark if you want to run the 120B range of MOE models.

I got B70 few days ago. Running on CachyOS. 9070XT on PCIe x16 and B70 on the x4.

ROCm nightly was pretty easy to setup and get up running. The 9070XT has been a decent card for my use cases.

But the SYCL ecosystem versions. Absolutely horrendous and everything is hundred commits behind. Vulkan is probably the only way forward with this card.

Interesting that Intels latest consumer GPUs only have 10 and 12GB respectively for the B570 and B580.

It's fun to run a model locally, but I don't think the economics make sense for anyone just trying to use models atm. It's absurdly cheap to use the same model via openrouter in comparison.

Even with deepseek v4 flash I burned though $5 in credits in a day just playing around with Hermes, and qwen 3.6 35B is significantly more expensive.

I can run qwen 3.6 35B on my gaming PC at around 50 tok/s and other than power cost of a tiny bit extra per month, it's hardware I already owned from years ago.

I'm not really sure why qwen 3.6 35B is so expensive on openrouter, it seems abnormally high for what hardware it takes to run it.

If you're not good at prompting yet, that $10 doesn't go very far. The local model allows me to learn what works and what doesn't without paying for tokens. Then when I know how not to waste them, I'll try a paid model.

There is one side effect of running your LLM locally: you stop thinking about the token budget. I often run `/goal` with no limits, or script an endless loop in bash to run opencode, etc. Sometimes I just brute force the task by throwing a /goal at it. Maybe it's not the most efficient use either, but it's nice to have the option.

Agreed, I'm waiting for the time when 48GB+ ram is just the standard that computers come with rather than being the absolute top tier option. It just doesn't make sense to spend extra on a local AI computer right now when the same money would last for a decade of API pricing.

Those are all pre-rugpull prices though. Give it a year.

I'm having a decently good time time with `qwen3.6-35b-a3b-mtp` (unsloth's multi-token prediction version) and and `qwen-agentworld-35b-a3b`.

what are you using agentworld for?

FWIW I'm running gemma4 31b on my 5090 and it's pretty great as well.

QAT, MTP, 128k context.

I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.

My experience also aligns with this. I'm running gemma4 31B on a 4090 through llm.cpp with unsloth models. I also run Qwen 3.6. Qwen is good for thinking and planning as it is faster, but Gemma4's generated code is much higher quality in the first try (Rust, C++ and C#). so it needs less revisions to be at a level I'm comfortable for merging.

I can't Gemma4 to actually finish a turn properly, it's always ending abruptly or making malformed tool calls. It's probably something I've misconfigured in oMLX or Opencode.

Nice. I flip flop between Qwen 3.5 9B Q6_M and Gemma4 12B Q4_K_M on a 4080 Super. They run at about the same speed and I can have them review each other's plan or diffs. For smaller projects I find them very capable, and I can step up to a better quant for slightly more challenging work.

> ... on my Macbook Max M5 128 GB

Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?

You don't need nearly that much RAM to run Qwen 3.6 27B, though. qwen3.6:27b-q4_K_M is only 17GB, for example.

A 27B model can fit easily on a 32GB VRAM card (e.g. 5090) or a 32GB computer in RAM at FP8/Q8 (unsloth have 28.6GB Q8 files).

For 24GB VRAM cards (e.g. 4090) you can use Q6_K (22.5GB) or Q5_K_M (19.5GB) quants, possibly offloading some of the weights to RAM.

I'm on 128GB ram strix halo, bought framework desktop for a few thousand CAD back when everyone was calling framework desktop overpriced

It wasn't $10k a month ago

I work with a lot of 3D graphics and geo stuff so I can hit the ceiling with my 48 GB mac. It's not all LLM work. I prioritized more storage than RAM with my budget. Being able to run local llms has greatly helped me understand how they work. For day to day dev I pay for Gemini or Claude.

Think commercial. My company invested in a local rig since privacy is important to our customers and sometimes I want to use these models on private data.

Qwen3.6 runs great on GPU with 24GB VRAM. You could get used 3090 for it.

Certainly won't work on my M4 Pro with 24GB lol

If Qwen models are so much easier to run, why are the providers charging more than V4 Flash?

[0]: https://aibenchy.com/compare/qwen-qwen3-6-35b-a3b-medium/qwe... <-- compare how the three models draw hamsters svgs, lol

Also confused by this. Deepseek V4 flash is so much better than Qwen 3.6 yet cheaper to use.

I've also been running Qwen 3.6 35B A3b on my Windows laptop (64 GB RAM, a 4GB GPU) and it's at least tolerable. It's not fast - a few tokens per second, slower than reading speed - but I can give it a task and come back later. That was a $600 laptop off eBay a few years ago, not a $6,000 machine.

Are these unified memory Macs and giant 24GB desktop GPUs achieving dozens or hundreds of tokens per second commensurate with their 10x-20x cost?

What is the speed on responses? (t/s)

The full 128GB is surely helpful in keeping browsers, editors and other things running since even 20-35GB models + k/v caches can eat up a lot of the core 64GB in my experience.

I think the sweet spot right now is 2x 3090s and a pcie 4 motherboard with 64-128 gb of ddr4 ram, you can build this right now for $3k and it runs qwen 27b/35b stupid fast at int4.

I know how to build PCs but suck at picking parts, would you happen to have a recommended build or links to people who've done similar ones? Heck I'll click on an affiliate link to support the author of the build :-)

Does anyone have any suggestions? This thing was fun to set up and it's really cool but it's been a bit disappointing not getting any big tangible results so far.

Yes, this should be a monster machine. Ampere is an older generation, so I expect that's where some of your issues have been

Start with a quant, you can run the Qwen 27B model at 4-bit on one 3090, presumably 6/8-bit on 2x3090.

Agreed. I have a single 9700 and I'm able to fit Q6 27B at 30tps or Q5 35B at 100tps very easily via llamacpp running vulkan.

The results are impressive considering the amount of people trashing AMD and still trying to recommend 3090s. I hope to buy a 2nd one at some point, but I also hate the version hell of vLLM, the R9700, the ROCM version, and Qwen3.6 all not agreeing with each other. I haven't gotten vLLM to run properly for Qwen3.6, since the version that runs on a 9700 doesn't support 3.6 yet.

I'm trying to quickly hack out a optimized path for just Qwen3.6 to run against rocm natively (e.g. my own inference server for 9700s basically) and see if it can perform better than llamacpp vulkan's results.

Word of caution - the last llamacpp with good performance was b9209 from a month ago. After that, for some reason, vulkan performance dropped by 10x, which has made me lose confidence in llamacpp in the long run.

Having said all that, 3x is 96GB for 4k and peak 900 watts. A 96GB Blackwell is $12k and peak 600 watss. And they will have a similar memory throughput (minor negative to the AMD cards for split processing). It's crazy how price efficient the r9700 is compared to the Nvidia cards.

I posted this elsewhere, but Unsloth says the 27B model should run in 18GB. That leaves little RAM for other tasks, but it depends on your tolerance for slowness I suppose. I haven’t tried it in 24GB so report back if you do.

https://unsloth.ai/docs/models/qwen3.6

You might be interested in Ornith 1.0 9B, which is a new intriguing post-training of Qwen 3.5 9B.

Qwen 3.6 27B will run in full offload with a 4-bit quantisation in 64GB on an M1 Max. It is quite slow.

I don't know about 48GB but 64GB should be enough.

Running 27B dense model on M5 128GB is ok, but one can do better.

This is discussed in the article:

"My personal impression is that within these quantizations Qwen 3.6 27B is as good as (or maybe slightly better than) DwarfStar4. Though, I won’t be surprised if for longer context projects DS4 has an edge."

Works beautifully on a 3090, very usable speed. Don't expect Opus 4.8-level performance, but there are some things you just need to keep local.

"DeepSeek-V4-Flash will fit" At Q2, 2bit? Lobotomized to death.

It does about 30 tok/s which is enough for me. It's about half what the online models do, but it's enough.

I've heard their 9B models are also good, but they aren't much faster if you have the ram and a nice cpu.

These qwen3.6 models are the first ones I find can do much. GPT OSS was good, and Gemma4 is better. Gemma knows more facts, but qwen3.6 is smarter.

The MoE models hold up better on old hardware, but the dense models like this post promotes are in fact better. This isn't unique to Qwen. Are the dense models better-enough to use given the performance costs? It depends on what you are doing.

If a model runs fast enough for your use case and does exactly what you need it to, then you don't need a much slower model that might be more accurate. If you do anything more complicated, the dense models become more necessary and they are much more computationally heavy by comparison.

On your hardware an Unsloth quant of Gemma 4 26BA4B QAT would likely give you better results, but because it has 4B active parameters instead of Qwen's 3B active parameters, it will probably run slower.

Mind sharing the command line you use to rig it up?

Call me back when you can run these models on 16GB of RAM and any recent i5/i7. Until then, there’s no point on using these toy models.

Its so funny, these "toy models" would be the wet dreams of researchers not 5 years ago.

Progress marches without mercy.

Hello, it's the internet calling, today is that day.

https://github.com/ikawrakow/ik_llama.cpp

Edit: it's gonna be slow if you're not using any VRAM. But it's possible. Software isn't going to speed that up anytime soon, it's just a hardware bandwidth limit.

You need it to run in about 8 GB so you have extra space for the context window.

They can be ran on 32GB with 8GB VRAM. I don't think these will be on 16GB for a while. (35B MoE)

And AI companies will continue to buy up all the silicon to make this prohibitively expensive to run at home.

It will run (somewhat slowly) on a five year old M1 Max with 64GB RAM.

Personally I prefer the 35B MoE model, which is fast enough to be interactively useful, and capable, but I would probably use the 27B if I wanted to generate whole applications like that.

I am unconvinced that most "local" AI applications need anything much more powerful than the Gemma 4 12B model. Local agentic coding is a small niche, but there are plenty of ways a local model can help with development tasks.

I would really like to see a 12B or 16B Qwen 3.6.

I am currently playing with Ornith 1.0 in the MoE configuration, which is based on the 35B variant of Qwen 3.5; I am not sure if it is better than the 3.6 version.

Benchmarks say it is; my own silly tests either suggest otherwise or suggest that I have to talk to it a bit differently.

However, text-to-speech, speech-to-text, and non-code LLM use cases are so useful to have local, and don't require big hardware.

Having a universal reliable inference engine interface, I think, is the big unlock that needs to happen before app devs can ship these features.

Personal concrete use case: meeting recording app. This uses Parakeet + Qwen to create local transcriptions and post-cleanup, respectively.

While apps can offload some of this to llama.cpp or a similar process over http, that's another set of setup for the user to do before they can have a useful app.

> I don't think coding is one

Certainly this is falsifiable easily by any of us doing it on a regular basis

> Qwen stuck in thought loops

This does happen when context is not managed effectively; creating plans, using subagents and compactions strategically resolves this

[0] https://deepclause.substack.com/p/how-to-make-small-models-p...

What harness are you using?

From the perspective of LLM inference, you currently mostly care about:

- Memory bandwidth; BUT the requirements are currently capped because models have stopped growing at around 1-1.5 trillion parameters for quite a while now. You only need more bandwidth if you're optimizing for the highest possible concurrency (i.e. you're a cloud provider). Also, MoE exists.

- Support for native low-precision math (like FP4 and FP8); BUT once your GPU supports native FP4 (Blackwell+), there's generally no reason for GPUs to go lower because of the obvious quality degradation.

- VRAM capacity - just like memory bandwidth, it's practically capped by 1-1.5 trillion parameter models and is unlikely to need much more in the near future. Also, the current trend is toward miniaturization: modern 30B-class models (which require far less VRAM), now completely destroy 200B-class models from just two years ago on most tasks. We also have better understanding now how to compress contexts.

Most model improvements currently seem to come from RL/harness-based methods, not from scaling models or running new algorithms that require fundamentally new GPUs.

So I don't see why GPUs that exist today must become "outdated" in a few years. They'll be seen as outdated by hyperscalers because they need to serve the maximum number of users as cheaply as possible, so of course they'll replace their GPUs with newer ones that have higher memory bandwidth or more tensor cores. But you don't need that for local inference.

3090 was released six years ago and is still very relevant for running models locally.

Qwen 3.6 35B runs on 32GB with a 1080. That GPU is from 2017.

Is there any hope for people that cant even run 27B parameters, Qwen3.6 or otherwise? Are there any quantized models that do well with tool calling at smaller parameter sizes?

And so far, I cant get even Pi to extend itself or do any meaningful work with any of the models I currently can get to run.

Has anyone considered a home server? Assuming mobility is not important if we pick components to match a similar hardware would it be more value for money?

I find that for local coding, I need to spend a lot of time building concise SKILLs for specific things I work on and try to only enable one or two skills per coding session.

To the author of the linked article nice job, and if you feel like adding to it, please add details on your setup.

I've been using the full GLM 5.2 model this way (through opencode) at work for the past week. It's quite impressive.

I'm surprised no one has else has mentioned - low power mode.

With no speculative decoding, using high power mode, I get 80 t/s on 35B A3B - and it gets hot and spins up. On low power mode I get 38 t/s - no fans, cool to warm laptop.

If you currently don't use speculative decoding and you start using it, it can nearly offset the difference between high and low power, and it's night and day experience.

I almost always keep my laptop on low power mode.

> MacBook Pro M5 128GB RAM

614 GB/s of memory bandwidth

> MacMini M4 with 64GB of RAM

273 GB/s of memory bandwidth (also only currently available with 48GB)

When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible. In this case a hypothetical Mini with 1TB of memory would still be over 2x slower with 27-35B models.

And FWIW I have an M4 Max MBP 128GB that I keep on a Roost laptop stand, with a separate keyboard/mouse/video. It does fire up the cooling jets when running local LLMs, but stays within tolerance for me on noise. I haven't heat-tested it on longer runs, but I imagine the risen airflow helps a ton.

I opted to buy a normal 32GB laptop for this very reason. I know how loud and hot the GPUs in my desktop run when running even smallish models like Qwen 27B or Gemma 4 31B (which is a better model for most than Qwen 3.6, despite the benchmarks). I also have a Strix Halo which doesn't get loud, because it has a single huge fan, but it does get hot. So, there's no way a laptop could work as hard as models make them work, and not be unbearable. Tiny fans trying to remove all that heat? They gotta be screaming. No reason to spend all that money on a laptop that I couldn't realistically make use of. I do run a lot of VMs on my desktop, but I can get to those on a VPN.

It's a nice idea to run a model on a laptop so you can work anywhere...but, that's a job for models in the cloud. Not much data has to traverse the network, so it's not a big deal. Or one could also setup a VPN so you can reach a self-hosted model on a big box at home for things that require data privacy.

All that said, there are models that work great on very small devices for some tasks and won't work it to death. Gemma 4 12B QAT 4-bit runs on a 16GB device, maybe even smaller, including a tablet. It's the best self-hostable vision model I've tested for my purposes (categorization, identification, labeling, type stuff), beating much larger models. It's also a decent conversationalist with good prose but it doesn't know much of anything (not a lot of the world fits in 7GB), so it needs search if you want to use it for research. It's a pretty good tool user. I definitely wouldn't want to use it for code, though, beyond very simple stuff.

> The reason is simple: your fingers will burn and your head will explode from the noise.

So, just buy a mac mini and put it in the other room? ( Like everyone was doing in February? :)

I've been running coding agents on my laptop in yolo mode for the past half year or so (though mostly not local ones, laptop too slow!) and the way I'm doing that without terror is that I just gave them their own Linux user "agent". They're free to nuke their homedir /agent, and they can't touch (or even read) mine.

There's some slight ergonomics issues (I need to sudo into the user to do anything, but I set up an alias for it), sometimes I get issues with permissions or ownership (gave up on "sticky bits" and just made a function I can run once a day when it breaks).

There's enough hassle that I wish I just had a dedicated machine for it, and then I'd just give them root on it. (For giggles I gave claude root on a $3 VPS and that's going just fine...)

But yeah after months of trial and error I reinvented "just buy a mac mini" from first principles...

In general if you're setting up a local LLM you should assume it's going to be primarily working as a server and talking to various clients. I use my MBP, but that's because I don't travel much anymore so it can happily work as a server at all times. With the right agent setup you can probably manage most things from your phone even if you don't have a seperate machine to use as a client.

I have an older laptop I run a hermes agent on backed by an API based open (non-local) model and Macbook Pro M4 for running another model locally (also using hermes). The agents have a Mattermost (open source version of slack) server they run and I run Mattermost on my phone so I can talk to them and task them with things. In fact, it was through the hermes WhatsApp endpoint that I got the first agent (non-local) to setup the Mattermost server and unboard the second agent (local mbp).

Then I can just chat with them through Mattermost when I need work done. Whenever I need something done I just hope on the Mattermost server and chat with them. I've had them build me multiple research reports (the fully local agent did awesome at this), learn how to use Stable Diffusion on my desktop to generate images, install and perform maintenance on various local services I run (including Open WebUI).

I have that model, and do local LLMs and local image generation. DO buy this if you plan on serious local LLM use and enjoy working from anywhere.

Don't expect workstation loads with no fan or heatsink, true. But it's not a real problem, it's still quieter than a desktop.

That said, rather than Mac Mini, if you only work from one place, I'd recommend a Studio Ultra M3 with 512GB. Same or more tokens per second, multiple models loaded. Cool and quiet.

I think there is no reasonably priced machine you could run locally to do serious work with LLMs...

10x rtx6000 Pro in a large workstation is probably the way to go for someone wanting to run GLM5.2.

Other than that it is cloud.

As good as these small models got we are still not "at breakeven" for me.

What is "breakeven" with LLMs? For me it is when I no longer have to read the actual code it wrote. I can trust that if I told it to implement and document a certain architecture it actually did that with no stupid mistakes.

The first model ever that did that for me was the first opus. 4.4 if I remember correctly.

The second model was Gemini 3 Pro preview. For few weeks. Then it was lobotomised. I guess it was too expensive to run and they quantized it too hell.

Only Opus remains. If this GLM model truly rivals even an old opus I'll be very happy when day comes that I'll be able to run it locally.

Nope, have both these machines, can confirm the M5 max blows the M4 mini away. It does get hot, but I use it mostly with an external monitor and keyboard. Conceptually I like the headless model better with a workstation, but work was buying the M5 and can't get it in any other form factor at the monute.

Apple does not sell a 64GB variant of the M4 Mac Mini. IIRC they never have; its always capped out at 48GB.

If you were planning on getting an M5 128GB; just get a DGX Spark (~$4500) or a 5090-equipped machine (~$4500) plus a Macbook Air (~$1500). You'll come in below the M5 Max 128 pricing (~$6700+ USD) and be happier for it.

running potentially sota open-weight models locally only became a thing in fall 2023.

if a hardware cycle takes ~3 years then fall 2026 would be the first possible device generation where apple exploits its advantage with the unified ram architecture.

more realistically, spring 2027, since they probably also needed some time to make up their minds to lean into that on the top end.

that`s also how i would interpret the recent rumors on m6 and m7.

naturally, the cooling and all that will be optimized around that.

so the first devices that are actually intended and designed for this use case will come at the earliest this fall and more likely in q1/q2 next year.

you are basically paying the price now to be on the bleeding (sweating) edge

The maths there is pretty undeniable, but it is not where I'd make the split. Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.

I don't know how much serious hands-free agentic coding I will ever do on my MacBook alone, but I do know that I would not have got so far into understanding this without tinkering with local models, llama.cpp, LM Studio, and LM Studio and all that.

I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.

Until, that is, I could poke around with setting it up on my own (secondhand) machine, watching the API calls, understanding some of the terminology. I didn't even buy the machine for that; it's just adequate to the task.

The Neo is too small to really get much benefit from this opportunity to make it more visceral and knowable.

You can also run Qwen 3.6 27B dense model on DGX Spark with comparable performance [1][2] for about $4000 (Asus Ascent GX10 is $3999 at various retailers).

In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the Macbook Pro and GB10.

[1] https://x.com/MiaAI_lab/status/2070859135399182444

[2] https://github.com/MiaAI-Lab/Qwen3.6-27B-NVFP4-vLLM

I'm running it on my 4070 12gb with 96gb mem, I'm very happy with the results even if I have to wait a couple minutes for results. To me this is far better than I expected and will continue to use it and improve with skills.md. Pi.dev is amazing by the way.

The model they reference can be easily run with 24gb+ of VRAM, and there are other similar models capable of running easily on 16gb of VRAM. It's not like 128gb is a requirement here.

But the tokens or credits are gone. MacBook stays. You can run other models on the same MacBook. What I read people burn every month on saas… for that money you break even on that MacBook in 5 months.

Edit: it’s not just “data privacy”, when you are using Claude, you are shipping EVERYTHING to Anthropic. It’s crazy.

That $6700 is a $5000 upgrade over a base model Macbook Pro.

$5000 in US Treasuries (currently at 4.89%) yields $244.5/yr. That's more than enough to cover the annual Claude Pro subscription ($200/yr) which includes Claude Code with lots of Sonnet usage (far better than Qwen 3.6)

> The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]

Qwen3.6-27B would be faster on a 3090 that costs around $1000-1200 though so I don't think it's a good counter-argument.

Op just happened to have that MacBook, but it doesn't mean it's necessary to run the model.

Just putting it out there: I run Qwen 3.6 on my M1 Mac Studio with 64gb. It's quantized and all that, but I agree with TFA: it's the sweet spot for local development right now.

For that price you can put together a PC with 128GB of ram ($2000) and an RTX 5090 ($3600) and get 70-100 tokens per second instead of 45

Isn't the directionality important. I.e. it is currently possible to run useful / great models locally, but on high end machines; and in a few years we will likely be able to run even better models on standard machines.