Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All

Already quantized/converted into a sane format by Unsloth:

https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

A relief to see the Qwen team still publishing open weights, after the kneecapping [1] and departures of Junyang Lin and others [2]!

[1] https://news.ycombinator.com/item?id=47246746 [2] https://news.ycombinator.com/item?id=47249343

I recall a Qwen exec posted a public poll on Twitter, asking which model from Qwen3.6 you want to see open-sourced; and the 27b variant was by far the most popular choice. Not sure why they ignored it lol.

Related interesting find on Qwen.

"Qwen's base models live in a very exam-heavy basin - distinct from other base models like llama/gemma. Shown below are the embeddings from randomly sampled rollouts from ambiguous initial words like "The" and "A":"

https://xcancel.com/N8Programs/status/2044408755790508113

Nice release from the Qwen team.

Small openweight coding models are, imho, the way to go for custom agents tailored to the specific needs of dev shops that are restricted from accessing public models.

I'm thinking about banking and healthcare sector development agencies, for example.

It's a shame this remains a market largely overlooked by Western players, Mistral being the only one moving in that direction.

I tried Gemma 4 A4B and was surprised how hart it is to use it for agentic stuff on a RTX 4090 with 24gb of ram.

Balancing KV Cache and Context eating VRam super fast.

I hope the other sizes are coming too(9B for me). Can't fit much context with this on a 36GB mac.

Fingers crossed for mid and larger models as well. I'd personally love to see Qwen3.6-122B-A10B.

This is surprisingly close to Haiku quality, but open - and Haiku is quite a capable model (many of the Claude Code subagents use it).

I'm a newbie here and lost how I'm supposed to use these models for coding. When I use them with Continue in VSCode and start typing basic C:

  #include <stdio.h>
  int m

I get nonsensical autocompletions like:

  #include <stdio.h>
  int m</fim_prefix>

What is going on?

I'm broadly curious how people are using these local models. Literally, how are they attaching harnesses to this and finding more value than just renting tokens from Anthropic of OpenAI?

Honestly, this is the AI software I actually look forward to seeing. No hype about it being too dangerous to release. No IPO pumping hype. No subscription fees. I am so pumped to try this!

Dangit, I'll need to give this a run on my personal machine. This looks impressive.

At the time of writing, all deepseek or qwen models are de facto prohibited in govcon, including local machine deployments via Ollama or similar. Although no legislative or executive mandate yet exists [1], it's perceived as a gap [2], and contracts are already including language for prohibition not just in the product but any part of the software environment.

The attack surface for a (non-agentic) model running in local ollama is basically non-existent . . but, eh . . I do get it, at some level. While they're not l33t haXX0ring your base, the models are still largely black boxes, can move your attention away from things, or towards things, with no one being the wiser. "Landing Craft? I see no landing craft". This would boil out in test, ideally, but hey, now you know how much time your typical defense subcon spends in meaningful software testing[3].

[1] See also OMB Memorandum M-25-22 (preference for AI developed and produced in the United States), NIST CAISI assessment of PRC-origin AI models as "adversary AI" (September 2025), and House Select Committee on the CCP Report (April 16, 2025), "DeepSeek Unmasked".

[2] Overall, rather than blacklist, I'd recommend a "whitelist" of permitted models, maintained dynamically. This would operate the same way you would manage libraries via SSCG/SSCM (software supply chain governance/management) . . but few if any defense subcons have enough onboard savvy to manage SSCG let alone spooling a parallel construct for models :(. Soooo . . ollama regex scrubbing it is.

[3] i.e. none at all, we barely have the ability to MAKE anything like software, given the combination of underwhelming pay scales and the fact defense companies always seem to have a requirement for on-site 100% in some random crappy town in the middle of BFE. If it wasn't for the downturn in tech we wouldn't have anyone useful at all, but we snagged some silcon refugees.

"open source"

give me the training data?

how does this compare to gpt-oss-120b? It seems weird to leave it out.

Anybody use these instead of codex or claude code? Thoughts in comparison?

benchmarks dont really help me so much

China won again in terms of openness

What kind of hardware (preferably non-Apple) can run this model? What about 122B?

Planning to deploy Qwen3.6-35B-A3B on NVIDIA Spark DGX for multi-agent coding workflows. The 3B active params should help with concurrent agent density.

Wowzers, we were worried Qwen was going to suffer having lost several high profile people on the team but that's a huge drop.

It's better than 27b?

What do all the numbers 6-35B-A3B mean?

I tried the car wash puzzle:

You want to wash your car. Car wash is 50m away. Should you walk or go by car?

> Walk. At 50 meters, the round trip is roughly 100 meters, taking about two minutes on foot. Driving would require starting the engine, navigating, parking, and dealing with unnecessary wear for a negligible distance. Walk to the car wash, and if the bay requires the vehicle inside, have it moved there or return on foot. Walking is faster and more efficient.

Classic response. It was really hard to one shot this with Qwen3.5 Q4_K_M.

Qwen3.6 UD-IQ4_XS also failed the first time, then I added this to the system prompt:

> Double check your logic for errors

Then I created a new dialog and asked the puzzle and it responded:

> Drive it. The car needs to be present to be washed. 50 meters is roughly a 1-minute walk or a 10-second drive. Walking leaves the car behind, making the wash impossible. Driving it the short distance is the only option that achieves the goal.

Now 3.6 gets it right every time. So not as great as a super model, but definitely an improvement.

How does this compare to the commercial models like Sonnet 4.5 or GPT? Close enough that the price is right (free)?

I'm disappointed they didn't release a 27B dense model. I've been working with Qwen3.5-27B and Qwen3.5-35B-A3B locally, both in their native weights and the versions the community distilled from Opus 4.6 (Qwopus), and I have found I generally get higher quality outputs from the 27B dense model than the 35B-A3B MOE model. My basic conclusion was that MoE approach may be more memory efficient, but it requires a fairly large set of active parameters to match similarly sized dense models, as I was able to see better or comparable results from Qwen3.5-122B-A10B as I got from Qwen3.5-27B, however at a slower generation speed. I am certain that for frontier providers with massive compute that MoE represents a meaningful efficiency gain with similar quality, but for running models locally I still prefer medium sized dense models.

I'll give this a try, but I would be surprised if it outperforms Qwen3.5-27B.

Does anyone have any experience with Qwen or any non-Western LLMs? It's hard to get a feel out there with all the doomerists and grifters shouting. Only thing I need is reasonable promise that my data won't be used for training or at least some of it won't. Being able to export conversations in bulk would be helpful.

I don't want "Agentic Power".

I want to reduce AI to zero. Granted, this is an impossible to win fight, but I feel like Don Quichotte here. Rather than windmill-dragons, it is some skynet 6.0 blob.

More benchmaxxing I see. Too bad there’s no rig with 256gb unified ram for under $1000

Related interesting find on Qwen.

https://xcancel.com/N8Programs/status/2044408755790508113

A relief to see the Qwen team still publishing open weights, after the kneecapping [1] and departures of Junyang Lin and others [2]!

[1] https://news.ycombinator.com/item?id=47246746 [2] https://news.ycombinator.com/item?id=47249343

This is just one model in the Qwen 3.6 series. They will most likely release the other small sizes (not much sense in keeping them proprietary) and perhaps their 122A10B size also, but the flagship 397A17B size seems to have been excluded.

I really wish they released qwen-image 2.0 as open weights.

I'm a newbie here and lost how I'm supposed to use these models for coding. When I use them with Continue in VSCode and start typing basic C:

  #include <stdio.h>
  int m

I get nonsensical autocompletions like:

  #include <stdio.h>
  int m</fim_prefix>

What is going on?

This is surprisingly close to Haiku quality, but open - and Haiku is quite a capable model (many of the Claude Code subagents use it).

I hope the other sizes are coming too(9B for me). Can't fit much context with this on a 36GB mac.

How does this compare to the commercial models like Sonnet 4.5 or GPT? Close enough that the price is right (free)?

Nice release from the Qwen team.

Small openweight coding models are, imho, the way to go for custom agents tailored to the specific needs of dev shops that are restricted from accessing public models.

I'm thinking about banking and healthcare sector development agencies, for example.

It's a shame this remains a market largely overlooked by Western players, Mistral being the only one moving in that direction.

I tried Gemma 4 A4B and was surprised how hart it is to use it for agentic stuff on a RTX 4090 with 24gb of ram.

Balancing KV Cache and Context eating VRam super fast.

Dangit, I'll need to give this a run on my personal machine. This looks impressive.

Planning to deploy Qwen3.6-35B-A3B on NVIDIA Spark DGX for multi-agent coding workflows. The 3B active params should help with concurrent agent density.

China won again in terms of openness

These are not autocomplete models. It’s built to be used with an agentic coding harness like Pi or OpenCode.

This might sound snarky but in all earnestness, try talking to an AI about your experience using it.

Choose the correct FIM (Fill In the Middle) template for Qwen in Continue. All recent Qwen models are actually trained with FIM capability and you can use them.

I would use something like zeta-2 instead - https://huggingface.co/bartowski/zed-industries_zeta-2-GGUF

Where did you see a haiku comparison? Haiku 4.5 was my daily driver for a month or so before Opus 4.5 dropped and would be unreasonably happy if a local model can give me similar capability

It's a MoE model and the A3B stands for 3 Billion active parameters, like the recent Gemma 4.

You can try to offload the experts on CPU with llama.cpp (--cpu-moe) and that should give you quite the extra context space, at a lower token generation speed.

can you elaborate? you can use quantized version, would context still be an issue with it?

The will not measure up. Notice they're comparing it to Gemma, Google's open weight model, not to Gemini, Sonnet, or GPT. That's fine - this is a tiny model.

If you want something closer to the frontier models, Qwen3.6-Plus (not open) is doing quite well[1] (I've not tested it extensively personally):

https://qwen.ai/blog?id=qwen3.6

> Close enough

No. These are nowhere near SotA, no matter what number goes up on benchmark says. They are amazing for what they are (runnable on regular PCs), and you can find usecases for them (where privacy >> speed / accuracy) where they perform "good enough", but they are not magic. They have limitations, and you need to adapt your workflows to handle them.

I think its worth noting that if you are paying for electricity Local LLM is NOT free. In most cases you will find that Haiku is cheaper, faster, and better than anything that will run on your local machine.

> It's a shame this remains a market largely overlooked by Western players, Mistral being the only one moving in that direction.

I've said in a recent comment that Mistral is the only one of the current players who appear to be moving towards a sustainable business - all the other AI companies are simply looking for a big payday, not to operate sustainably.

I play with the small open weight models and I disagree. They are fun, but they are not in the same class as hosted models running on big hardware.

If some organization forbade external models they should invest in the hardware to run bigger open models. The small models are a waste of time for serious work when there are more capable models available.

I agree with the sentiment, but these models aren't suited for that. You can run much bigger models on prem with ~100k of hardware, and those can actually be useful in real-world tasks. These small models are fun to play with, but are nowhere close to solving the needs of a dev shop working in healthcare or banking, sadly.

I love the idea of building competitor to open weight models but damn is this an expensive game to play

How true is this? How does a regulated industry confirm the model itself wasn't trained with malicious intent?

The Chinese models are generally pretty good.

> Only thing I need is reasonable promise that my data won't be used

Only way is to run it local.

I personally don’t worry about this too much. Things like medical questions I tend to do against local models though

Personally, I wouldn't trust any foreign or domestic LLM providers to not train on your data. I also wouldn't trust them to not have a data breach eventually which is worse. If you're really worried about your data, run it locally. The Chinese models (Qwen, GLM, etc.) are really competitive to my understanding.

I used Qwen CLI's undescribed “coder_agent” (I guess Qwen 3.5 with size auto-selection) and it was powerful enough to complete 95% of a small hobby project involving coding, reverse engineering and debugging. Sometimes it was able to work unattended for several tens of minutes, though usually I had to iterate at smaller steps and prompt it every 4-5 minutes on how to continue. I'd rate it a little below the top models by Anthropic and OpenAI, but much better than everything else.

> Does anyone have any experience with Qwen or any non-Western LLMs?

I use GLM-5.1 for coding hobby project, that going to end up on github anyway. Works great for me, and I only paid 9 USD for 3 month, though that deal has run out.

> my data won't be used for training

Yeah, I don't know. Doubt it.

I'm broadly curious how people are using these local models. Literally, how are they attaching harnesses to this and finding more value than just renting tokens from Anthropic of OpenAI?

I'll give this a try, but I would be surprised if it outperforms Qwen3.5-27B.

It's a given that the dense models with comparable size are better. I also proved that in my use case for those two Qwen 3.5 models.

The benchmarks show 3.6 is a bit better than 3.5. I should retry my task, but I don't have a lot of confidence. But it does sound like they worked on the right thing which is getting closer to the 27B performance.

You are right, but this is just the first open-weights model of this family.

They said that they will release several open-weights models, though there was an implication that they might not release the biggest models.

The 27B model is dense. Releasing a dense model first would be terrible marketing, whereas 35A3B is a lot smarter and more quick-witted by comparison!

Probably coming next

I'm guessing 3.5-27b would beat 3.6-35b. MoE is a bad idea. Because for the same VRAM 27b would leave a lot more room, and the quality of work directly depends on context size, not just the "B" number.

What do all the numbers 6-35B-A3B mean?

3.6 is the release version for Qwen. This model is a mixture of experts (MoE), so while the total model size is big (35 billion parameters), each forward pass only activates a portion of the network that’s most relevant to your request (3 billion active parameters). This makes the model run faster, especially if you don’t have enough VRAM for the whole thing.

The performance/intelligence is said to be about the same as the geometric mean of the total and active parameter counts. So, this model should be equivalent to a dense model with about 10.25 billion parameters.

The 6 is part of 3.6, the model version. 35B parameters, A3B means it's a mixture of experts model with only 3B parameters active in any forward pass.

3.6 is model number, 35B is total number of parameters, A3B means that only 3B parameters are activated, which has some implications for serving (either in you you shard the model, or you can keep the total params on RAM and only road to VRAM what you need to compute the current token, which will make it slower, but at least it runs)

35B (35 billion) is the number of parameters this model has. Its a Mixture of Experts model (MoE) so A3B means that 3B parameters are Active at any moment.

how does this compare to gpt-oss-120b? It seems weird to leave it out.

OSS-120 is too old to be relevant, and four times the size.

GPT-OSS 120B (really 117B-A5.1B) is a lot bigger. better comparison would be to 20B (21B-A3.6B).

Fingers crossed for mid and larger models as well. I'd personally love to see Qwen3.6-122B-A10B.

That would be really great. Though 3.5 122B is already doing a lot of work in our setup.

Already quantized/converted into a sane format by Unsloth:

https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

Unsloth is great for uploading quants quickly to experiment with, but everyone should know that they almost always revise their quants after testing.

If you download the release day quants with a tool that doesn’t automatically check HF for new versions you should check back again in a week to look for updated versions.

Some times the launch day quantizations have major problems which leads to early adopters dismissing useful models. You have to wait for everyone to test and fix bugs before giving a model a real evaluation.

Why doesn't Qwen itself release the quantized model? My impression is that quantization is a highly nontrivial process that can degrade the model in non-obvious ways, thus its best handled by people who actually built the model, otherwise the results might be disappointing.

Users of the quantized model might be even made to think that the model sucks because the quantized version does.

How much VRAM does it need? I haven't run a local model yet, but I did recently pick up a 16GB GPU, before they were discontinued.

I sense that I don't really understand enough of your comment to know why this is important. I hope you can explain some things to me:

- Why is Qwen's default "quantization" setup "bad" - Who is Unsloth? - Why is his format better? What gains does a better format give? What are the downsides of a bad format? - What is quantization? Granted, I can look up this myself, but I thought I'd ask for the full picture for other readers.

There's absolutely nothing wrong it insane with a safetensors file. It might be less convenient than a single file gguf. But that's just laziness not insanity

So I can use this in claude code with `ollama run claude`?

lmao they are so fast yooo

I tried the car wash puzzle:

You want to wash your car. Car wash is 50m away. Should you walk or go by car?

Classic response. It was really hard to one shot this with Qwen3.5 Q4_K_M.

Qwen3.6 UD-IQ4_XS also failed the first time, then I added this to the system prompt:

> Double check your logic for errors

Then I created a new dialog and asked the puzzle and it responded:

Now 3.6 gets it right every time. So not as great as a super model, but definitely an improvement.

Interestingly, Gemma4-26B IQ4_XS gets it correct:

> This sounds like a logic riddle! The answer is: You should go by car. Here is why: If you walk, you will arrive at the car wash, but your car will still be 50 meters away at home. You can't wash the car if the car isn't there! To accomplish your goal, you have to drive the car to the car wash.

It has the wrong one in thinking. It did think longer than usual:

Direct answer: Walk.

Reasoning 1: Distance (50m is negligible).

Reasoning 2: Practicality/Efficiency (engine wear/fuel).

Reasoning 3: Time (walking is likely faster or equal when considering car prep).

...

Wait, if I'm washing the car, I need to get the car to the car wash. The question asks how I should get there.

...

Wait, let's think if there's a trick. If you "go by car," you are moving the car to the destination. If you "walk," you are just moving yourself.

Conclusion: You should drive the car.

Honestly, this is the AI software I actually look forward to seeing. No hype about it being too dangerous to release. No IPO pumping hype. No subscription fees. I am so pumped to try this!

Same here. I really hope in a near future local model will be good enough and hardware fast enough to run them to become viable for most use cases

More benchmaxxing I see. Too bad there’s no rig with 256gb unified ram for under $1000

Sir, this is 2026. You're not getting any 128GB of RAM for under $1k.

taps the sign

  Unified Memory Is A Marketing Gimmeck. Industrial-Scale Inference Servers Do Not Use It.

"open source"

give me the training data?

The training data is the entire internet. How do you propose they ship that to you

You ARE the training data

What kind of hardware (preferably non-Apple) can run this model? What about 122B?

The 3B active is small enough that it's decently fast even with experts offloaded to system memory. Any PC with a modern (>=8 GB) GPU and sufficient system memory (at least ~24 GB) will be able to run it okay; I'm pretty happy with just a 7800 XT and DDR4. If you want faster inference you could probably squeeze it into a 24 GB GPU (3090/4090 or 7900 XTX) but 32 GB would be a lot more comfortable (5090 or Radeon Pro).

122B is a more difficult proposition. (Also, keep in mind the 3.6 122B hasn't been released yet and might never be.) With 10B active parameters offloading will be slower - you'd probably want at least 4 channels of DDR5, or 3x 32GB GPUs, or a very expensive Nvidia Pro 6000 Blackwell.

You won't like it, but the answer is Apple. The reason is the unified memory. The GPU can access all 32gb, 64gb, 128gb, 256gb, etc. of RAM.

An easy way (napkin math) to know if you can run a model based on it's parameter size is to consider the parameter size as GB that need to fit in GPU RAM. 35B model needs atleast 35gb of GPU RAM. This is a very simplified way of looking at it and YES, someone is going to say you can offload to CPU, but no one wants to wait 5 seconds for 1 token.

I run Gemma 4 26B-A4B with 256k context (maximum) on Radeon 9070XT 16GB VRAM + 64GB RAM with partial GPU offload (with recommended LMStudio settings) at very reasonable 35 tokens per second, this model is similiar in size so I expect similar performance.

I can run this on an AMD Framework laptop. A Ryzen 7 (I dont have Ryzen AI, just Ryzen 7 7840U) with 32+48 GB DDR. The Ryzen unified memory is enough, I get 26GB of VRAM at least.

Fedora 43 and LM Studio with Vulkan llama.cpp

The Q5 quantization (26.6GB) should easily run on a 32GB 5090. The Q4 (22.4GB) should fit on a 24GB 4090, but you may need to drop it down to Q3 (16.8GB) when factoring in the context.

You can also run those on smaller cards by configuring the number of layers on the GPU. That should allow you to run the Q4/Q5 version on a 4090, or on older cards.

You could also run it entirely on the CPU/in RAM if you have 32GB (or ideally 64GB) of RAM.

The more you run in RAM the slower the inference.

Any good gaming pc can run the 35b-a3 model. Llama cpp with ram offloading. A high end gaming PC can run it at higher speeds. For your 122b, you need a lot of memory, which is expensive now. And it will be much slower as you need to use mostly system ram.

I currently run the qwen3.5-122B (Q4) on a Strix Halo (Bosgame M5) and am pretty happy with it. Obviously much slower than hosted models. I get ~ 20t/s with empty context and am down to about 14t/s with 100k of context filled.

No tuning at all, just apt install rocm and rebuilding llama.cpp every week or so.

Anybody use these instead of codex or claude code? Thoughts in comparison?

benchmarks dont really help me so much

In my test case (a feature all models got stuck on a few months ago) it just gets stuck in a thinking loop and never gets anywhere. Not a super amazing test, but it happened a few times in a row, so...

Wowzers, we were worried Qwen was going to suffer having lost several high profile people on the team but that's a huge drop.

It's better than 27b?

Their previous model Qwen3.5 was available in many sizes, from very small sizes intended for smartphones, to medium sizes like 27B and big sizes like 122B and 397B.

This model is the first that is provided with open weights from their newer family of models Qwen3.6.

Judging from its medium size, Qwen/Qwen3.6-35B-A3B is intended as a superior replacement of Qwen/Qwen3.5-27B.

It remains to be seen whether they will also publish in the future replacements for the bigger 122B and 397B models.

The older Qwen3.5 models can be also found in uncensored modifications. It also remains to be seen whether it will be easy to uncensor Qwen3.6, because for some recent models, like Kimi-K2.5, the methods used to remove censoring from older LLMs no longer worked.

I don't want "Agentic Power".

I want to reduce AI to zero. Granted, this is an impossible to win fight, but I feel like Don Quichotte here. Rather than windmill-dragons, it is some skynet 6.0 blob.

Qwen3.5-9B has been extremely useful for local fuzzy table extraction OCR for data that cannot be sent to the cloud.

The documents have subtly different formatting and layout due to source variance. Previously we used a large set of hierarchical heuristics to catch as many edge cases as we could anticipate.

Now with the multi-modal capabilities of these models we can leverage the language capabilities along side vision to extract structured data from a table that has 'roughly this shape' and 'this location'.

I used vLLM and qwen3-coder-next to batch-process a couple million documents recently. No token quota, no rate limits, just 100% GPU utilization until the job was done.

Some tasks don’t require SOTA models. For translating small texts I use Gemma 4 on my iPhone because it’s faster and better than Apple Translate or Google Translate and works offline. Also if you can break down certain tasks like JSON healing into small focused coding tasks then local models are useful

Idk about everyone else, but I don’t want to rent tokens forever. I want a self hosted model that is completely private and can’t be monitored or adulterated without me knowing. I use both currently, but I am excited at the prospect of maybe not having to in the near to mid future.

I’ve increasingly started self hosting everything in my home lately because I got tired of SAAS rug pulls and I don’t see why LLM’s should eventually be any different.

I've been largely using Qwen3.5-122b at 6 bit quant locally for some c++/go/python dev lately because it is quite capable as long as I can give it pretty specific asks within the codebase and it will produce code that needs minimal massaging to fit into the project.

I do have a $20 claude sub I can fall back to for anything qwen struggles with, but with 3.5 I have been very pleased with the results.

I use LMStudio to host and run GLM 4.7 Flash as a coding agent. I use it with the Pi coding agent, but also use it with the Zed editor agent integrations. I've used the Qwen models in the past, but have consistently come back to GLM 4.7 because of its capabilities. I often use Qwen or Gemma models for their vision capabilities. For example, I often will finish ML training runs, take a photo of the graphs and visualizations of the run metrics and ask the model to tell me things I might look at tweaking to improve subsequent training runs. Qwen 3.5 0.8b is pretty awesome for really small and quick vision tasks like "Give me a JSON representation of the cards on this page".

It’s easy to find a combination of llama.cpp and a coding tool like OpenCode for these. Asking an LLM for help setting it up can work well if you don’t want to find a guide yourself.

> and finding more value than just renting tokens from Anthropic of OpenAI?

Buying hardware to run these models is not cost effective. I do it for fun for small tasks but I have no illusions that I’m getting anything superior to hosted models. They can be useful for small tasks like codebase exploration or writing simple single use tools when you don’t want to consume more of your 5-hour token budget though.

The people i know that use local models just end up with both.

The local models don’t really compete with the flagship labs for most tasks

But there are things you may not want to send to them for privacy reasons or tasks where you don’t want to use tokens from your plan with whichever lab. Things like openclaw use a ton of tokens and most of the time the local models are totally fine for it (assuming you find it useful which is a whole different discussion)

While they can be run locally, and most of the discussion on HN about that, I bet that if you look at total tok/day local usage is a tiny amount compared to total cloud inference even for these models. Most people who do use them locally just do a prompt every now and then.

The privacy/data security angle really is important in some regions and industries. Think European privacy laws or customers demanding NDAs. The value of Anthropic and OpenAI is zero for both cases, so easy to beat, despite local models being dumber and slower.

I really wish they released qwen-image 2.0 as open weights.

Then who is Rocinante?

Is there any source for these claims?

How many people/hackernews can run a 397b param model at home? Probably like 20-30.

397A17B = 397B total weights, 17B per expert?

I used vLLM and qwen3-coder-next to batch-process a couple million documents recently. No token quota, no rate limits, just 100% GPU utilization until the job was done.

Qwen3.5-9B has been extremely useful for local fuzzy table extraction OCR for data that cannot be sent to the cloud.

The documents have subtly different formatting and layout due to source variance. Previously we used a large set of hierarchical heuristics to catch as many edge cases as we could anticipate.

I’ve increasingly started self hosting everything in my home lately because I got tired of SAAS rug pulls and I don’t see why LLM’s should eventually be any different.

This might sound snarky but in all earnestness, try talking to an AI about your experience using it.

Choose the correct FIM (Fill In the Middle) template for Qwen in Continue. All recent Qwen models are actually trained with FIM capability and you can use them.

I would use something like zeta-2 instead - https://huggingface.co/bartowski/zed-industries_zeta-2-GGUF

I play with the small open weight models and I disagree. They are fun, but they are not in the same class as hosted models running on big hardware.

I love the idea of building competitor to open weight models but damn is this an expensive game to play

The Chinese models are generally pretty good.

> Only thing I need is reasonable promise that my data won't be used

Only way is to run it local.

I personally don’t worry about this too much. Things like medical questions I tend to do against local models though

You can also rent a cloud GPU which is relatively affordable.

Have you tried asking about sensitive topics?

I asked it if there were out of bounds topics but it never gave me a list.

See its responses:

Convo 1

- Q: ok tell me about taiwan

- A: Oops! There was an issue connecting to Qwen3.6-Plus. Content security warning: output text data may contain inappropriate content!

Convo 2

- Q: is winnie the pooh broadcasted in china?

- A: Oops! There was an issue connecting to Qwen3.6-Plus. Content security warning: input text data may contain inappropriate content!

These seem pretty bad to me. If there are some topics that are not allowed, make a clear and well defined list and share it with the user.

> Does anyone have any experience with Qwen or any non-Western LLMs?

I use GLM-5.1 for coding hobby project, that going to end up on github anyway. Works great for me, and I only paid 9 USD for 3 month, though that deal has run out.

> my data won't be used for training

Yeah, I don't know. Doubt it.

These are not autocomplete models. It’s built to be used with an agentic coding harness like Pi or OpenCode.

Where did you see a haiku comparison? Haiku 4.5 was my daily driver for a month or so before Opus 4.5 dropped and would be unreasonably happy if a local model can give me similar capability

It's a MoE model and the A3B stands for 3 Billion active parameters, like the recent Gemma 4.

You can try to offload the experts on CPU with llama.cpp (--cpu-moe) and that should give you quite the extra context space, at a lower token generation speed.

can you elaborate? you can use quantized version, would context still be an issue with it?

The will not measure up. Notice they're comparing it to Gemma, Google's open weight model, not to Gemini, Sonnet, or GPT. That's fine - this is a tiny model.

If you want something closer to the frontier models, Qwen3.6-Plus (not open) is doing quite well[1] (I've not tested it extensively personally):

https://qwen.ai/blog?id=qwen3.6

> It's a shame this remains a market largely overlooked by Western players, Mistral being the only one moving in that direction.

> Close enough

It's a given that the dense models with comparable size are better. I also proved that in my use case for those two Qwen 3.5 models.

Probably coming next

$20 for 3 months is still far better than alternatives, and 5.1 works great

Is it really better? In which languages?

They are but the IDE needs to be integrated with them.

Qwen specifically calls out FIM (“fill in the middle”) support on the model card and you can see it getting confused and posting the control tokens in the example here.

And even of those models trained for tool calling and agentic flows, mileage may vary depending on lots of factors. Been playing around with smaller local models (Anything that fits on 4090 + 64gb RAM) and it is a lottery it seems on a) if it works at all and b) how long it will work for.

Sometimes they don't manage any tool calls and fall over off the bat, other times they manage a few tool calls and then start spewing nonsense. Some can manage sub agents fr a while then fall apart.. I just can't seem to get any consistently decent output on more 'consumer/home pc' type hardware. Mostly been using either pi or OpenCode for this testing.

I didn't see a direct comparison, but there's some overlap in the published benchmarks:

                           │ Qwen 3.6 35B-A3B │ Haiku 4.5               
   ────────────────────────┼──────────────────┼──────────────────────── 
    SWE-Bench Verified     │ 73.4             │ 66.6                    
   ────────────────────────┼──────────────────┼──────────────────────── 
    SWE-Bench Multilingual │ 67.2             │ 64.7                    
   ────────────────────────┼──────────────────┼──────────────────────── 
    SWE-Bench Pro          │ 49.5             │ 39.45                   
   ────────────────────────┼──────────────────┼──────────────────────── 
    Terminal Bench 2.0     │ 51.5             │ 61.2 (Warp), 27.5 (CC)  
   ────────────────────────┼──────────────────┼──────────────────────── 
    LiveCodeBench          │ 80.4             │ 41.92

These are of course all public benchmarks though - I'd expect there to be some memorization/overfitting happening. The proprietary models usually have a bit of an advantage in real-world tasks in my experience.

Artificial Analysis hasn't posted their independent analysis of Qwen3.6 35B A3B yet, but Alibaba's benchmarks paint it as being on par with Qwen3.5 27B (or better in some cases).

Even Qwen3.5 35B A3B benchmarks roughly on par with Haiku 4.5, so Qwen3.6 should be a noticeable step up.

https://artificialanalysis.ai/models?models=gpt-oss-120b%2Cg...

No, these benchmarks are not perfect, but short of trying it yourself, this is the best we've got.

Compared to the frontier coding models like Opus 4.7 and GPT 5.4, Qwen3.6 35B A3B is not going to feel smart at all, but for something that can run quickly at home... it is impressive how far this stuff has come.

I find Gemma 4 26B A4B better than Haiku 4.5 and that's smaller than this one.

Mac has unified memory, so 36GB is 36GB for everything- gpu,cpu.

Do I expect the same memory footprint from an N active parameters as from simply N total parameters?

i dont get it, mac has unified memory how would offloading experts to cpu help?

A usable quant, Q5_KM imo, takes up ~26GB[0], which leaves around ~6-7GB for context and running other programs which is not much.

[0] https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF?show_fil...

context is always an issue with local models and consumer hardware.

on the bright side also worth to keep in mind those tiny models are better than GPT 4.0, 4.1 GPT4o that we used to enjoy less than 2 years ago [1]

[1] https://artificialanalysis.ai/?models=gpt-5-4%2Cgpt-oss-120b...

Metawith the llama series as well,they just didn't manage to keep upping the game after and with llama4.

Can you share more about what adaptations you made when using smaller models?

I'm just starting my exploration of these small models for coding on my 16GB machine (yeah, puny...) and am running into issues where the solution may very well be to reduce the scope of the problem set so the smaller model can handle it.

Electricity (on continental US) is pretty cheap assuming you already have the hardware:

Running at a full load of 1000W for every second of the year, for a model that produces 100 tps at 16 cents per kWh, is $1200 USD.

The same amount of tokens would cost at least $3,150 USD on current Claude Haiku 3.5 pricing.

If you need the heating then it is basically free.

How true is this? How does a regulated industry confirm the model itself wasn't trained with malicious intent?

Why would it matter if the model is trained with malicious intent? It's a pure function. The harness controls security policies.

Then who is Rocinante?

A dimming IBM x40 Thinkpad missing its F key.

Is there any source for these claims?

https://x.com/ChujieZheng/status/2039909917323383036 is the pre-release poll they did. ~397B was not a listed choice and plenty of people took it as a signal that it might not be up for release.

A Qwen research member had a poll on X asking what Qwen 3.6 sizes people wanted to see:

https://x.com/ChujieZheng/status/2039909917323383036

Likely to drive engagement, but the poll excluded the large model size.

How many people/hackernews can run a 397b param model at home? Probably like 20-30.

You can rent a cloud H200 with 140GB VRAM in a server with 256GB system ram for $3-4/hr.

Running the mxfp4 unsloth quant of qwen3.5-397b-a17b, I get 40 tps prefill, 20tps decode.

AMD threadripper pro 9965WX, 256gb ddr5 5600, rtx 4090.

I'm running it on dual DGX Sparks.

I can (barely, but sustainably) run Q3.5 397B on my Mac Studio with 256GB unified. It cost $10,000 but that's well within reach for most people who are here, I expect.

It doesn't matter how many can run it now, it's about freedom. Having a large open weights model available allows you to do things you can't do with closed models.

The people i know that use local models just end up with both.

The local models don’t really compete with the flagship labs for most tasks

The open weights models absolutely compete with flagship labs for most tasks. OpenAI and Anthropic's "cheap tier" models are completely uncompetitive with them for "quality / $" and it's not close. Google is the only one who has remained competitive in the <$5/1M output tier with Flash, and now has an incredibly strong release with Gemma 4.

Unless you have a corporate lock-in/compliance need, there has been no reason to use Haiku or GPT mini/nano/etc over open weights models for a long time now.

It’s easy to find a combination of llama.cpp and a coding tool like OpenCode for these. Asking an LLM for help setting it up can work well if you don’t want to find a guide yourself.

> and finding more value than just renting tokens from Anthropic of OpenAI?

Oh lord, are the LLMs already replacing LLMs?

I do have a $20 claude sub I can fall back to for anything qwen struggles with, but with 3.5 I have been very pleased with the results.

How much VRAM do you need for that?

This is why I'd like to see a lot more focus on batched inference with lower-end hardware. If you just do a tiny amount of tok/day and can wait for the answer to be computed overnight or so, you don't really need top-of-the-line hardware even for SOTA results.

You are right, but this is just the first open-weights model of this family.

They said that they will release several open-weights models, though there was an implication that they might not release the biggest models.

Given that DeepSeek, GLM, Kimi etc have all released large open weight models, I am personally grateful that Qwen fills the mid/small sized model gap even if they keep their largest models to themselves. The only other major player in the mid/small sized space at this point is pretty much only Gemma.

I'm totally fine with that, frankly. I'm blessed with 128GB of Unified Memory to run local models, but that's still tiny in comparison the larger frontier models. I'd much rather get a full array of small and medium sized models, and building useful things within the limits of smaller models is more interesting to me anyway.

The 27B model is dense. Releasing a dense model first would be terrible marketing, whereas 35A3B is a lot smarter and more quick-witted by comparison!

Each has it's pros and cons. Dense models of equivalent total size obviously do run slower if all else is equal, however, the fact is that 35A3B is absolutely not 'a lot smarter'... in fact, if you set aside the slower inference rates, Qwen3.5 27B is arguably more intelligent and reliable. I use both regularly on a Strix Halo system... the Just see the comparison table here: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF . The problem that you have to acknowledge if running locally (especially for coding tasks) is that your primary bottleneck quickly becomes prompt processing (NOT token generation) and here the differences between dense and MOE are variable and usually negligible.

That makes no sense. If you were just going to release the "more hype-able because it's quicker" model then why have a a poll.

"…whereas 35A3B is a lot smarter…"

Must. Parse. Is this a 35 billion parameter model that needs only 3 billion parameters to be active? (Trying to keep up with this stuff.)

EDIT: A later comment seems to clarify:

"It's a MoE model and the A3B stands for 3 Billion active parameters…"

Hacker Times

Hacker Times

Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All

Discussion

Discussion