To be honest, it's sort of what I expected governments to be funding right now, but I suppose Chinese companies are a close second.
the qwen is dead, long live the qwen.
I've been testing Qwen3.5-35B-A3B over the past couple of days and it's a very impressive model. It's the most capable agentic coding model I've tested at that size by far. I've had it writing Rust and Elixir via the Pi harness and found that it's very capable of handling well-defined tasks with minimal steering from me. I tell it to write tests and it writes sane ones, ensuring they pass without cheating. It handles the loop of responding to test and compiler errors while pushing towards its goal very well.
Is there a better agentic coding harness people are using for these models? Based on my experience I can definitely believe the claims that these models are overfit to evals and not broadly capable.
Wild times!
4th March 2026
I’m behind on writing about Qwen 3.5, a truly remarkable family of open weight models released by Alibaba’s Qwen team over the past few weeks. I’m hoping that the 3.5 family doesn’t turn out to be Qwen’s swan song, seeing as that team has had some very high profile departures in the past 24 hours.
It all started with this tweet from Junyang Lin (@JustinLin610):
me stepping down. bye my beloved qwen.
Junyang Lin was the lead researcher building Qwen, and was key to releasing their open weight models from 2024 onwards.
As far as I can tell, a trigger for this resignation was a re-org within Alibaba where a new researcher hired from Google’s Gemini team was put in charge of Qwen, but I’ve not confirmed that detail.
More information is available in this article from 36kr.com. Here’s Wikipedia on 36Kr confirming that it’s a credible media source established in 2010 with a good track record reporting on the Chinese technology industry.
The article is in Chinese—here are some quotes translated via Google Translate:
At approximately 1:00 PM Beijing time on March 4th, Tongyi Lab held an emergency All Hands meeting, where Alibaba Group CEO Wu Yongming frankly told Qianwen employees.
Twelve hours ago (at 0:11 AM Beijing time on March 4th), Lin Junyang, the technical lead for Alibaba’s Qwen large model, suddenly announced his resignation on X. Lin Junyang was a key figure in promoting Alibaba’s open-source AI models and one of Alibaba’s youngest P10 employees. Amidst the industry uproar, many members of Qwen were also unable to accept the sudden departure of their team’s key figure.
“Given far fewer resources than competitors, Junyang’s leadership is one of the core factors in achieving today’s results,” multiple Qianwen members told 36Kr. [...]
Regarding Lin Junyang’s whereabouts, no new conclusions were reached at the meeting. However, around 2 PM, Lin Junyang posted again on his WeChat Moments, stating, “Brothers of Qwen, continue as originally planned, no problem,” without explicitly confirming whether he would return. [...]
That piece also lists several other key members who have apparently resigned:
With Lin Junyang’s departure, several other Qwen members also announced their departure, including core leaders responsible for various sub-areas of Qwen models, such as:
Binyuan Hui: Led Qwen code development, principal contributor to the Qwen-Coder series models, responsible for the entire agent training process from pre-training to post-training, and recently involved in robotics research.
Bowen Yu: Led Qwen post-training research; graduated from the University of Chinese Academy of Sciences; led the development of the Qwen-Instruct series models.
Kaixin Li: Core contributor to Qwen 3.5/VL/Coder, PhD from the National University of Singapore.
Besides the aforementioned individuals, many young researchers also resigned on the same day.
Based on the above it looks to me like everything is still very much up in the air. The presence of Alibaba’s CEO at the “emergency All Hands meeting” suggests that the company understands the significance of these resignations and may yet retain some of the departing talent.
This story hits particularly hard right now because the Qwen 3.5 models appear to be exceptionally good.
I’ve not spent enough time with them yet but the scale of the new model family is impressive. They started with Qwen3.5-397B-A17B on February 17th—an 807GB model—and then followed with a flurry of smaller siblings in 122B, 35B, 27B, 9B, 4B, 2B, 0.8B sizes.
I’m hearing positive noises about the 27B and 35B models for coding tasks that still fit on a 32GB/64GB Mac, and I’ve tried the 9B, 4B and 2B models and found them to be notably effective considering their tiny sizes. That 2B model is just 4.57GB—or as small as 1.27GB quantized—and is a full reasoning and multi-modal (vision) model.
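Those file sizes line up with simple back-of-the-envelope arithmetic. Here's a sketch (the bits-per-parameter figures are assumptions for illustration, not the exact on-disk packing, which includes tokenizer, vision tower, and metadata overhead):

```python
# Approximate on-disk size of a model from its parameter count and precision.
# The quoted 4.57GB / 1.27GB figures for the 2B model roughly match bf16
# (16 bits/param) plus overhead, and a ~5-bit quantization, respectively.

def model_size_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight-file size in GB."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

print(model_size_gb(2, 16))  # bf16: 4.0 GB before overhead
print(model_size_gb(2, 5))   # ~5-bit quant: 1.25 GB
```

The same arithmetic explains the 397B model's 807GB download: roughly two bytes per parameter at bf16.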
It would be a real tragedy if the Qwen team were to disband now, given their proven track record in continuing to find new ways to get high quality results out of smaller and smaller models.
If those core Qwen team members either start something new or join another research lab I’m excited to see what they do next.
It's also driving itself crazy with the deadpool & deadpool-r2d2 crates that it chose during the planning phase.
That said, it does seem to be doing a very good job in general; the code it has created is mostly sane other than this fuss over the database layer, which I suspect I'll have to intervene on. It's certainly doing a better job than other models I'm able to self-host so far.
The main quirk I've found is that it has a tendency to decide halfway through following my detailed instructions that it would be "simpler" to just... not do what I asked, and I find it has stripped all the preliminary support infrastructure for the new feature out of the code.
They also struggle to translate very broad requirements into a set of steps that I find acceptable. Planning helps a lot.
Regarding the harness, I have no idea how much they differ but I seem to have more luck with https://pi.dev than OpenCode. I think the minimalism of Pi meshes better with the limited capabilities of open models.
I think this is part of the model’s success. It’s cheap enough that we’re all willing to let it run for extremely long times. It takes advantage of that by being tenacious. In my experience it will just keep trying things relentlessly until eventually something works.
The downside is that it’s more likely to arrive at a solution that solves the problem I asked but does it in a terribly hacky way. It reminds me of some of the junior devs I’ve worked with who trial and error their way into tests passing.
I frequently have to reset it and start it over with extra guidance. It’s not going to be touching any of my serious projects for these reasons but it’s fun to play with on the side.
That's likely coming from the 3:1 ratio of linear to quadratic attention usage. The latest DeepSeek also suffers from it which the original R1 never exhibited.
I can live with this on my own hardware. Opus 4.6, on the other hand, has developed a tendency to happily chew through the entire 5-hour allowance on the first instruction, going in endless circles. I’ve stopped using it for anything except the extreme planning now.
That sounds too close to what I feel on some days xD
If AI could effectively replace people, you wouldn’t need CEOs to keep trying to convince people.
> Blah blah blah (second-guesses its own reasoning half a dozen times, then goes) Actually, it would be simpler to just ...
Specifically on Antigravity, I've noticed it doing that trying to "save time" to stay within some artificial deadline.
It might have something to do with the system messages and the reinforcement/realignment messages that are interwoven into the context (but never displayed to end-users) to keep the agents on task.
I’m trying to use local models whenever possible. Still need to lean on the frontier models sometimes.
This is my experience with the Qwen3-Next and Qwen3.5 models, too.
I can prompt with strict instructions saying "** DO NOT..." and it follows them for a few iterations. Then it has a realization that it would be simpler to just do the thing I told it not to do, which leads it to the dead end I was trying to avoid.
[0] https://www.reddit.com/r/LocalLLaMA/comments/1rivckt/visuali...
Probably good to send alerts early, but they might be going a bit too early.
> Do you feel you could replace the frontier models with it for everyday coding? Would/will you?
Probably not yet, but it's really good at composing shell commands. For scripting or one-liner generation, the A3B is really good. The web development skills are markedly better than Qwen's prior models in this parameter range, too.
Qwen3.5-35B-A3B means that the model itself consists of 35 billion floating point numbers - very roughly 35GB of data at one byte per parameter - which are all loaded into memory at once.
But... on any given pass through the model weights, only 3 billion of those parameters are "active", i.e. actually have matrix arithmetic applied to them.
This speeds up inference considerably because the computer does fewer operations for each token that is processed. It still needs the full amount of memory, though, as the 3B active parameters are likely different on every pass.
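To make "active parameters" concrete, here is a toy mixture-of-experts forward pass. The sizes, routing, and names are purely illustrative, not Qwen's actual architecture:

```python
import numpy as np

# Toy mixture-of-experts layer: every expert's weights stay resident in
# memory, but each token only runs through the top-k experts it is routed to.
rng = np.random.default_rng(0)
n_experts, d, k = 8, 16, 2
experts = rng.standard_normal((n_experts, d, d))  # all weights loaded at once
router = rng.standard_normal((d, n_experts))

def moe_forward(x):
    logits = x @ router
    topk = np.argsort(logits)[-k:]        # pick k of n_experts for this token
    w = np.exp(logits[topk])
    w /= w.sum()                          # softmax over the selected experts
    # Only k expert matmuls execute; the other n-k experts sit idle this pass.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, topk))

y = moe_forward(rng.standard_normal(d))
print(y.shape)  # (16,)
```

The memory footprint is all `n_experts` weight matrices, but the per-token compute scales with `k / n_experts` - the same shape as the 35B-total / 3B-active split described above.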
That would seem logical, as the results are then completely deterministic, but it turns out that a suboptimal token may result in a better answer in the long run. Also, allowing for a little bit of noise gives the model room to talk itself out of a suboptimal path.
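The greedy-vs-sampling tradeoff can be seen in a few lines (toy logits standing in for a real model's next-token scores):

```python
import numpy as np

# Greedy decoding (temperature -> 0) always picks the argmax, so the output
# is fully deterministic. A nonzero temperature occasionally picks a locally
# "worse" token - the wiggle room described above.
rng = np.random.default_rng(42)
logits = np.array([2.0, 1.5, 0.3])  # scores for three candidate tokens

def sample(logits, temperature):
    if temperature == 0:
        return int(np.argmax(logits))            # deterministic
    p = np.exp(logits / temperature)
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))     # stochastic

greedy = {sample(logits, 0) for _ in range(100)}
sampled = {sample(logits, 1.0) for _ in range(100)}
print(greedy)   # always {0}
print(sampled)  # typically more than one distinct token
```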
In any case, two nines of reliability is not impressive.
There would never be an Anthropic/Pentagon situation in China, because in China there isn't actually separation between the military and any given AI company. The party is fully in control.
But I'll be running this locally for note summarization, code review, and OCR. Very coherent for its size.
I wonder if determinism will be less harmful to diffusion models, because they perform multiple iterations over the response rather than taking a single, no-lookahead shot at each position. I'm looking forward to finding out and have been playing with a diffusion model locally for a few days.
I'm aligning on Agent for the combination of harness + model + context history (so after you fork an agent you now have two distinct agents)
And orchestrator means the system that runs multiple agents together.
This terminology is still very much undefined though, so my version may not be the winning definition.
It's really easy to set up with any OpenAI-compatible API. I self-host Qwen Coder 3 Next on my personal MBP using LM Studio and just dial in from my work laptop with Zed and Tailscale, so I can connect from wherever I might be. It's able to do all sorts of things: run linting checks and tests, look for issues, refactor code, create files, and so on. I'm definitely still learning, but it's a pretty exciting jump from just talking to a chatbot and copying and pasting things manually.
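"Any OpenAI-compatible API" just means the local server accepts the standard chat-completions request shape. A minimal sketch of that body (the model id is hypothetical, and `localhost:1234` is LM Studio's default server address; check what your own server reports):

```python
import json

# A client like Zed or curl would POST this to
# http://localhost:1234/v1/chat/completions on the local server.
payload = {
    "model": "qwen3-coder-next",  # hypothetical local model identifier
    "messages": [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Refactor this function and add tests."},
    ],
    "temperature": 0.2,
    "stream": True,  # stream tokens back as they are generated
}
body = json.dumps(payload)
print(body)
```

Because every local server and frontier provider speaks this same shape, swapping the self-hosted model for a hosted one is usually just a base-URL and model-name change.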
For creative things or exploratory reasoning, a temperature of 0.8 leads to all sorts of excursions down the rabbit hole. However, when coding and needing something precise, a temperature of 0.2 is what I use. If I don’t like the output, I’ll rephrase or add context.