Qwen3.5 is doing ok on my limited tests: https://aibenchy.com
Interesting rabbit hole for me - its AI report mentions Fennec (Sonnet 5) releasing Feb 4 -- I was like "No, I don't think so", then I did a lot of googling and learned that this is a common misperception amongst AI-driven news tools. Looks like there was a leak, rumors, a planned(?) launch date, and .. it all adds up to a confident launch summary.
What's interesting about this is I'd missed all the rumors, so we had a sort of useful hallucination. Notable.
> "In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B with more production features, e.g., 1M context length by default, official built-in tools, and adaptive tool use."
Does anyone know more about this? The OSS version seems to have a 262144 context length; I guess for the 1M they'll ask you to use YaRN?
YaRN, but with some caveats: current implementations might reduce performance on short contexts, so only enable YaRN for long tasks.
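For a rough sense of the numbers, here is a minimal sketch of the scaling factor YaRN would need to stretch the 262144 native window to 1M, and of the "only for long tasks" rule. The `rope_scaling` key names follow the convention Hugging Face configs have used for earlier Qwen releases; whether Qwen3.5 takes exactly these keys is an assumption, so check the model card. `rope_scaling_for` and `yarn_factor` are hypothetical helper names.

```python
import math
from typing import Optional

NATIVE_CTX = 262_144    # context length quoted above for the open-weights model
TARGET_CTX = 1_000_000  # "1M" target; assumed here to mean 1,000,000 tokens


def yarn_factor(native_ctx: int, target_ctx: int) -> float:
    """Smallest whole-number scaling factor that covers target_ctx."""
    return float(math.ceil(target_ctx / native_ctx))


def rope_scaling_for(task_tokens: int, native_ctx: int = NATIVE_CTX) -> Optional[dict]:
    """Return a YaRN config only when the task exceeds the native window.

    Key names are an assumption based on earlier Qwen configs.
    """
    if task_tokens <= native_ctx:
        # Short task: leave RoPE unscaled to avoid the short-context penalty.
        return None
    return {
        "rope_type": "yarn",
        "factor": yarn_factor(native_ctx, task_tokens),
        "original_max_position_embeddings": native_ctx,
    }


print(rope_scaling_for(8_000))
print(rope_scaling_for(TARGET_CTX))
```

The point of returning `None` for short inputs is the caveat above: static scaling applies to every request, so it may be better to switch it on per workload rather than globally.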
Interesting that they're serving both on openrouter, and the -plus is a bit cheaper for <256k ctx. So they must have more inference goodies packed in there (proprietary).
We'll see where the 3rd party inference providers will settle wrt cost.
One big factor for local LLMs is that large context windows will seemingly always require large memory footprints. Without a large context window, you'll never get that Opus 4.6-like feel.
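To put a number on that, here is a back-of-the-envelope KV cache estimate for plain full attention. The layer count, KV head count, and head size below are hypothetical, not Qwen3.5's real configuration, and models that mix in sliding-window or linear-attention layers will cache less than this.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence of seq_len tokens."""
    # Factor of 2: one tensor for keys and one for values, in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

for ctx in (8_192, 262_144, 1_000_000):
    gib = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{ctx:>9} tokens -> {gib:7.1f} GiB of KV cache (16-bit)")
```

The cache grows linearly with context length, so under these assumptions going from 8k to 256k tokens multiplies the footprint by 32, on top of the weights themselves.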
Whatever workflow led to that?
> I might have "dark" mode on on Chrome + MacOS.
Probably that's the reason.
> "I want to wash my car. The car wash is 50m away. Should I drive or walk?"
And some LLMs seem to tell you to walk to the carwash to clean your car... So it's the new strawberry test
Also, don't forget that Mixture of Experts (MoE) models perform better than you'd expect, because only a small part of the model is actually "active" - so e.g. a Qwen3-whatever-80B-A3B would be 80 billion total, but 3 billion active - worth trying if you've got enough system RAM for the 80 billion, and enough VRAM for the 3.
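A quick sketch of that arithmetic for the 80B-A3B example, at a few weight precisions. It ignores quantization overhead, activations, and KV cache, and which experts are active changes from token to token, so treat the "active" figure as what is touched per token rather than a fixed slice that can sit in VRAM. `weight_gb` is a hypothetical helper.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring quantization overhead."""
    return params_billion * bits_per_weight / 8


TOTAL_B, ACTIVE_B = 80, 3  # the 80B-A3B example above

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: all experts ~{weight_gb(TOTAL_B, bits):.1f} GB, "
          f"active per token ~{weight_gb(ACTIVE_B, bits):.1f} GB")
```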
[1]: https://github.com/huggingface/transformers/tree/main/src/tr...
> News
> 2026-02-16: More sizes are coming & Happy Chinese New Year!
Some of the open models have matched or exceeded Sonnet 4.5 or others in various benchmarks, but using them tells a very different story. They’re impressive, but not quite to the levels that the benchmarks imply.
Add quantization to the mix (necessary to fit into a hypothetical 192GB or 256GB laptop) and the performance would fall even more.
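Turning that around: given a memory budget, how many bits per weight can you afford? A minimal sketch using the 397B total from the Qwen3.5-397B-A17B name quoted earlier. The 32 GB reserve for OS, KV cache, and activations is an arbitrary assumption, and `max_bits_per_weight` is a hypothetical helper.

```python
def max_bits_per_weight(params_billion: float, memory_gb: float,
                        reserve_gb: float = 0.0) -> float:
    """Largest average bits-per-weight that fits in memory_gb (GB = 10^9 bytes)."""
    usable_gb = memory_gb - reserve_gb
    if usable_gb <= 0:
        raise ValueError("reserve_gb leaves no room for weights")
    return usable_gb * 8 / params_billion


PARAMS_B = 397  # total parameters, from the model name quoted in the thread

for budget in (192, 256):
    full = max_bits_per_weight(PARAMS_B, budget)
    with_reserve = max_bits_per_weight(PARAMS_B, budget, reserve_gb=32)
    print(f"{budget} GB: up to {full:.2f} bits/weight, "
          f"{with_reserve:.2f} with 32 GB reserved")
```

Under these assumptions a 192 GB machine lands below 4 bits per weight before leaving any room for context, which is the range where quality loss tends to become a real concern.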
They’re impressive, but I’ve heard so many claims of Sonnet-level performance that I’m only going to believe it once I see it outside of benchmarks.
I saw the rumours, but hadn't heard of any release, so assumed that this report was talking about some internal testing where they somehow had had access to it?
Bizarre
I suggest to start using a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D
At 80B, you could do 2 A6000s.
What device is 128gb?
People can always distill them.
I'm sure it can do 2+2= fast
After that? No way.
There is a reason NVIDIA is #1 and my fortune 20 company did not buy a macbook for our local AI.
What inspires people to post this? Astroturfing? Fanboyism? Post Purchase remorse?
Download every github repo
  -> Classify if it could be used as an env, and what types
    -> Issues and PRs are great for coding rl envs
    -> If the software has a UI, awesome, UI env
    -> If the software is a game, awesome, game env
    -> If the software has xyz, awesome, ...
  -> Do more detailed run checks,
    -> Can it build
    -> Is it complex and/or distinct enough
    -> Can you verify if it reached some generated goal
    -> Can generated goals even be achieved
    -> Maybe some human review - maybe not
  -> Generate goals
    -> For a coding env you can imagine you may have a LLM introduce a new bug and can see that test cases now fail. Goal for model is now to fix it
... Do the rest of the normal RL env stuff
It's basically the same as with the Qwen2.5 and 3 series but this time with 1M context and 200k native, yay :)
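Coming back to the bug-injection goal in the RL env pipeline above, here is a toy sketch of the verification part: a task is only usable if the tests pass before the bug and fail after it, and the reward is whether the tests pass again. Everything here is hypothetical. `buggy_add` stands in for an LLM-introduced bug, and a real env would run a repo's test suite in a sandbox rather than call Python functions directly.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[Tuple[int, int], int]  # ((a, b), expected result)


def reference_add(a: int, b: int) -> int:
    """The original, correct code from the repo."""
    return a + b


def buggy_add(a: int, b: int) -> int:
    """Stand-in for an injected bug: off by one when b is negative."""
    return a + b + (1 if b < 0 else 0)


TESTS: List[TestCase] = [((1, 2), 3), ((0, 0), 0), ((5, -2), 3)]


def pass_rate(fn: Callable[[int, int], int], tests: List[TestCase]) -> float:
    """Fraction of test cases that fn gets right."""
    passed = sum(1 for args, expected in tests if fn(*args) == expected)
    return passed / len(tests)


def is_valid_task(original, mutated, tests) -> bool:
    """Usable task: tests pass on the original and at least one fails on the mutant."""
    return pass_rate(original, tests) == 1.0 and pass_rate(mutated, tests) < 1.0


def reward(candidate_fix, tests) -> float:
    """Verifiable reward: 1.0 only if every test passes again."""
    return 1.0 if pass_rate(candidate_fix, tests) == 1.0 else 0.0


print("valid task:", is_valid_task(reference_add, buggy_add, TESTS))
print("reward for leaving the bug in:", reward(buggy_add, TESTS))
print("reward for a correct fix:", reward(reference_add, TESTS))
```

The `is_valid_task` check covers the "can generated goals even be achieved" and "can you verify" items: a bug the tests never catch produces no learning signal.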
This isn't a reason not to use Qwen. It just means having a sense of the constraints it was developed under. Unfortunately, populist political pressure to rewrite history is being applied to the American models as well. This means it's on us to apply reasonable skepticism to all models.
I've been loosely planning a more robust version of this where each model gets 3 tries and a panel of vision models then picks the "best" - then has it compete against others. I built a rough version of that last June: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...
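The "3 tries, panel picks the best" step could look roughly like this. It is a sketch only: the judges below are trivial string heuristics standing in for vision models, which in a real version would render each SVG and be asked which image is best. The model-versus-model competition is not covered.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A judge looks at all attempts and returns the index of the one it prefers.
Judge = Callable[[Sequence[str]], int]


def pick_best(attempts: Sequence[str], judges: Sequence[Judge]) -> int:
    """Majority vote over the panel; ties go to the earliest attempt."""
    votes = Counter(judge(attempts) for judge in judges)
    return min(votes, key=lambda i: (-votes[i], i))


# Hypothetical stand-in judges (real ones would be vision-model calls).
def prefers_longest(attempts: Sequence[str]) -> int:
    return max(range(len(attempts)), key=lambda i: len(attempts[i]))


def prefers_first(attempts: Sequence[str]) -> int:
    return 0


def prefers_most_paths(attempts: Sequence[str]) -> int:
    return max(range(len(attempts)), key=lambda i: attempts[i].count("<path"))


ATTEMPTS: List[str] = [
    "<svg><circle/></svg>",
    "<svg><path/><path/></svg>",
    "<svg><path/></svg>",
]
PANEL: List[Judge] = [prefers_longest, prefers_first, prefers_most_paths]

print("winning attempt index:", pick_best(ATTEMPTS, PANEL))
```

An odd-sized panel keeps outright ties rare; the deterministic tie-break just makes reruns reproducible.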
So we might have an outer alignment failure.
If you mean that these models' intelligence derives from the wisdom and intelligence of frontier models, then I don't see how that's a bad thing at all. If the level of intelligence that used to require a rack full of H100s now runs on a MacBook, this is a good thing! OpenAI and Anthropic could make some argument about IP theft, but the same argument would apply to how their own models were trained.
Running the equivalent of Sonnet 4.5 on your desktop is something to be very excited about.
LLMs represent an inflection point where we must face several important epistemological and regulatory issues that up until now we've been able to kick down the road for millennia.
So if there is a single good "pelican on a bike" image on the internet or even just created by the lab and thrown on The Model Hard Drive, the model will make a perfect pelican bike svg.
The reality of course, is that the high water mark has risen as the models improve, and that has naturally lifted the boat of "SVG Generation" along with it.
Benchmaxxing is the norm in open weight models. It has been like this for a year or more.
I’ve tried multiple models that are supposedly Sonnet 4.5 level and none of them come close when you start doing serious work. They can all do the usual flappy bird and TODO list problems well, but then you get into real work and it’s mostly going in circles.
Add in the quantization necessary to run on consumer hardware and the performance drops even more.
The question in case of quants is: will they lobotomize it beyond the point where it would be better to switch to a smaller model like GPT-OSS 120B that comes prequantized to ~60GB.
this part is nontrivial though
It can notably run some of the best open weight models with little power and without triggering its fan.
So then the next next version is even better, because it got more data / better data. And it becomes better...
This is mainly why we're seeing so many improvements, so fast (month to month, from every 3 months ~6 months ago, from every 6 months ~1 year ago). It becomes a literal "throw money at the problem" type of improvement.
For anything that's "verifiable" this is going to continue. For anything that is not, things can also improve with concepts like "llm as a judge" and "council of llms". Slower, but it can still improve.
But there are topics that ChatGPT hard blocks just like Qwen [1].
[1] https://www.independent.co.uk/tech/chatgpt-ai-david-mayer-op...
In the thinking section it didn't really register the car and washing the car as being necessary, it solely focused on the efficiency of walking vs driving and the distance.
This is why I'm personally waiting for M5/M6 to finally have some decent prompt processing performance, it makes a huge difference in all the agentic tools.
This is how I know something is fishy.
No one cares about this. This became a new benchmark when Apple couldn't compete anywhere else.
I understand if you already made the mistake of buying something that doesn't perform as well as you were expecting, you are going to look for ways to justify the purchase. "It runs with little power" is on 0 people's christmas list.
It’s also good value if you want a lot of memory.
What would you advise for people with a similar budget? It’s a real question.
There is novelty, but not practical use case.
My $700, 2023, 3060 laptop runs 8B models. At the enterprise level we got 2 A6000s.
Both are useful and were used for economic gain. I don't think you have gotten any gain.