This field is moving at an incredible pace; the providers release a new model every quarter or so. The amount of criticism is a bit overblown in my opinion. The benchmarks still look very good to me. I’ve used GLM-5 (latest is GLM-5.1) and Kimi K2.5; they’re decent and get the job done, so seeing how this Qwen model performs compared to them is kinda impressive.
Also, why are so many people pointing out that this model is not open-weight, as if this were the first time Qwen has done so? Qwen-3.5-Plus and Qwen-3-Max are also closed-source. This is not something new.
I think Qwen trying to catch up to the SOTA models is still healthy for us, the consumers. Sure, it's sad news that this version is closed-weight, but I won’t downplay their progress.
Comparing to Opus 4.5 instead of the current 4.6 and other last-gen models is clearly an attempt to deceive, which isn’t winning them any points either.
I think there is a moderately large market for models like this that aren’t quite SOTA level but can be served up much cheaper. I don’t know how successful they’ll be in the race to the bottom in this market niche, though. Most users of cheap API tokens are not loyal to any brand and will change providers overnight each time someone releases a slightly better model.
I can remember how good Opus 4.5 was. If I'm considering using this, it's most informative to me to compare to the model it's closest to that I have familiarity with.
I'm obviously not switching to this if I want the best model. I'm switching if I'm hopeful that the smaller versions are close to it, or if I want to have more options for providers, or for any other reasons unrelated to getting the highest quality responses possible.
I used the https://modelstudio.alibabacloud.com/ API to generate that one, which required signing up for an account and attaching PayPal billing - but it looks like OpenRouter are offering it for free right now so I could have used that: https://openrouter.ai/qwen/qwen3.6-plus:free
As always, we'll have to try it and see how it performs in the real world, but Qwen's open-weight models were pretty decent for some tasks, so I'm still excited to see what this brings.
Opus 4.5 is $25/m output tokens.
This is at most $6/m output tokens.
That's ~1/4 the price.
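That ratio checks out. A quick sketch of the arithmetic, using only the output-token prices quoted above and ignoring input tokens for simplicity (the 50M-token workload is an illustrative number, not from the thread):

```python
def output_cost_usd(output_tokens: int, price_per_million: float) -> float:
    """Cost of generating `output_tokens` at a given $/1M-output-token price."""
    return output_tokens / 1_000_000 * price_per_million

# Hypothetical workload: 50M output tokens in a month.
tokens = 50_000_000
opus = output_cost_usd(tokens, 25.0)   # $1250.00
qwen = output_cost_usd(tokens, 6.0)    # $300.00
print(opus, qwen, qwen / opus)         # ratio = 0.24, i.e. ~1/4
```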
I like Qwen local for its privacy, but I trust the privacy of Google/OpenAI/Anthropic more than Alibaba's.
Right, they state that they'll release "smaller" variants openly at some point, with few details as to what that means. Will there be a ~300B variant as with Qwen 3.5? The blog post doesn't say.
The naivety around this has been staggering, quite frankly. All of a sudden, people think that Meta etc. are releasing free models because they believe in open access and the distribution of knowledge. No, they just suck comparatively. There is nothing to sell. Using it to recruit and generate attention is the best play for them.
Sure they are not cheap to train. But if open weight models continue to be trained and continue to become available on cheaper hardware, how do dedicated AI companies protect their margins?
There isn't, pretty much everyone wants the best of the best.
There's nothing really strange about not competing directly with the best, but rather showing who you're as good as.
"[...] In the coming days, we will also open-source smaller-scale variants, reaffirming our commitment to accessibility and community-driven innovation. [...]"
This means a 100k token request counts the same as a 100-token one. I’ve made about 8000 requests in the last two weeks, averaging around 80k tokens per request. It feels like they’re subsidizing this just to gather data on agentic workflows.
On the downside, the speed is mediocre (15–30 tg/s for GLM-5), and I’ve seen the model glitch or produce broken output about 10 times out of those 8k requests.
None should be trusted, unless you are running them locally.
This is how I view that the public can fund and eventually get free stuff, just like properly organized private highways end up with the state/society owning a new highway after the private entity that built it got the profits they required to make the project possible.
There are a lot of data science problems that benefit from running the dataset through an LLM, which becomes bottlenecked on per-token costs. For these you take a sample subset and run it against multiple providers and then do a cost versus accuracy tradeoff.
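A minimal sketch of that sample-then-compare approach. Everything here is illustrative: `run_provider` is a simulated stand-in for a real API call, and the accuracy and price numbers are made up, not measured:

```python
import random

random.seed(0)

def run_provider(name: str, accuracy: float, example: str) -> bool:
    """Simulated call: did the provider label this example correctly?
    A real version would call the provider's API and compare to ground truth."""
    return random.random() < accuracy

providers = {
    # name: (simulated accuracy, $ per 1M output tokens) -- illustrative only
    "sota-model":  (0.95, 25.0),
    "cheap-model": (0.88, 6.0),
}

sample = [f"record-{i}" for i in range(500)]   # sampled subset of the dataset
tokens_per_record = 2_000                       # rough output budget per record

for name, (acc, price) in providers.items():
    correct = sum(run_provider(name, acc, ex) for ex in sample)
    cost = len(sample) * tokens_per_record / 1_000_000 * price
    print(f"{name}: {correct/len(sample):.1%} on sample, ${cost:.2f} for the sample")
# Extrapolate the sample cost to the full dataset, then pick the cheapest
# provider whose accuracy clears your threshold.
```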
The market for API tokens is not just people using OpenCode and similar tools.
Coding is a rung on the ladder of model capability. Frontier models will grow to take on more capabilities, while smaller, more focused models become the economical choice for coding.
They posted charts with logos for Claude and others. You had to read the fine details to realize they weren’t comparing to the latest offerings from those companies. They were counting on you not noticing.
There’s zero reason to compare to old models unless you’re trying to mislead.
Now they show their true colors. They want to train models on our engineering to replace us, while simultaneously giving nothing back? No thanks. I'd rather fund the shitty US hyperscalers. At least that leads to jobs here.
If there's a company willing to develop and foster large-scale weights in the open, I'll adopt their tooling 100%. It doesn't matter if they're a year behind. Just do it open and build an entire ecosystem on top of it.
The re-AOLization of the internet into thin clients is bullshit, and all it takes is one player to buck the rules to topple the whole house of cards.
Now, is it mildly deceptive because all of the companies use incredibly confusing naming conventions for their models? Maybe!
Apparently that wasn't actually the play here.
For direct user interaction or coding problems, perhaps. But as API calls get cheaper, it becomes more realistic to use them for completely automated workflows against datasets, or as sub-agents called from expensive SOTA models.
For example, in Claude, using Opus as an orchestrator to call Sonnet sub-agents is a popular usage "hack." That only gets more powerful as the Sonnet-equivalent model gets cheaper. Now you can spawn entire teams of small specialized sub-agents with small context windows but limited scope.
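A conceptual sketch of that orchestrator/sub-agent pattern. `call_model`, the model names, and the task decomposition are all hypothetical stand-ins, not a real API:

```python
def call_model(model: str, prompt: str) -> str:
    # Stub so the sketch runs; a real version would hit the provider's API.
    return f"[{model}] result for: {prompt[:40]}"

def orchestrate(task: str) -> str:
    # The expensive model decomposes the task once (stand-in for a real plan)...
    subtasks = [f"{task} -- part {i}" for i in range(1, 4)]
    # ...then cheap sub-agents, each with a small focused context, do the legwork.
    results = [call_model("cheap-sub-agent", st) for st in subtasks]
    # The expensive model only ever sees the summaries, not the raw context.
    return call_model("expensive-orchestrator", "synthesize: " + " | ".join(results))

print(orchestrate("audit the auth module"))
```

The economics only work because the cheap model handles the bulk of the tokens while the expensive one sees a compressed view.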
In other words, like GP said, this Qwen3.6-Plus model is not open-weight unlike the other Qwen models.
- Qwen3.5-Plus
- Qwen3-Max
- Qwen2.5-Max
etc. Nothing really changed so far.
In any case, setting aside Claude fanboyism, having other players inch closer to similar performance is always useful. Even if they are "6 months behind" as the pace slows down, this guarantees that there's no huge moat and they'll eventually either get to where the SOTA is, or the difference won't be that big.
I'd rather put fewer eggs in 2-3 big player baskets.
That's a very reasonable stance. It doesn't change the fact that we do have plenty of local models (up to and including Qwen 3.5) that are still quite useful.
I don't think any org doing this is necessarily being deceptive, so long as there's some reasonable basis for the chosen comparable(s).
For example, comparing a new iPhone to a prior Android phone might make sense if the installed base is considerable and Apple is targeting that cohort for user acquisition. (~"These benchmarks are not for you.")
The community will always run the numbers and get the clicks for the benchmarks the first party didn't fill in. I noticed what appeared to be some movement from Apple to get ahead of this in recent product content.
Qwen is not the only Chinese lab, and the others have shown no change in their commitment to open source. Allegedly Qwen hasn't either, if their recent statements are to be believed. They're just hoping to capture market share with *-claw customers before releasing an open-weights version. We'll have to wait and see how long it takes before they decide to release that.
The US actually has laws around this, and they aren't sharing very much with the US government today. China shares 100% as required by law. And neither cares much about "how long do I cook eggs for", but they do care about code generation a lot.
I'd prefer them to be open weight, but I'd love to sub a decent competitive coding plan from a European or Chinese provider. Right now they're not quite there. If closing it and charging for it brings them closer to competitive, that's ok.
If the US tech and AI industry long term wants customers and a broad market outside of their own domestic base, they need to reconsider who they are bending the knee to, and how they are defining their policies in relation to the Trump administration.
Bring on the Chinese competition.
Laziness? Lack of time? It's not like the latest generation of the SOTA models were released yesterday.
I did create my own MCP with custom agents that combine several tools into a single one. For example, WebSearch, WebFetch, and Context7 all exposed as a single "web research" tool, backed by the cheapest model that passes evaluation. The same for codebase research.
Using it with both Claude and OpenCode saves a lot of time and tokens.
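The tool-collapsing idea can be sketched without any MCP machinery: one dispatcher function presents a single tool surface, and the sub-tools behind it are stubs here (a real version would call the actual WebSearch/WebFetch/Context7 backends and a cheap model). All names below are hypothetical:

```python
def web_search(query: str) -> str:
    return f"search results for {query!r}"   # stub for a real search backend

def web_fetch(url: str) -> str:
    return f"contents of {url}"              # stub for a real fetcher

def docs_lookup(library: str) -> str:
    return f"docs for {library}"             # stub for a docs provider

def web_research(request: dict) -> str:
    """Single tool surface: the agent sends one structured request instead of
    choosing among three tools, which saves tool-selection tokens and context."""
    kind = request["kind"]
    if kind == "search":
        return web_search(request["query"])
    if kind == "fetch":
        return web_fetch(request["url"])
    if kind == "docs":
        return docs_lookup(request["library"])
    raise ValueError(f"unknown request kind: {kind}")

print(web_research({"kind": "search", "query": "rate limiting patterns"}))
```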
There are many simpler tasks that would work fine with a simpler, local model.
However, my hope is that there will be at least somewhat competitive big and open models as well, from an ethical/ideological perspective. These things were trained on data that was provided by people without their consent, so they should at least be publicly accessible or even public domain.
Almost all means there have been ones before that were not open. So, no contradiction there.
Please send the download link for qwen 3.5-plus.
Also, who cares? If you have the hardware to run a ~400b model, I don't think you count as a home user anymore.
At least from my experience and friends of mine, we use OpenRouter for cases where we want to use smaller LLMs like Qwen, but when I've used ChatGPT and Claude, I use those APIs directly.
I have no affiliation with DeepInfra. I use them, because they host open-source models that are good.
products entirely disappearing or significantly changing will be more and more common in the llm arena as things move toward companies shutting down, bubbles deflating, brand priorities drastically reshifting, etc...
i think we're at, or at least close to, a time to really put some thought into which pieces of our flow could be done entirely with an open/local model, and be honest with ourselves about which pieces truly need sota or closed models that may entirely disappear or change. in the long run, putting a little bit of thought into this now will save a lot of headache later.
Seems like a huge waste of money and electricity for processes that can be implemented as a traditional deterministic program. One would hope that tools would identify recurrent jobs that can be turned into simple scripts.
I wouldn't call this totally accurate, especially as of late. What's closer to the truth however is that there's lots of second-rate players in China doing open models, that will be getting a lot more attention from local AI proponents if the big names seriously slow down their AI releases. The local AI scene as a whole is quite healthy.
It's not that, it's about relative risk to your own life. Asking questions about "DEI" for example is much more likely to have adverse effects on your life if you ask Grok or an OpenAI chatbot, though still not that likely.
And the US government has repeatedly shown that it is very interested in collecting all the data available, just like China. In China this is simply done in the open, while the US has a veneer of protection for citizens. But where the data collection is forbidden by law, they either ignore the law or ask another Five Eyes member to do the spying and share the results. Both are well documented.
But with LLMs, how do you know switching from one to another won’t change some behavior your system was implicitly relying on?
For example: "Here our dataset that contains customer feedback comment fields; look through them, draw out themes, associations, and look for trends." Solving that with a deterministic program isn't a trivial problem, and it is likely cheaper solved via LLM.
I don't know whether you really believe this or it was an off the cuff remark. China is not going to tell you why they plan to arrest you. China is not a benevolent dictatorship.