Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

Going to have to disagree on the backup test. Opus flamingo is actually on the pedals and seat with functional spokes and beak. In terms of adherence to physical reality Qwen is completely off. To me it's a little puzzling that someone would prefer the Qwen output.

I'd say the example actually does (vaguely) suggest that Qwen might be overfitting to the Pelican.

I understand the 'fun factor', but at this point I really wonder what this pelican still proves. I mean, providers certainly could have adapted for it if they wanted, and if you want to test how well a model adapts to potentially out-of-distribution contexts, it might be more worthwhile to mix different animals with different activity types (a whale on a skateboard) than to always use the same one.
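One way to act on that suggestion, as a minimal sketch: draw random animal/activity pairs so that no single prompt can be trained for. The lists below are made-up examples (only the whale on a skateboard comes from the comment), `make_prompts` is a hypothetical helper rather than part of any existing benchmark, and the prompt wording copies the "Generate an SVG of a ..." phrasing quoted in the article.

```python
import random

# Illustrative pools only; extend or replace them freely.
ANIMALS = ["pelican", "flamingo", "whale", "turtle", "iguana"]
ACTIVITIES = [
    "riding a bicycle",
    "riding a unicycle",
    "on a skateboard",
    "driving a racing car",
]

def make_prompts(n, seed=None):
    """Return n distinct prompts built from random animal/activity pairs."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    combos = [(animal, activity) for animal in ANIMALS for activity in ACTIVITIES]
    picked = rng.sample(combos, n)  # sampling without replacement: no repeats
    return [f"Generate an SVG of a {animal} {activity}" for animal, activity in picked]

if __name__ == "__main__":
    for prompt in make_prompts(5, seed=2026):
        print(prompt)
```

Rotating the seed per model release would keep the combinations fresh while still letting anyone reproduce a given run.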

Such a disconnect from the minutes I’ve lost and given up on Gemini trying to get it to update a diagram in a slide today. The one shot joke stuff is great but trying to say “that is close but just make this small change” seems impossible. It’s the gap between toy and tool.

For coding, qwen 3.6 35b a3b solved 11/98 of the Power Ranking tasks (best-of-two), compared to 10/98 for the same size qwen 3.5. So it's at best very slightly improved and not at all in the class of qwen 3.5 27b dense (26 solved) let alone opus (95/98 solved, for 4.6).
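For scale, here is a quick sketch turning those counts into pass rates. It assumes the 26 tasks solved by the 27b dense model are out of the same 98 tasks, which the comment implies but does not state.

```python
# Solved-task counts quoted in the comment above (best-of-two).
TOTAL_TASKS = 98  # assumption: the same task set applies to every model listed

solved = {
    "qwen 3.6 35b a3b": 11,
    "qwen 3.5 35b a3b": 10,
    "qwen 3.5 27b dense": 26,
    "opus 4.6": 95,
}

def pass_rate(n_solved, total=TOTAL_TASKS):
    """Percentage of tasks solved."""
    return 100 * n_solved / total

if __name__ == "__main__":
    for model, n in solved.items():
        print(f"{model:<20} {n:>2}/{TOTAL_TASKS}  {pass_rate(n):5.1f}%")
```

That puts the two 35b a3b models at roughly 10 to 11 percent and Opus 4.6 near 97 percent.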

That's not surprising; Opus & Sonnet have been regressing on many non-coding tasks since about the 4.1 release in our testing

I really wish they spent some time training for computer use. This model is incapable of finding anywhere near the correct x,y coordinate of a simple object in a picture.

I'm an iguana and need to wash my bicycle in the carwash. Shall I walk or take the bus?

I've been using Qwen3.5-35B-A3B for a bit via open code and oMLX on an M5 Max with 128GB of RAM and I have to say it's impressively good for a model of that size. I've seen a huge jump in the quality of the tool calls and how well it handles the agentic workflow.

I'm really curious about what competes with Claude Code to drive a local LLM like Qwen 3.6?

That Qwen flamingo on the unicycle is actually quite good. A work of art.

I'm currently testing Qwen3.6-35B-A3B with https://swival.dev for security reviews.

It's pretty good at finding bugs, but not so good at writing patches to fix them.

FYI, using a 128GB M5 MacBook Pro, sourced from another article by the author.

I literally cannot believe that people are wasting their time doing this either as a benchmark or for fun. After every single language model release, no less.

How about switching to MechaStalin on a tricycle? It gets kind of boring.

Qwen's flamingo is artistically far more interesting. It's a one-eyed flamingo with sunglasses and a bow tie who smokes pot. Meanwhile Opus just made a boring, somewhat dorky flamingo. Even the ground and sky are more interesting in Qwen's version

But in terms of making something physically plausible, Opus certainly got a lot closer

Even the first one: sure, Qwen added extra details in the background. But the pelican itself is a stork with a bent beak, and its feet are cut off its legs. While impressive for a local model, I don't think it's a winner.

Given adherence is a more significant practical barrier, it's probably the better signal. That is, if we decide to look for signal here.

You're comparing a tiny model for local inference against a proprietary, expensive frontier model. It would be fairer to compare against a similarly priced model or tiny frontier models like Haiku, Flash or GPT nano.

That’s a long walk! You should reserve a ride with $PartnerRideshareCo.

You should have the pelican ride it to the carwash and wash it for you.

They're certainly aware of the test, but a turtle doing a kickflip on a skateboard? I seriously doubt they train their models for that.

https://x.com/JeffDean/status/2024525132266688757

If anything, the disastrous Opus4.7 pelican shows us they don't pelicanmaxx

That's why I did the flamingo on a unicycle.

For a delightful moment this morning I thought I might have finally caught a model provider cheating by training for the pelican, but the flamingo convinced me that wasn't the case.

This is about the newly released Qwen3.6. Just wanted to make sure you got that right.

It feels like the results stopped being interesting a little while ago but the practice has become part of simonw's brand, and it gives him something to post even when there is nothing interesting to say about another incremental improvement to a model, and so I don't imagine he'll stop.

Boring? The ways all the models fail at a simple task never get boring to me.

Not when the article they're commenting on was doing literally exactly the same thing.

Eh it’s important perspective, lest someone start thinking they can drop $5k on a laptop and be free of Anthropic/OpenAI. Expensive lesson.

It is completely wild to me that you prefer Qwen's flamingo. I think it's really bad and Opus' is pretty good.

r/LocalLlama is now doing a horse in a racing car:

https://redd.it/1slz38i

To me the opus flamingo is waaaay better than the qwen one. qwen has the better pelican, though.

Is a flamingo on a unicycle not merely a special case of a pelican on a bicycle?

I, for one, expected progress. Uneven, sometimes delayed, but ever increasing progress.

But that Opus pelican?

The Opus one doesn't even have a bowtie.

The Opus one looks like a flamingo, and looks like it's riding the unicycle. Sitting on the seat. Feet on the pedals.

The Qwen one looks like a 3-tailed, broken-winged, beakless (I guess? Is that offset white thing a beak? Or is it chewing on a pelican feather like it's a piece of straw?) monstrosity not sitting on the seat, with its one foot off the pedal (the other chopped off at the knee) of a malmanufactured wheel that has bonus spokes that are longer than the wheel.

But yeah, it does have a bowtie and sunglasses that you didn't ask for! Plus it says "<3 Flamingo on a Unicycle <3", which perhaps resolves all ambiguity.

16th April 2026

For anyone who has been taking my pelican riding a bicycle benchmark seriously as a robust way to test models, here are pelicans from this morning’s two big model releases—Qwen3.6-35B-A3B from Alibaba and Claude Opus 4.7 from Anthropic.

Here’s the Qwen 3.6 pelican, generated using this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf quantized model by Unsloth, running on my MacBook Pro M5 via LM Studio (and the llm-lmstudio plugin)—transcript here:

The bicycle frame is the correct shape. There are clouds in the sky. The pelican has a dorky looking pouch. A caption on the ground reads Pelican on a Bicycle!

And here’s one I got from Anthropic’s brand new Claude Opus 4.7 (transcript):

The bicycle frame is entirely the wrong shape. No clouds, a yellow sun. The pelican is looking behind itself, and has a less pronounced pouch than I would like.

I’m giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame!

I tried Opus a second time passing thinking_level: max. It didn’t do much better (transcript):

The bicycle frame is entirely the wrong shape but in a different way. Lines are more bold. Pelican looks a bit more like a pelican.

I don’t think Qwen are cheating

A lot of people are convinced that the labs train for my stupid benchmark. I don’t think they do, but honestly this result did give me a little glint of suspicion. So I’m burning one of my secret backup tests—here’s what I got from Qwen3.6-35B-A3B and Opus 4.7 for “Generate an SVG of a flamingo riding a unicycle”:

I’m giving this one to Qwen too, partly for the excellent SVG comment.
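For anyone repeating this kind of comparison across saved transcripts, a first step is pulling the SVG out of each response and confirming it parses. What follows is only a sketch: `extract_svg` and `svg_summary` are hypothetical helpers, not anything used to produce the images above, and the regex assumes a single, non-nested `<svg>` element per response.

```python
import re
import xml.etree.ElementTree as ET

def extract_svg(response_text):
    """Return the first <svg>...</svg> span in a model response, or None."""
    match = re.search(r"<svg\b.*?</svg>", response_text, flags=re.DOTALL | re.IGNORECASE)
    return match.group(0) if match else None

def svg_summary(svg_text):
    """Parse the SVG and count elements by tag name.

    Raises xml.etree.ElementTree.ParseError if the markup is not well-formed.
    """
    root = ET.fromstring(svg_text)
    counts = {}
    for element in root.iter():
        # Tags look like "{http://www.w3.org/2000/svg}circle"; keep the local name.
        name = element.tag.split("}")[-1]
        counts[name] = counts.get(name, 0) + 1
    return counts

if __name__ == "__main__":
    sample = (
        "Here you go:\n"
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">'
        '<!-- wheel --><circle cx="5" cy="5" r="4"/>'
        '<line x1="5" y1="5" x2="9" y2="5"/></svg>'
    )
    print(svg_summary(extract_svg(sample)))
```

Well-formedness says nothing about whether the bird is on the seat, so this only filters out broken output before a human looks at the picture.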

What can we learn from this?

The pelican benchmark has always been meant as a joke—it’s mainly a statement on how obtuse and absurd the task of comparing these models is.

The weird thing about that joke is that, for the most part, there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models. Those first pelicans from October 2024 were junk. The more recent entries have generally been much, much better—to the point that Gemini 3.1 Pro produces illustrations you could actually use somewhere, provided you had a pressing need to illustrate a pelican riding a bicycle.

Today, even that loose connection to utility has been broken. I have enormous respect for Qwen, but I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic’s latest proprietary release.

If the thing you need is an SVG illustration of a pelican riding a bicycle though, right now Qwen3.6-35B-A3B running on a laptop is a better bet than Opus 4.7!

Hacker Times
