Although it doesn't really matter much. All of the open weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.
Why is GLM 5 more expensive than GLM 4.7 even when using sparse attention?
There is also a GLM-5-Coder model.
I am still waiting to see if they'll launch a GLM-5 Air series, which would run on consumer hardware.
> GLM-5 can turn text or source materials directly into .docx, .pdf, and .xlsx files—PRDs, lesson plans, exams, spreadsheets, financial reports, run sheets, menus, and more.
A new type of model has joined the series, GLM-5-Coder.
GLM-5 was trained on Huawei Ascend chips. The last time DeepSeek tried to use this chip it flopped, and they went back to Nvidia. This time it seems to have been a success.
Looks like they also released their own agentic IDE, https://zcode.z.ai
I don’t know if anyone else knows this, but Z.ai also released new tools besides the chat! There’s Zread (https://zread.ai), OCR (seems new? https://ocr.z.ai), GLM-Image gen (https://image.z.ai) and voice cloning (https://audio.z.ai).
If you go to chat.z.ai, there is a new toggle in the prompt field: you can now switch between chat and agentic mode. It is only visible when you switch to GLM-5.
Very fascinating stuff!
Certainly seems to remember things better and is more stable on long running tasks.
Claude Opus 4.6: 65.5%
GLM-5: 62.6%
GPT-5.2: 60.3%
Gemini 3 Pro: 59.1%
Betting on whether they can actually perform their sold behaviors.
Passing around code repositories for years without ever trying to run them, factory sealed.
Solid bird, not a great bicycle frame.
If it's anywhere close to those models, I couldn't possibly be happier. Going from GLM-4.7 to something comparable to 4.5 or 5.2 would be an absolutely crazy improvement.
Agreed. I think the problem is that while they can innovate at algorithms and training efficiency, the human part of RLHF just doesn't scale and they can't afford the massive amount of custom data created and purchased by the frontier labs.
IIRC it was the application of RLHF that solved a lot of the broken syntax generated by LLMs, like unbalanced braces, and I still see lots of these little problems in every open-source model I try. I don't think I've seen broken syntax from the frontier models, Codex or Claude, in over a year.
So far I haven't managed to get comparably good results out of any other local model including Devstral 2 Small and the more recent Qwen-Coder-Next.
US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallows imports of Nvidia chips as a direct result of past sanctions [3]. And this at a time when the Trump admin is trying to do whatever it can to reduce the US trade imbalance with China.
[1] https://tech.yahoo.com/ai/articles/chinas-ai-startup-zhipu-r...
[2] https://www.techradar.com/pro/chaos-at-deepseek-as-r2-launch...
[3] https://www.reuters.com/world/china/chinas-customs-agents-to...
In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.
For those who are curious, the benchmark is just the ability of the model to follow a custom tool-calling format. I give it coding tasks using chat.md [1] + MCPs. And so far it's just not able to follow the format at all.
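For the curious, here is a hypothetical illustration of what that kind of format check can look like. Everything below (the fence name, the fields) is made up for illustration; the real format is whatever chat.md defines:

    # Hypothetical illustration only: the model is instructed to emit tool
    # calls as fenced blocks like the one matched below, and the harness
    # checks that its output actually parses.
    import json
    import re

    TOOL_CALL_RE = re.compile(
        r"```tool_call\s*\nname:\s*(?P<name>\w+)\s*\nargs:\s*(?P<args>\{.*?\})\s*\n```",
        re.DOTALL,
    )

    def follows_format(model_output: str) -> bool:
        # Pass iff at least one well-formed tool call is present and its
        # args field is valid JSON.
        m = TOOL_CALL_RE.search(model_output)
        if m is None:
            return False
        try:
            json.loads(m.group("args"))
        except json.JSONDecodeError:
            return False
        return True

Models that ignore the custom format and fall back to their native tool-call syntax fail every task, which is exactly the failure mode being described.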
We already know that intelligence scales with the log of tokens used for reasoning, but Anthropic seems to have much more powerful non-reasoning models than its competitors.
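To put that scaling claim in symbols (the functional form is the commonly cited empirical one; a and b are illustrative constants, not from any particular paper):

    % illustrative: benchmark score as a function of reasoning-token budget n
    \mathrm{score}(n) \approx a + b \log n
    % each doubling of n then buys a roughly constant gain of b \log 2
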
I read somewhere that they have a policy of not advancing capabilities too much, so could it be that they are sandbagging and releasing models with artificially capped reasoning to be at a similar level to their competitors?
How do you read this?
Edit: Input tokens are twice as expensive. That might be a deal breaker.
Honestly, these companies are so hard to take seriously with these release details. If it's an open-source model and you're only comparing open-source models - cool.
If you're not top in your segment, maybe show how your token cost and output speed more than make up for that.
Purposely showing prior-gen models in your release comparison immediately discredits you in my eyes.
Particularly for tool use.
Full list of models provided: https://dev.synthetic.new/docs/api/models
Referral link if you're interested in trying it for free, plus a discount for the first month: https://synthetic.new/?referral=kwjqga9QYoUgpZV
Care to elaborate more?
something that is at parity with Opus 4.5 can ship everything you did in the last 8 weeks, ya know... when 4.5 came out
Just remember to put all of this in perspective: most of the engineers and people here haven't even noticed any of this stuff, and those who have are too stubborn or policy-constrained to use it. The open-source nature of the GLM series helps the policy-constrained organizations, since they can theoretically run it internally or on-prem.
Today's meme was this question: "The car wash is only 50 meters from my house. I want to get my car washed, should I drive there or walk?"
Here is Claude's answer just right now:
"Walk! At only 50 meters (about 150 feet), it would take you less than a minute to walk there. Driving such a short distance would actually be less convenient - by the time you get in the car, start it, drive, and park, you could already be there on foot. Plus, you'd save fuel and your car would be right there waiting for you after the wash is done."
Here is ChatGPT, also right now:
"Since it’s only *50 meters* (about 160 feet), walking makes a lot more sense.
Here’s why:
* *It’s extremely close* — you’ll get there in under a minute.
* *Driving would actually be less efficient* — by the time you start the car, move it, and park, you’re already there.
* *Cold starts aren’t great for your engine* — very short drives don’t let the engine warm up properly.
* *It saves fuel and emissions*, even if it’s a tiny amount.
* You avoid the hassle of maneuvering such a short distance.
The only time driving might make sense is if:
* The weather is terrible (heavy rain, snow, extreme cold).
* You have mobility issues.
* The car wash requires you to stay in the vehicle and you don’t want to leave it there.
Otherwise, this is a perfect “walk it over” situation."
Please save us all that talk about frontier and SOTA, and about how only the closedAI models are any good while the others are all so bad and benchmaxxed. For most purposes a Toyota is just as good as a BMW or a Merc or whatever luxury brand tickles your fancy. Even worse, the latest 80B Qwen Next is not far from Opus 4.6 but runs on my laptop.
But this here is excellent value if they offer it as part of their subscription coding plan. Paying by token could really add up: I did about 20 minutes of work and it cost me $1.50 USD, and it's more expensive than Kimi 2.5.
Still 1/10th the cost of Opus 4.5 or Opus 4.6 when paying by the token.
And we will have Deepseek 4 in a few days...
Since you presumably want the car to be clean, the answer depends entirely on how the car is getting washed.

The Breakdown

If it’s a Professional Car Wash: You’ll need to drive. Bringing the car to the wash is generally the prerequisite for them washing it. Plus, walking 50 meters just to ask them to come pick up a car that is visible from their front door might result in some very confused looks from the staff.

If it’s a Self-Service Pressure Washer: Again, drive. Dragging a 1,500kg vehicle 50 meters by hand is an incredible workout, but probably not the "quick chore" you had in mind.

If you are hiring a Mobile Detailer: Walk. Stroll over there, tell them where you're parked, and walk back home to watch TV while they do the work in your driveway.
You're assuming the conclusion
The previous GLM-4.7 was also supposed to be better than Sonnet and even match or beat Opus 4.5 in some benchmarks (https://www.cerebras.ai/blog/glm-4-7), but in real-world use it didn't perform at that level.
You can't read the benchmarks alone any more.
"Since the goal is to get your car washed, the car needs to be at the car wash. If you walk, you will arrive at the car wash, but your car will still be sitting at home."
Are you sure that question is from this year?
"" [...] Since you need to get your car washed, you have to bring the car to the car wash—walking there without the vehicle won't accomplish your goal [...] If it's a self-service wash, you could theoretically push the car 50 meters if it's safe and flat (unusual, but possible) [..] Consider whether you really need that specific car wash, or if a mobile detailing service might come to you [...] """
Which seems slightly (unintentionally) funny.
But to be fair, all the Gemini (including Flash) and GPT models I tried did understand the question.
You should definitely drive.
If you walk there, your car will still be dirty back at your house! Since the goal is to get the car washed, you have to take it with you.
PS fantastic question!
It's a big deal that open-source capability is less than a year behind frontier models.
And I'm very, very glad it is. A world in which LLM technology is exclusive and proprietary to three companies from the same country is not a good world.
Before you get too excited: GLM-4.7 outperformed Opus 4.5 on some benchmarks too (https://www.cerebras.ai/blog/glm-4-7 - see the LiveCodeBench comparison).
The benchmarks of the open weights models are always more impressive than the performance. Everyone is competing for attention and market share so the incentives to benchmaxx are out of control.
Obviously, for the average US taxpayer, getting along with China is in our interest - not so much for our economic elites.
I use both Chinese and US models, and Mistral in Proton’s private chat. I think it makes sense for us to be flexible and not get locked in.
OpenCode in particular has huge community support around it- possibly more than Claude Code.
Here is the response from Gpt-5.2 using my default custom instructions in the mac desktop app.
OBJECTIVE: Decide whether to drive or walk to a car wash ~50 meters from home, given typical constraints (car must be present for wash).
APPROACH: Use common car-wash workflows + short-distance driving considerations (warm engine, time, parking/queue). No web sources needed (stable, local decision).
OUTPUT:
• Default choice: Drive. A car wash generally requires the car on-site, and 50 meters of driving is trivial.
• When walking makes sense:
  • You’re going to drop the car off (hand-wash/detail) and pick it up later.
  • You want to avoid moving the car now (ice, tight parking, you’ve been drinking, etc.) and someone else will handle the car.
• Micro-guidance:
  • If it’s an automatic wash you stay in, drive.
  • If it’s detail/hand wash with a wait, drive there, then walk home (or just wait).
CONFIDENCE: 92/100 Assumptions: It’s a normal car wash that requires your car to be there; conditions are safe to drive 50 meters.
This is a silly test for the big coding models.
This is like saying "all calculators are the same, nobody needs a TI-89!" and then adding 1+2 on a pocket calculator to prove your point.
Context for the unaware: https://simonwillison.net/tags/pelican-riding-a-bicycle/
We need a new, authentic scenario.
This pattern of considering 90% accuracy (like the level we've seemingly stalled out on for the MMLU and AIME) to be 'solved' is really concerning for me.
AGI has to be 100% right 100% of the time to be AGI and we aren't being tough enough on these systems in our evaluations. We're moving on to new and impressive tasks toward some imagined AGI goal without even trying to find out if we can make true Artificial Niche Intelligence.
>China’s philosophy is different. They believe model capabilities do not matter as much as application. What matters is how you use AI.
2. Cost is only one input into price determination, and we really have absolutely zero idea what the margins on inference even are, so assuming the current pricing is actually connected to costs is suspect.
Have you had good results with the other frontier models?
I'm developing a personal text editor with vim keybindings and paused work because I couldn't think of a good interface that felt right. This could be it.
I think I'll update my editor to do something like this but with intelligent "collapsing" of extra text to reduce visual noise.
They're comparing against 5.2 xhigh, which is arguably better than 5.3. The latest from openai isn't smarter, it's slightly dumber, just much faster.
Have any of these outfits ever publicly stated they used Nvidia chips? As in the non-officially obtained ones. No.
> US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallow imports of Nvidia chips
Sort of. It's all a front, on both sides. China still ALWAYS had access to Nvidia chips - whether that's the "smuggled" ones or they run them in another country. It's not costing Nvidia much. The opening of China sales for Nvidia likewise isn't as much of a boon; it's already priced in.
> At a time when Trump admin is trying to do whatever it can to reduce the US trade imbalance with China
Again, it's a front. It's about news and headlines. Just like when China banned lobsters from a certain country, the only thing that happened was that they went to Hong Kong or elsewhere, got rebadged and still went in.
Intelligence per <consumable> feels closer. Per dollar, or per second, or per watt.
That being said, this model is MIT licensed, so it's a net benefit regardless of being benchmaxxed or not.
Let's have a serious discussion. Just because Claude's PR department coined the term benchmaxxing, we should not be using it unless they shell out some serious money.
As far as I can tell for AIME, pretty much every frontier model gets 100% https://llm-stats.com/benchmarks/aime-2025
1. Take the top ten searches on Google Trends (on the day of a new model release).
2. Concatenate them.
3. SHA-1 hash the result.
4. Use this as a seed to perform a random noun-verb lookup in an agreed-upon large dictionary.
5. Construct a sentence using an agreed-upon stable algorithm that generates reasonably coherent prompts from an immensely deep probability space.
That's the prompt. Every existing model is given that prompt and compared side-by-side. You can generate a few such sentences for more samples.
Alternatively, take the top ten F500 stock performers. Some easy signal that provides enough randomness but is easy to agree upon and doesn't provide enough time to game.
Teams can also pre-generate candidate problems in this style to attempt improvement across the board, but they won't have the exact questions on test day. A sketch of the scheme is below.
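A rough sketch of steps 1-5, where everything is illustrative: get_top_searches() stands in for the Google Trends fetch (or the F500-ticker alternative), and the sentence template stands in for the "agreed upon stable algorithm":

    import hashlib
    import random

    NOUNS = ["pelican", "lighthouse", "accordion", "glacier", "typewriter"]
    VERBS = ["riding", "juggling", "repairing", "painting", "orbiting"]

    def get_top_searches() -> list[str]:
        # Hypothetical stand-in: the day's top queries, fetched on release day.
        return ["glm-5 release", "weather tomorrow", "nba scores"]

    def daily_prompts(n_samples: int = 3) -> list[str]:
        # Steps 1-3: concatenate the searches and SHA-1 hash them.
        digest = hashlib.sha1("".join(get_top_searches()).encode()).hexdigest()
        # Step 4: seed a deterministic RNG with the hash, so every lab
        # derives exactly the same prompts independently.
        rng = random.Random(int(digest, 16))
        # Step 5: a placeholder "stable algorithm" -- one fixed template.
        return [
            f"Generate an SVG of a {rng.choice(NOUNS)} {rng.choice(VERBS)} a {rng.choice(NOUNS)}"
            for _ in range(n_samples)
        ]

The point of seeding from the hash is that anyone can reproduce the exact prompts after the fact, but nobody can know them before release day.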
The US bluff got called. A year back it looked like the US held all the cards and could squeeze others without negative consequences, i.e. have its cake and eat it too.

Since then: China has not backed down, Europe is talking de-dollarization, BRICS is starting to find a new gear on a separate financial system, merciless mocking across the board, zero progress on Ukraine, the Fed wobbled, focus on gold as an alternative to US fiat, NATO wobbled, endless scandals, a reputation for TACO, weak employment, tariff chaos, calls for withdrawal of gold from the US's safekeeping, chatter about dumping US bonds, multiple major countries being quite explicit about telling Trump to get fucked.

Not at all surprised there is a more modest tone... none of this is going the "without negative consequences" way.
>Mistral in Proton’s private chat
TIL
I'm not immediately discounting Z.ai's claims because they showed with GLM-4.7 that they can do quite a lot with very little. And Kimi K2.5 is genuinely a great model, so it's possible for Chinese open-weight models to compete with proprietary high-end American models.
I've noticed that whenever such a meme comes out, if you check immediately you can reproduce it yourself, but after a few hours it's already updated.
Last time there was hype about a GLM coding model, I tested it with some coding tasks and it wasn't usable compared with Sonnet or GPT-5.
I hope this one is different
Codex was super slow till 5.2 codex. Claude models were noticeably faster.
Dollar/watt are not public and time has confounders like hardware.
Uh, yes? DeepSeek explicitly said they used H800s [1]. Those were not banned at the time, btw. Then the US banned them too. Then the US was like 'uhh okay maybe you can have the H200', but then China said it was not interested.
> They believe model capabilities do not matter as much as application.
Let's see what their tone is when their hardware can match up.
It doesn't matter because they can't make it matter (yet).
And yes, the consequence is strengthening the actual enemies of the USA, their AI progress is just one symptom of this disastrous US administration and the incompetence of Donald Trump. He really is the worst President of the USA ever, even if you were to just judge him on his leadership regarding technology... and I'm saying this while he is giving a speech about his "clean beautiful coal" right now in the White House.
>The main flaw is that this idea treats intelligence as purely abstract and not grounded in physical reality. To improve any system, you need resources. And even if a superintelligence uses these resources more effectively than humans to improve itself, it is still bound by the scaling of improvements I mentioned before — linear improvements need exponential resources. Diminishing returns can be avoided by switching to more independent problems – like adding one-off features to GPUs – but these quickly hit their own diminishing returns.
Literally everyone already knows the problems with scaling compute and data. This is not a deep insight. His assertion that we can't keep scaling GPUs is apparently not being taken seriously by _anyone_ else.
Those of us who just want to get work done don't care about comparisons to old models, we just want to know what's good right now. Issuing a press release comparing to old models when they had enough time to re-run the benchmarks and update the imagery is a calculated move where they hope readers won't notice.
There's another type of discussion where some just want to talk about how impressive it is that a model came close to some other model. I think that's interesting, too, but less so when the models are so big that I can't run them locally anyway. It's useful for making purchasing decisions for someone trying to keep token costs as low as possible, but for actual coding work I've never found it useful to use anything other than the best available hosted models at the time.
I'm guessing both humans and LLMs would tend to get the "vibe" from the pelican task, that they're essentially being asked to create something like a child's crayon drawing. And that "vibe" then brings with it associations with all the types of things children might normally include in a drawing.
Do electric pelicans dream of touching electric grass?
Then they haven't. I said the non-officially obtained ones, the ones they can't / won't mention, i.e. those Blackwells etc...
Which is exactly how you're supposed to prompt an LLM. Is the fact that giving a vague prompt gives poor results really surprising?
It's just an experiment on how different models interpret a vague prompt. "Generate an SVG of a pelican riding a bicycle" is loaded with ambiguity. It's practically designed to generate 'interesting' results because the prompt is not specific.
It also happens to be an example of the least practical way to engage with an LLM. It's no more capable of reading your mind than anyone or anything else.
I argue that, in the service of AI, there is a lot of flexibility being created around the scientific method.
While I do understand your sentiment, it might be worth noting that the author is the author of bitsandbytes, which was one of the first libraries with quantization methods built in and was(?) one of the most used quantization backends. I'm pretty sure transformers from HF still uses it as the Python-to-CUDA bridge.
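For context, this is roughly what that integration looks like from the transformers side. The model id is just an example, and the settings are the common QLoRA-style 4-bit defaults:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # quantize weights to 4-bit on load
        bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
        bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    )

    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1",            # example model id
        quantization_config=quant_config,       # bitsandbytes kernels under the hood
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

All the actual quantized matmuls there are dispatched to bitsandbytes' CUDA kernels, which is the Python-to-CUDA role the comment describes.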
That you think corporations are anything close to quick enough to update their communications on public releases like this only shows that you've never worked in corporate
The whole idea of this question is to show that pretty often implicit assumptions are not discovered by the LLM.
I don't think there's a good description anywhere. https://youtube.com/@t3dotgg talks about it from time to time.
For the last generation of models, and for today's flash/mini models, I think there is still a not-unreasonable binary question ("is this a pelican on a bicycle?") that you can answer by just looking at the result: https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/