GLM 5.2 vs. Opus

I seriously don't get all this big hullabaloo about one-shot prompting.

By definition, a single prompt won't capture the complexity of a software project. Ergo, what you'll get is a series of assumptions made by the model based on preexisting code in its training corpus.

I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.

I'd rather see performance in agent loops against human-defined objectives, where it can be verified to stick to defined guardrails and continue without drift till its objectives are complete.

I'd also like to see it identify bugs and potential performance increases by examining existing code and suggesting refactors based on context it can pick up about the particular use case you are trying to create.

These are way more valuable metrics than "hey build X"

I feel like another comparison worth looking at is purely cost.

Capability per dollar is something I care about:

    Opus API    $5/$25
    Sonnet API  $5/$15
    Haiku API   $1/$5

    GLM 5.2 API $1.4/$4.4

So you're really getting near opus level capability for the price of haiku.
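To make the capability-per-dollar comparison a bit more concrete, here is a minimal sketch that prices one job at the rates listed above. It assumes the figures are USD per million input/output tokens, and the token counts are made up purely for illustration.

```python
# Prices as quoted above, assumed to be USD per million (input, output) tokens.
PRICES = {
    "Opus": (5.00, 25.00),
    "Sonnet": (5.00, 15.00),
    "Haiku": (1.00, 5.00),
    "GLM 5.2": (1.40, 4.40),
}


def job_cost(model, input_tokens, output_tokens):
    """Return the USD cost of one job at the quoted list prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 2M input tokens and 200k output tokens.
    for model in PRICES:
        print(f"{model:8s} ${job_cost(model, 2_000_000, 200_000):.2f}")
```

On this invented workload GLM 5.2 lands slightly above Haiku and at roughly a quarter of Opus. The ratio shifts with the input/output mix, so plug in your own token counts.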

> So we ran it head-to-head against Claude Opus 4.8: same one-shot prompt, build a 3D platformer in raw WebGL from scratch

Running a single one-shot prompt is not a benchmark, nor is it representative of any sort of real-world usage.

Most agent usage is collaborative, so you need to test things like reliability (when I delegate a task, does it complete it without, e.g., making up test results?) and steerability (does it obey my instructions, or does it just do what it thinks is best?).

I was never able to get these models to collaborate with me the way Opus does. I'm probably an outlier: I don't one-shot projects, I don't vibe code. I basically use LLMs as if I were working with a coworker, a fairly smart one, but with short memory and often missing the big picture. Sometimes I can delegate more, sometimes less, but I know I always have to stay on top of what's happening, because it WILL create a mess when it hits something hard. With the Anthropic models, this kind of cooperation is easy (with the exception of Opus 4.6, which was bad for some reason).

I've been checking out GLM 5.2 on some projects and have a few thoughts on it:

- it takes its sweet time to get code rolling, not the fastest model by any means

- it strays a lot during discovery/planning but then corrects

- it's not steering friendly, as it hallucinates things that it doesn't follow later on

- its output is quite good

A sample use case: I was optimizing rendering on a Swift+Zig codebase. It choked on 5k data entries.

GLM 5.2 spent 20 minutes building the benchmarks and getting data out, which frustrated me, so I blocked non-editing tool access and went AFK. After approx. 30 minutes I found that it had used the already-made benchmarks and some "conclusions" to optimize 3 choke points. Its output noted that it couldn't validate its suspicions and asked for more data.

The implementation worked well; it was idiomatic and non-intrusive. I would even say that it was more idiomatic than GPT 5.5's efforts on the same repo.

I would opt in to using it more, BUT GPT usually completes the same requests 5x faster.

GLM 5.2 was the spark for preparing and running inside isolated containers with JJ workspaces (so that multiple can be run in parallel).

I’m actually amazed at the output since GLM doesn’t have eyes. If GLM 5.2 costs 1/5 as much, seems like it could be set up to reach out to a multimodal model for vision tasks when required. Closer to parity but probably still significantly cheaper.

One nice thing about GLM is that it has never refused a task. I'm working on a website that renders countries right now, and Anthropic's models regularly give me the old "This request triggered safety guardrails."

I'm not sure what exactly triggers it, but it seems to happen when it has to look at lists of countries. I suspect there must be at least one country name that triggers the safety guardrail.

You'd expect GLM to balk at something like Taiwan, but so far, it hasn't.

People are looking for ways not to burn through their premium subs when in many cases all you have to do is move down to 5.4-mini codex and it will probably solve your issue while barely touching your 5 hour or weekly limits.

No one has really talked about a hybrid setup: using Opus to plan and orchestrate GLM's work, both through the initial build and code reviews. That’s a true best of both worlds and there doesn’t need to be a winner.

> Through an API it costs a fraction of Opus, and you can run it yourself for free if you have the hardware.

I haven't been keeping up on hardware costs for state of the art LLM inference, but this remark made me ask myself how many readers of the article would actually be able to run this model on hardware they own. How much would it cost to acquire such a setup?

I've signed up with Ollama to experiment with these open source models. For the past 3 months, it's just been experimenting, trying it out. GLM is the first model that I am using on a daily basis to do my coding work (as well as using Claude). It's good - I've been maxing out my Ollama usage limits every day :)

So GLM emits fewer tokens and does fewer tool calls, but still takes over twice as long to complete.

Can someone explain to me where that time usage is coming from if not from the model operation itself?

Are the individual tool calls more complex and take more time to complete? Or is the rate of tok/s lower because the model does more compute per token?

GLM-5.2 cost a fraction as much. Opus finished in half the time and shipped a cleaner game

This implies Opus was potentially much (?) better value.

GLM cost a quarter but Opus was twice as fast. So we are already at GLM actually costing half when you compare on time, without even considering the extra effort and time it would take to get Opus-par results.

It's good to have cheaper options and very impressive to see the Chinese continue to set open standards in this field, but the article is maybe a little over-generous.

"GLM-5.2 hit a problem here, because it can't read images. It isn't multimodal. So instead of looking at a screenshot, it fell back on a hacky workaround: it wrote scripts to read the raw pixel data and check whether the colors came out roughly as expected."

A better way would be to use https://github.com/openbmb/MiniCPM-V

Check out my comparison too, it has some not-really-benchmarks too (between any two models actually, SVG generation test and CSS animation test):

https://aibenchy.com/compare/anthropic-claude-opus-4-8-mediu...

I've been using GLM 5.2 extensively for the last few days. It is slower, and the lack of multimodality is a bummer.

But, it produces solid results for a fraction of the price. Worth checking out if you have the time.

One of my go-to "tests" of a new frontier model is having it rebuild a programming language from scratch. For GLM 5.2 I had it rebuild the old Rebol language in Rust:

https://github.com/mhs/rebol-clone-glm-5.2

It did a fairly good job roughing in the language for a low token cost.

I know that running this locally is prohibitively expensive (for now), but what kind of cost would I be looking at if I wanted to rent the hardware and run the model by myself?

this comparison seems kind of pointless if one model has vision and the other doesn't. obviously a model that can see is going to beat a blind model at making a video game.

It is insane that we are comparing locally-hostable models to leading cloud providers, it is wild to me that this article even exists.

We have come a long way, and very clearly have a long way yet to go.

> GLM-5.2 cost a fraction as much. Opus finished in half the time and shipped a cleaner game.

Off topic, but does anyone else instantly pick up on LLMisms like this? It seems like all the models have converged on this style of writing, and improvements aren't really changing it.

GLM 5.2 has one big issue that will limit its meaningful success and that's the value of their coding subscription.

Yes, in terms of API pricing, GLM 5.2 outperforms the competition. But the only people that use API billing for their coding work are large corporations, where these highly subsidized subscriptions are being phased out.

At the same time, none of these companies will use a Chinese API for their employees.

For individuals and smaller teams, Z.ai's coding subscription is outperformed by Anthropic and OpenAI. You probably get around the same usage with Claude, but Codex definitely offers more usage for the amount you pay.

We can have a debate how much Z.ai closed the gap to GPT5.5 and Opus 4.8, but if I can freely decide between them in a world where they all cost the same, I simply wouldn't choose GLM.

So the important question becomes: How good will the offering from Z.ai get with GLM 5.3 or 6 and how much will OpenAI and Anthropic cripple their current offering in the near future.

To me one-shot prompting is as relevant as Strava's KOM is for cycling; I'm more interested in a good cycling performance after a 3-hour ride than a straight-up 30-minute record effort.

So the benchmark is: two models with different harnesses produced very different results.

The GLM game was completely broken. The Opus game was OK at first glance, but also had bugs.

Different models with different costs produced different, non-perfect results. How is it “close”? :)

Also on costs: GLM burns more tokens on average vs. Opus. GPT 5.5 burns fewer, surprisingly.

There is no comparison between GLM 5.2 and Opus. First, to run GLM 5.2 you need serious hardware, and that costs serious money, so you might as well buy the Opus subscription and enjoy.

This style of comparison is decent at showing capability, but it doesn't really show me what I truly want - a sounding board and implementer with senior engineer-level execution. When I look back at all the teams that I've been part of, the best outcomes came from white-boarding (sometimes in the metaphorical sense) with one or two people, at times arguing, then finally compromising on a plan. Instead of synthetic benchmarks that try to be objective, I wonder if there's a way to test this, or maybe I'm opining on a way of working that will soon be gone?

GLM-5.2 is quietly becoming the most interesting open model release this year. The coding benchmarks are surprisingly close to frontier models at a fraction of the inference cost.

>On output tokens, GLM-5.2 is less than a fifth the price of Opus.

Opus is the most expensive model on pay-as-you-go pricing, but IMO a fair comparison should include subscription prices as well. For example, when one has the $100 Claude Max plan and uses it up through the month, it might not be more expensive than GLM, or at least not 5x.

Cost difference matters most, as cost optimization is the whole point of AI. Time difference (30 min vs 1 hr) is not a deal-breaker. The small precision gap on the first iteration does not matter for 99% of the work that happens in the real world.

Pretty clearly it's beating Opus at [web dev](https://www.gptbased.com/) - on price, on score.. I mean what else is there?

You should repeat this experiment but with progressively more detail in the initial prompt. Claude's secret sauce is taking weakly specified prompts and making passable things from them, but as the degrees of freedom in the prompt go down Claude starts to disobey while other models close in on the intent.

I was surprised today by how much better GLM-5.2 was than GPT-5.5 at aesthetic/UI work. I'll keep my Claude/Codex setup via Conductor for now, but this model got me to set up OpenCode, download their desktop app and do most of my work there today.

How are people running this locally? I just checked llama.cpp and it appears unsloth has a version but it hacks a bunch of things to make it work and isn't optimal.

https://github.com/ggml-org/llama.cpp/issues/24730

My understanding was that n-shot prompting just referred to the number of examples included in a prompt, not the number of prompts to achieve the desired result.

"Build a 3D platformer game from scratch, in raw WebGL, with no game engine or 3D library" would be a zero-shot prompt.
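To illustrate the distinction, here is a minimal sketch of how the "n" in n-shot counts worked examples inside a single prompt. The `build_prompt` helper and its "Input:/Output:" layout are hypothetical, not any particular vendor's format.

```python
def build_prompt(task, examples=()):
    """Build one prompt; len(examples) is the 'n' in n-shot.

    `examples` is a sequence of (input, output) pairs shown to the model
    before the real task. With no examples the prompt is zero-shot.
    """
    parts = [
        f"Input: {example_input}\nOutput: {example_output}"
        for example_input, example_output in examples
    ]
    parts.append(f"Input: {task}\nOutput:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Zero-shot: just the task, no worked examples.
    print(build_prompt("Build a 3D platformer in raw WebGL"))
    print("---")
    # Two-shot: two worked examples precede the task.
    print(build_prompt("3 + 4", examples=[("1 + 1", "2"), ("2 + 5", "7")]))
```

Either way it is still one prompt; what the article calls "one-shot" is really a single-turn run.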

I used GLM 5.0/5.1/5.2 for some projects, and for me, the area in which they lag behind frontier models the most is user interfaces. They get really close to Opus when it comes to pure algorithms, but when I need something like a web application or a mobile app that looks and works well, they are very noticeably worse than even Sonnet.

I signed up for GLM 5.2 yesterday to try it out because Anthropic kept throwing 529 Overloaded

I like it, but the lite plan ate 22% usage of my 5h reset window in a single session after 2 prompts on xhigh of GLM 5.2 [1m]

Result was satisfactory, I think stuff is decent, I'm happy to use either, wish there was a combined subscription plan where I could get both

I've seen GLM 5.2 struggle to write simple compilable C code. It might be good at web, but its world knowledge is limited due to the small model size, making its use quite limited in my opinion.

> 256 GiB unified RAM.

So, 8000$, plus it's unavailable. 3 years of Codex/Opus subscription.

> API prices

Which are irrelevant for 200$ Codex/Opus plans that are times cheaper.
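As a back-of-the-envelope check on the figures above, here is a minimal sketch of the break-even arithmetic. Pairing the $8000 hardware figure with a $200/month plan is an assumption, and electricity, depreciation and resale value are ignored.

```python
def breakeven_months(hardware_cost, monthly_subscription):
    """Months of subscription fees that add up to the hardware price."""
    return hardware_cost / monthly_subscription


if __name__ == "__main__":
    months = breakeven_months(8000, 200)
    print(f"{months:.0f} months (~{months / 12:.1f} years)")
```

With those numbers the break-even is 40 months, a bit over three years, which is in line with the "3 years" estimate.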

What would be the best way to use these open source models for a price similar to what I could pay for the cheapest plan with Claude and OpenAI?

I would like to give them a try, but I certainly don't have the money to get a system able to run them, and I don't really want to pay more than for the state of the art.

I think inference is the thing, and fast inference at that, so enterprises can just host their own and run it. I guess Vercel does it, and many more would. But it thinks too much; I don't know how fast we can make it.

Great article,

My only feedback, I guess, is that it's not really clear about the price.

Would the 21.92 be the API pricing I guess?

Cost $5.39 (real billed) ~$21.92 (estimate, list pricing)

Are these games supposed to be a good example of quality output? If this is the product, I don't really want to play _either_ of them.

Totally agree with the general assessment. The biggest problem with Z.ai models for a long time has not been quality, but inference speed and general capacity availability. Hopefully with this recent hype, there will be more providers on OpenRouter for 5.2.

Having issues with coding a render for good looking realistic smoke coming off burning incense, opus 4.8 & gpt-5.5 both have code issues, glm-5.2 did it. Amazing.

The real time 3d fluid dynamics appear to be the tricky part, I wish I still had opus access, would love to see if it can do it.

The text only part is the catch for me.

If it builds a UI and can't look at it, it's asking ls whether the app looks right.

I wonder how many tokens and how much time were used for the verification part. Maybe GLM 5.2 instantly found the "solution" to read the screen pixel by pixel, but it could also have been a major token and time consumer.

Still on a z.ai legacy plan and their 50% discount for switching to standard plans tips the balance for me. So I guess I’ll reevaluate round about beginning 2028…

not apples to apples. comparing official vs. pi.dev+openrouter and getting slow times is more an openrouter issue. try comparing using official z.ai.

glm-5.2 is very good if you have a good harness and workflow to use it with. in fact, i'd call it good enough if you are a software engineer who knows what you want. it writes the code. i'm wondering if i need anthropic's models at all at this point, or openai. and surely in a year we won't need them at all. Opus 4.5+ was the turning point for me, and now these open models are just as good. i don't get how you IPO these companies when their only winning product is coding agents and the competition is just as good for 1/4 the price.

> Through an API it costs a fraction of Opus, and you can run it yourself for free if you have the hardware.

Framing local LLMs as free is stupid. Basically paying 100+ months' worth of API costs up front isn't free in the slightest. And it will be slower than non-local, your hardware will be outdated in 12 months, and it probably won't be able to run SOTA at anywhere near non-local speed in 20 months at most.

The price of a small house.

> So we ran it head-to-head against Claude Opus 4.8: same one-shot prompt, build a 3D platformer in raw WebGL from scratch

Running a single one-shot prompt is not a benchmark, nor is it representative of any sort of real-world usage.

Hi, I am the author, I completely agree! I set out to run a vibe test on this one, not a benchmark, the real benchmarks are listed. My test shows what the models can do when both tasked with a long-running, technically difficult, one-shot task.

I think the test you describe (collaborative, task delegation, task completion, TTD, steerability) is a great format for a future test that I will definitely try out.

On the other hand, I did just leave my pi agent running GPT 5.5 overnight on a clearly defined, long running task. It's been running about 10 hours now and it's mostly done. So this kind of use case is also valid.

Thinking about it, I would say that the majority of agentic work I do, by a long shot, is subagents which are launched from the main session, using a prompt of its choosing. Those could be considered short versions of these fully autonomous tasks.

sure that's why we look at a mix of formal benchmarks, one longer analysis of a side-by-side, and various other people who we trust to form an opinion, all covered in the article - not intended to be a formal benchmark, there are enough of those.

Totally agree, a single one-shot prompt can't prove anything.

I've been checking out GLM 5.2 on some projects and have a few thoughts on it:

- it takes its sweet time to get code rolling, not the fastest model by any means

- it strays a lot during discovery/planning but then corrects

- it's not steering friendly, as it hallucinates things that it doesn't follow later on

- its output is quite good

A sample use case: I was optimizing rendering on a Swift+Zig codebase. It choked on 5k data entries.

The implementation worked well; it was idiomatic and non-intrusive. I would even say that it was more idiomatic than GPT 5.5's efforts on the same repo.

I would opt in to using it more, BUT GPT usually completes the same requests 5x faster.

GLM 5.2 was the spark for preparing and running inside isolated containers with JJ workspaces (so that multiple can be run in parallel).

Also, on pricing: I wanted to give it a try, but when it is only 30% cheaper than Opus, I wouldn't go for it with these issues.

This mirrors my experience. I have been using it in Pi. It is smart and output is good but it is not efficient in getting there.

Yeah, it glosses over a gigantic capital expenditure. It's sort of like saying that an open source modern CPU architecture allows you to build your own CPU "for free" (provided that you own and operate a fab).

True. But there are other meanings of "free". I.e. nobody can say "from now on you no longer have access to model X because you're an asshole"

Cool to hear, what kind of tasks have you been using GLM for? And what other models have you found useful through Ollama?

So GLM emits fewer tokens and does fewer tool calls, but still takes over twice as long to complete.

Can someone explain to me where that time usage is coming from if not from the model operation itself?

Are the individual tool calls more complex and take more time to complete? Or is the rate of tok/s lower because the model does more compute per token?

I have noticed that Opus and GPT 5.5 are very good at adjusting their thinking / reasoning intensity depending on the task at hand, something the open weights models are still not as good at.

In addition to that, some of the open weights models like GLM 5.2 or DeepSeek v4 Pro tend to be MUCH slower when generating tokens, which contributes to the perceived slowness. Although I wouldn't call models like GLM 5.2 slow by any means, e.g. it is currently one of the fastest models inside Notion today.

Could just be infra. I'm betting Anthropic is much better prepared.

> GLM-5.2 cost a fraction as much. Opus finished in half the time and shipped a cleaner game.

Off topic, but does anyone else instantly pick up on LLMisms like this? It seems like all the models have converged on this style of writing, and improvements aren't really changing it.

This is excellent feedback, thank you! These LLMisms in writing are a challenge I am living with currently and trying to improve on. The technical writing industry is taking a huge knock right now, with companies demanding more work in less time with a big drop in quality; day to day I get less and less time to work on the quality of the prose in my work. We are working at the frontier of this right now, so we are the most heavily affected, but we also get to experiment with the changes first, which can be both stimulating and very frustrating.

I think a bunch of real humans started to adopt the LLM writing style.

Yes, and it's really grating. It's like half of all new writing is done in the same "voice" now.

GLM 5.2 has one big issue that will limit its meaningful success and that's the value of their coding subscription.

At the same time, none of these companies will use a Chinese API for their employees.

We can have a debate how much Z.ai closed the gap to GPT5.5 and Opus 4.8, but if I can freely decide between them in a world where they all cost the same, I simply wouldn't choose GLM.

So the important question becomes: How good will the offering from Z.ai get with GLM 5.3 or 6 and how much will OpenAI and Anthropic cripple their current offering in the near future.

My impression is that individual subscriptions are the loss leading hook. The money is made on Enterprise token contracts.

Employees and students used to coding with thousands of dollars worth of tokens (on a 20/100 dollar plan) will push enterprise to spend.

Having a Chinese model that is competitive won't displace this enterprise spend. But an open model hosted in the US/EU might.

The existence of GLM 5.2 puts a ceiling on how much OpenAI/Anthropic can charge for API Access.

GLM-5.2 is quietly becoming the most interesting open model release this year. The coding benchmarks are surprisingly close to frontier models at a fraction of the inference cost.

We've had the great small Qwen 3.6 early April that many could actually run on their laptop. Then similar from Google a few weeks later (Gemma4, better in prose, worse in code). Then the super cheap large Deepseek V4 a few weeks later. Then antirez DS4 build that made that actually runnable on MacBooks and Mac Studios. And now the "near-frontier / near-Opus" GLM 5.2.

For people who follow open LLMs, none of these were quiet and all were the most interesting open model release for a few days/weeks. In one or two months, it will be some other model again. Now I do appreciate the real rapid improvements in open models. But there's also a ton of hype and fast-fashion around all of this.

To me DS 4 is still the most interesting due to much lower costs. Also DS 4 training isn't done yet.

From my Opus vs DS 4 Pro personal benchmarks, 16 different real-life work tasks, DS 4 has performed as well as Opus 4.8 high overall but with a few drawbacks:

- on the 16 tasks, one needed several prompts to be steered back into the topic

- its review capabilities seem much worse

- DS4 had the cleanly better solution in 3 cases out of 16, with Opus "only" doing cleanly better 2 times out of 16. But still, I want to emphasize, it is the worst-case scenarios that imho matter the most, not the best ones, and on that front Opus outperformed.

That being said I spent less than 2$ of API working 4 days, which is more or less what I would've spent with Anthropic APIs for less than one task.

>On output tokens, GLM-5.2 is less than a fifth the price of Opus.

There is, for example, OpenCode Go subscription, which for $10 a month gives you a decently generous quota of GLM-5.2, among other models.

And z.ai themselves also have subscriptions.

> For example when one has $100 Claude Max and use it up through the month, it might not be more expensive than GLM, or at least not 5x.

https://z.ai/subscribe

I’m currently trying to figure out whether a downgrade from Max 5x to Pro in combination with one of those would save me money and if so, how much.

Edit: seems like Anthropic Pro + GLM Pro (Yearly) would let me almost halve my costs of Anthropic Max 5x. Only concerns are about GLM 5.2 not having vision support and also being kinda slower and also not being as good as Opus.

Yes, this is true. This test was run on a $20 Pro Claude subscription. I would definitely love to try using both models on the highest plans for a whole month and compare the two; great format for a future head-to-head comparison.

Is it fair when the one is heavily subsidized and the other one is not?

I think it's most fair to compare the plain token pricing that is used by everyone.

GLM has subscription plans too.

Yes I 100% agree. Time-taken can be improved (with harnesses, subagent workflows etc.) and varies based on task.

Pretty clearly it's beating Opus at [web dev](https://www.gptbased.com/) - on price, on score.. I mean what else is there?

Article states it's not multimodal. I guess for webdev that means you can't take a screenshot to indicate errors etc.

I hate to be that guy, but real privacy policy on training data/it being hosted somewhere where I'm not worried about secrets being stored/leaked.

Latency? Just saying there's other things to consider.

Hi, author here. I cannot give an exact number for how many tokens the verification step took, but the verification GLM 5.2 ran was very stupid and definitely a waste of time. It read the pixel color data to try and verify the scene rendered properly, which is really bad. Opus opened the game in a Playwright browser and took screenshots to verify the actual image, which helped a lot.

Pro tip: You could use a multi-modal model to verify images as a subagent spawned by GLM 5.2, to get around this issue.
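For readers wondering what the pixel-reading fallback amounts to, it can be sketched roughly like this. This is not the script GLM 5.2 actually wrote: the function names, tolerance and threshold are hypothetical, and the "frame" is a plain list of RGB tuples rather than a real screenshot.

```python
def color_fraction(pixels, expected, tolerance=30):
    """Fraction of pixels whose R, G and B are all within `tolerance` of `expected`."""
    if not pixels:
        return 0.0
    close = sum(
        1
        for pixel in pixels
        if all(abs(channel - target) <= tolerance
               for channel, target in zip(pixel, expected))
    )
    return close / len(pixels)


def looks_rendered(pixels, expected, min_fraction=0.2):
    """Crude check: is enough of the frame roughly the expected color?"""
    return color_fraction(pixels, expected) >= min_fraction


if __name__ == "__main__":
    # Made-up 4-pixel frame: three sky-blue pixels and one black pixel.
    frame = [(100, 150, 250)] * 3 + [(0, 0, 0)]
    print(color_fraction(frame, (100, 150, 255)))
    print(looks_rendered(frame, (100, 150, 255)))
```

A check like this only confirms that some expected color shows up in the frame. It says nothing about whether the geometry is right, which is why looking at an actual screenshot works so much better.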

I could be wrong but I believe this is a non-vision model. Please weigh in to correct me bc I would love to be wrong

If a model can take a series of increasingly complex instructions and satisfy the requirements without human intervention, we can pretty easily decide how well overall the model does. And, judging better models just means adding more requirements to a task. So, I think it's a useful method (Even if it's not a realistic use case).

Of course, with a software engineer at the helm - the models are going to be able to be guided to produce much better output. (Or worse, depending on the engineer!)

It's true that no one is trying to one shot anything serious right now, but it's still an important metric. Claude Code and Opus really took off when they improved the harnessing enough that it would self-correct many of its mistakes without needing user input. In fact I think long-term autonomy (in the range of several hours) and self-correcting is going to be where we see most improvements in coming years.

On one hand, that's sort of true for practical uses - and benchmarks notoriously undercount multi-turn settings.

On another, being able to reliably tackle minor tasks with no handholding is very valuable in itself. Sometimes implementation details are important, but often, the most important thing is to Get It Done.

The argument is flawed, there is no logical reason to assume a single prompt won’t be sufficient to constitute the complexity of a software project. It may not be practical in many cases but there is too much variability in what is considered a complex software project and in the sufficiency of instruction in a single prompt to make that claim and say it’s “by definition.”

I think you're underestimating the elegance of "hey build X". It already captures a lot of what you're interested in.

Additionally, with "Hey build X" nobody is happy with the methodology and people rightfully complain about the set up.

With your suggestion, the methodology would require a lot of presumptions and arguments about why you chose it and why you think it is relevant to people.

Either people would not "get" it quickly enough, or they would disagree with or not be interested in the setup because it's not how they use LLMs.

> I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.

Guardrails/conventions should be enforced in linters, formatters, static analysis tooling; not specs/prompts.

When the model produces reasonable results from one prompt, you could assume that it will also return reasonable results through the follow up prompts.

Blame Anthropic: they decided to make this type of one-shot example the primary focus of the Fable 5 release, relegating benchmark scores to the PDF.

"We did multi-shot prompting to try and get these two games into comparable states using these two different models."

"Well obviously you provided better follow-up prompts to the one that came out better."

Also nothing about human-provided plan files and guardrails precludes the one-shot benchmark test. Heavens, I almost said "real coding," but in "real agentic program creation" you'd obviously be doing multi-turn interaction with the agent, but how can you provide a fair test when the model's output n determines your n+1 response?

I think that’s the point of the Superpowers SKILL

On one hand, that's sort of true for practical uses - and benchmarks notoriously undercount multi-turn settings.

I think you're underestimating the elegance of "hey build X". It already captures a lot of what you're interested in.

Additionally, with "Hey build X" nobody is happy with the methodology and people rightfully complain about the set up.

Using your suggestion the methodology would require a lot of presumptions & arguments regarding why you choose it and think it relevant to people.

Either people would not "get" it quickly enough or would disagree/not be interested on the setup because its not how they use LLMs.

The price of a small house.

Cool to hear, what kind of tasks have you been using GLM for? And what other models have you found useful through Ollama?

I have noticed that Opus and GPT 5.5 are very good at adjusting their thinking / reasoning intensity depending on the task at hand, something the open weights models are still not as good at.

Could just be infra. I'm betting Anthropic is much better prepared.

I think a bunch of real humans started to adopt the LLMs writing style.

Indeed. I'm trying to develop a similar style. The phrasing in the quoted passage is really tight.

Yep, as I reread my own sentences I notice these LLMisms and have to rewrite them quite often. Reading so much LLM output definitely impacts your writing style.

Yes, and it's really grating. It's like half of all new writing is done in the same "voice" now.

My impression is that individual subscriptions are the loss leading hook. The money is made on Enterprise token contracts.

Employees and students used to coding with thousands of dollars worth of tokens (on a 20/100 dollar plan) will push enterprise to spend.

Having a Chinese model that is competitive won't displace this enterprise spend. But an open model hosted in the US/EU might.

The existence of GLM 5.2 puts a ceiling on how much OpenAI/Anthropic can charge for API Access.

> My impression is that individual subscriptions are the loss leading hook

Except there is no evidence of this at all, just people comparing API and subscription pricing. The leaked financial info for OpenAI shows inference is profitable right now, though it does not show a distinction between subscription and API revenue... but if subscription revenue were so lossy, it would be hard for total inference to still be profitable.

To be clear, I agree with this and they have my unlimited support pushing for relevance of open source models. GLM 5.2 is amazing and I couldn't be more excited.

I just think that as of today, most people will not find a good reason to switch to GLM.

To me DS 4 is still the most interesting due to much lower costs. Also DS 4 training isn't done yet.

From my personal Opus vs DS 4 Pro benchmarks (16 different real-life work tasks), DS 4 has performed as well as Opus 4.8 high overall, but with a few drawbacks:

- on the 16 tasks, one needed several prompts to be steered back into the topic

- its review capabilities seem much worse

That being said, I spent less than $2 on the API over 4 days of work, which is more or less what I would have spent with Anthropic's API on less than one task.

There is, for example, the OpenCode Go subscription, which for $10 a month gives you a decently generous quota of GLM-5.2, among other models.

And z.ai themselves also have subscriptions.

to be exact, it gives you USD 60 of usage of open models.

> For example when one has $100 Claude Max and use it up through the month, it might not be more expensive than GLM, or at least not 5x.

https://z.ai/subscribe

I’m currently trying to figure out whether a downgrade from Max 5x to Pro in combination with one of those would save me money and if so, how much.

Is it fair when the one is heavily subsidized and the other one is not?

I think it's most fair to compare the plain token pricing that is used by everyone.
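For a rough sense of what a plain token pricing comparison looks like, here is a minimal sketch using the API prices quoted earlier in the thread. It assumes those figures are USD per million input/output tokens, and the workload size is a made-up example, not a measurement.

```python
# Prices as quoted earlier in the thread; assumed to be USD per 1M tokens (input, output).
PRICES = {
    "Opus": (5.00, 25.00),
    "GLM 5.2": (1.40, 4.40),
}


def api_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the API cost in USD for a given token volume."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 20M input tokens and 2M output tokens.
for model in PRICES:
    print(f"{model}: ${api_cost(model, 20_000_000, 2_000_000):.2f}")
```

On this made-up mix, Opus comes out at $150.00 and GLM 5.2 at $36.80, so roughly 4x. The ratio shifts with the input/output split, so plug in your own numbers.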

> Is it fair when the one is heavily subsidized

As a consumer, yes, it's totally fair. All that matters to me is the price I pay at the pump, not whether that price is "real" or not.

Z.ai is also believed to be "subsidised". Its parent company is running at a large loss right now.

Anthropic have claimed they expect their first profitable quarter this year -- they may have bigger margins on their raw API than you realise.

GLM has subscription plans too.

Out of stock, unavailable

Yes I 100% agree. Time-taken can be improved (with harnesses, subagent workflows etc.) and varies based on task.

The article states it's not multimodal. I guess for webdev that means you can't take a screenshot to point out errors, etc.

I hate to be that guy, but: a real privacy policy on training data, and hosting somewhere I'm not worried about secrets being stored or leaked.

Open weights win on that front surely?

Realistically you’d need to rotate secrets anyway once it moves from dev to production regardless of model provider

It's on other providers, like Together.ai

Latency? Just saying there's other things to consider.

Pro tip: You could use a multi-modal model to verify images as a subagent spawned by GLM 5.2, to get around this issue.
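A minimal sketch of that subagent idea, assuming nothing about any real API: both models are passed in as plain callables, and every name here (`review_screenshot`, `TextModel`, `VisionModel`) is hypothetical. The vision model turns the image into text, and the text-only coder only ever sees that description.

```python
from typing import Callable

# Both callables are hypothetical stand-ins for real model clients.
TextModel = Callable[[str], str]           # text-only coding model
VisionModel = Callable[[bytes, str], str]  # multimodal model: (image, question) -> text


def review_screenshot(coder: TextModel, vision: VisionModel,
                      screenshot: bytes, task: str) -> str:
    """Let a vision subagent describe the image, then hand that text to the coder."""
    description = vision(screenshot, f"Describe any visual bugs relevant to: {task}")
    prompt = (
        f"Task: {task}\n"
        "A vision subagent looked at the current screenshot and reported:\n"
        f"{description}\n"
        "Propose a fix."
    )
    return coder(prompt)


if __name__ == "__main__":
    # Stub models so the sketch runs without any API access.
    fake_vision = lambda image, question: "The submit button overlaps the footer."
    fake_coder = lambda prompt: f"(coder received {len(prompt)} chars of context)"
    print(review_screenshot(fake_coder, fake_vision, b"not-a-real-image", "fix the checkout layout"))
```

The obvious cost is that the coder is only as good as the description it gets, so anything the vision model fails to mention is invisible to it.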

I could be wrong but I believe this is a non-vision model. Please weigh in to correct me bc I would love to be wrong

GLM 5.2 is text only, not multi modal. And Opus is multi modal.

Of course, with a software engineer at the helm - the models are going to be able to be guided to produce much better output. (Or worse, depending on the engineer!)

You seem to be missing the point of what parent is saying :)

To really evaluate how a model is to use in real life, it should have access to tools, and be able to iterate on something, like they do when you use them in an agent harness.

None of that iteration necessarily needs a human driving it (although if you're building something you want to be able to maintain, you probably need a human driving the design and architecture). You can just let the model make a couple of tries and give it feedback on how it's doing, and you get something closer to how people use these models in reality.
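As a rough sketch of what such a human-free loop could look like in an eval: the model gets a few tries, and after each one an automated check (tests, compiler, linter) feeds its output back in. The `attempt` and `check` interfaces are hypothetical, not any particular harness's API.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces: `attempt` is the model, `check` is any automated tool
# (test suite, compiler, linter) that returns (passed, feedback).
Attempt = Callable[[str, Optional[str]], str]
Check = Callable[[str], Tuple[bool, str]]


def run_with_feedback(task: str, attempt: Attempt, check: Check,
                      max_tries: int = 3) -> Tuple[bool, int]:
    """Return (solved, tries_used), feeding tool output back to the model each round."""
    feedback: Optional[str] = None
    for tries in range(1, max_tries + 1):
        output = attempt(task, feedback)
        passed, feedback = check(output)
        if passed:
            return True, tries
    return False, max_tries
```

Reporting `tries_used` alongside pass/fail keeps the one-shot number (solved on try 1) while also showing how well a model recovers once it sees tool output.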

> In fact I think long-term autonomy (in the range of several hours) and self-correcting is going to be where we see most improvements in coming years.

Right, model intelligence defines the scope of things they can one shot

I also suspect that users naturally calibrate to a model's useful scope, gradually getting positive/negative feedback and gradually making their requests bigger/smaller than before

It won't happen, it's all a money grab.

And that prompt will basically be 2000 page spec Bible à la IBM circa 1960, see waterfall. Unless AI develops mindreading (and advanced mindreading at that), single prompt creation of actual complex software products will never happen. You'll one shot a simple non scientific calculator, but not Excel or Vim or Nginx.

One-shot prompting/tooling is the only reasonable way to use an LLM, in my opinion. You should not have an LLM operating for hours, creating thousands of lines of new code that you can never review or maintain. You can actually be highly productive modifying a single file or two at a time, ideally with as focused and as little context as possible, without the LLM being given full permission to add as much context as possible along the way to maximize revenue for the developers of the harness.

The agentic engineering paradigm is just a narrative trend pushed by AI companies to get people to 10x their token consumption per prompt. It also plays into people's laziness and addiction to dopamine, causing addict-like behavior in people who fall prey to this trend.

> I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.

Guardrails/conventions should be enforced in linters, formatters, static analysis tooling; not specs/prompts.

It's not always possible, or at least trivial. For example how do you enforce "prefer to reuse existing code over making a copy"? Is there a static analysis tool that will detect two pieces of code that do the same thing?

I think the test you describe (collaborative, task delegation, task completion, TTD, steerability) is a great format for a future test that I will definitely try out.

Tbf, most of the "real benchmarks" have issues that are just as bad. Assessing LLM performance is just hard

Thanks, I didn't mean to be brusque, but I have seen a lot of these vibe tests lately that come to grand conclusions like "X model is better than Y" from the result of a single prompt.

Appreciate you sharing the results of your tests though!

Sure, real-world usage is always more difficult to benchmark, but the additional issue with the one-shot prompting benchmark is that by optimizing for it, models are nudged towards making all those assumptions they shouldn't really make. Maybe a better test would be to have a fully spec'd-out plan, but start with a one-shot, high-level prompt and expect the agent to discover your preferences by repeatedly asking for clarifications. The system that manages to suss out more of the details in the hidden spec this way, in fewer steps and with fewer unnecessary questions, would more likely be a truly well-calibrated agent.
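To make the scoring side of that idea concrete, here is a minimal sketch. It assumes the hard part is already done, namely that each question the agent asked has been tagged (by a human or a judge) with the spec item it targets. The names `ClarificationScore` and `score_clarifications` are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class ClarificationScore:
    coverage: float   # fraction of hidden spec items the agent uncovered
    questions: int    # total questions asked (the "steps")
    unnecessary: int  # questions that hit nothing in the spec, or repeated an item


def score_clarifications(hidden_spec: set[str], asked: list[str]) -> ClarificationScore:
    """Score a transcript where each question is tagged with the spec item it targets."""
    discovered: set[str] = set()
    unnecessary = 0
    for topic in asked:
        if topic in hidden_spec and topic not in discovered:
            discovered.add(topic)
        else:
            unnecessary += 1
    coverage = len(discovered) / len(hidden_spec) if hidden_spec else 1.0
    return ClarificationScore(coverage, len(asked), unnecessary)


# Hypothetical run: four hidden preferences, four questions asked.
spec = {"camera", "controls", "art_style", "physics"}
print(score_clarifications(spec, ["controls", "camera", "controls", "favorite_color"]))
```

In this made-up run the agent uncovers half the spec in four questions, two of which were wasted. How to weigh coverage against question count is left open, since that trade-off is exactly what the benchmark would have to argue for.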