I haven't been keeping up on hardware costs for state of the art LLM inference, but this remark made me ask myself how many readers of the article would actually be able to run this model on hardware they own. How much would it cost to acquire such a setup?
I would like to give them a try, but I certainly do not have the money to get a system able to run them, and I don't really want to pay more than the state of the art.
My only feedback, I guess, is that it's not really clear about the price.
Would the 21.92 be the API pricing I guess?
Cost $5.39 (real billed) ~$21.92 (estimate, list pricing)
The real time 3d fluid dynamics appear to be the tricky part, I wish I still had opus access, would love to see if it can do it.
By definition, a single prompt won't constitute the complexity of a software project. Ergo, what you'll get is a series of assumptions made by the model based on preexisting code in its training corpus.
I'd rather see a coding agent that can follow steps in a plan file to a T while following guardrails and adhering to the proper coding conventions in the human reviewed spec.
I'd rather see performance in agent loops against human-defined objectives, where it can be verified to stick to defined guardrails and continue without drift until its objectives are complete.
I'd also like to see it identify bugs and potential performance increases by identifying existing code and suggesting refactors based on context it can pickup about the particular use case you are trying to create.
These are way more valuable metrics than "hey build X"
Running a single one-shot prompt is not a benchmark, nor is it representative of any sort of real-world usage.
Most agent usage is collaborative, so you need to test things like reliability (when I delegate a task, does it complete it without, e.g., making up test results?) and steerability (does it obey my instructions, or does it just do what it thinks is best?).
- it takes its sweet time to get code rolling, not the fastest model by any means
- it strays a lot during discovery/planning but then corrects
- it's not steering friendly, as it hallucinates things that it doesn't follow later on
- its output is quite good
A sample use case: I was optimizing rendering on a Swift+Zig codebase. It choked on 5k data entries.
GLM 5.2 spent 20 minutes building the benchmarks and getting data out, which frustrated me, so I blocked non-editing tool access and went AFK. After approx. 30 minutes I found that it had used the already-made benchmarks and some "conclusions" to optimize 3 choke points. The output pointed out that it couldn't validate its suspicions and asked for more data.
The implementation worked well; it was idiomatic and non-intrusive. I would even say that it was more idiomatic than GPT 5.5's efforts on the same repo.
I would opt in to using it more, BUT GPT usually completes the same requests 5x faster.
GLM 5.2 was the spark for preparing and running inside isolated containers with JJ workspaces (so that multiple can be run in parallel).
Can someone explain to me where that time usage is coming from if not from the model operation itself?
Are the individual tool calls more complex and take more time to complete? Or is the rate of tok/s lower because the model does more compute per token?
Off topic, but does anyone else instantly pick up on LLMisms like this? It seems like all the models have converged on this style of writing, and improvements aren't really changing it.
Yes, in terms of API pricing, GLM 5.2 outperforms the competition. But the only people that use API billing for their coding work are large corporations, where these highly subsidized subscriptions are being phased out.
At the same time, none of these companies will use a Chinese API for their employees.
For individuals and smaller teams, Z.ai's coding subscription is outperformed by Anthropic and OpenAI. You probably get around the same usage with Claude, but Codex definitely offers more usage for the amount you pay.
We can have a debate how much Z.ai closed the gap to GPT5.5 and Opus 4.8, but if I can freely decide between them in a world where they all cost the same, I simply wouldn't choose GLM.
So the important question becomes: How good will the offering from Z.ai get with GLM 5.3 or 6, and how much will OpenAI and Anthropic cripple their current offering in the near future?
Opus is the most expensive model under pay-as-you-go pricing, but IMO a fair comparison should include subscription prices as well. For example, when one has the $100 Claude Max plan and uses it up through the month, it might not be more expensive than GLM, or at least not 5x.
On the other hand, being able to reliably tackle minor tasks with no handholding is very valuable in itself. Sometimes implementation details are important, but often, the most important thing is to Get It Done.
Additionally, with "Hey build X" nobody is happy with the methodology and people rightfully complain about the set up.
Using your suggestion the methodology would require a lot of presumptions & arguments regarding why you choose it and think it relevant to people.
Either people would not "get" it quickly enough, or they would disagree with or not be interested in the setup because it's not how they use LLMs.
In addition to that, some of the open weights models like GLM 5.2 or DeepSeek v4 Pro tend to be MUCH slower when generating tokens, which contributes to the perceived slowness. Although I wouldn't call models like GLM 5.2 slow by any means, e.g. it is currently one of the fastest models inside Notion today.
Employees and students used to coding with thousands of dollars worth of tokens (on a 20/100 dollar plan) will push enterprise to spend.
Having a Chinese model that is competitive won't displace this enterprise spend. But an open model hosted in the US/EU might.
The existence of GLM 5.2 puts a ceiling on how much OpenAI/Anthropic can charge for API Access.
For people who follow open LLMs, none of these were quiet and all were the most interesting open model release for a few days/weeks. In one or two months, it will be some other model again. Now I do appreciate the real rapid improvements in open models. But there's also a ton of hype and fast-fashion around all of this.
From my Opus vs DS 4 Pro personal benchmarks, 16 different real-life work tasks, DS 4 has performed as well as Opus 4.8 high overall, but with a few drawbacks:
- on the 16 tasks, one needed several prompts to be steered back into the topic
- its review capabilities seem much worse
- DS4 had the cleanly better solution in 3 cases out of 16, with Opus "only" doing cleanly better 2 times out of 16. But still, I want to emphasize, it is the worst-case scenarios that imho matter the most, not the best ones, and on that front Opus outperformed.
That being said I spent less than 2$ of API working 4 days, which is more or less what I would've spent with Anthropic APIs for less than one task.
And z.ai themselves also have subscriptions.
I’m currently trying to figure out whether a downgrade from Max 5x to Pro in combination with one of those would save me money and if so, how much.
Edit: seems like Anthropic Pro + GLM Pro (Yearly) would let me almost halve my costs of Anthropic Max 5x. Only concerns are about GLM 5.2 not having vision support and also being kinda slower and also not being as good as Opus.
I think it's most fair to compare the plain token pricing that is used by everyone.
Pro tip: You could use a multi-modal model to verify images as a subagent spawned by GLM 5.2, to get around this issue.
Of course, with a software engineer at the helm - the models are going to be able to be guided to produce much better output. (Or worse, depending on the engineer!)
Guardrails/conventions should be enforced in linters, formatters, static analysis tooling; not specs/prompts.
I think the test you describe (collaborative, task delegation, task completion, TTD, steerability) is a great format for a future test that I will definitely try out.
"Well obviously you provided better follow-up prompts to the one that came out better."
Also, nothing about human-provided plan files and guardrails precludes the one-shot benchmark test. Heavens, I almost said "real coding," but in "real agentic program creation" you'd obviously be doing multi-turn interaction with the agent, but how can you provide a fair test when the model's output n determines your n+1 response?
To really evaluate how a model is to use in real life, it should have access to tools, and be able to iterate on something, like they do when you use them in an agent harness.
None of that iteration necessarily needs a human driving it (although if you're building something you want to be able to maintain, you probably need a human driving the design and architecture). You can just let the model do a couple of tries and give it input into how it's doing, and you get something closer to how people use these models in reality.
Right, model intelligence defines the scope of things they can one shot
I also suspect that users naturally calibrate to a model's useful scope, gradually getting positive/negative feedback and gradually making their requests bigger/smaller than before
Except there is no evidence of this at all, just people comparing API and subscription pricing. The leaked financial info for OpenAI shows inference is profitable right now, though it does not show a distinction between subscription and API revenue... but if subscription revenue was so lossy, it would be hard for total inference to still be profitable.
I just think that as of today, most people will not find a good reason to switch to GLM.
As a consumer, yes, it's totally fair. All that matters to me is the price I pay at the pump, not whether that price is "real" or not.
Anthropic have claimed they expect their first profitable quarter this year -- they may have bigger margins on their raw API than you realise.
The agentic engineering paradigm is just a narrative trend pushed by AI companies to get people to 10x their token consumption per prompt. It plays into people's laziness and addiction to dopamine too, causing addict-like behavior in people who fall prey to this trend.
Appreciate you sharing the results of your tests though!
Thinking about it, I would say that the majority of agentic work I do, by a long shot, is subagents which are launched from the main session, using a prompt of its choosing. Those could be considered short versions of these fully autonomous tasks.
Similar to how ML was all the hype about 12 years ago and then it submerged again for a couple of years.
If I do that, I'm literally slower than just doing the change without sufficiently specifying it to the model.
I can see how a junior dev or generally someone that's not particularly knowledgeable about the language or framework they're working with may benefit from such usage, but for experienced people there is very little value in that approach.
I say this because I've just had to face this decision this month, with Copilot introducing usage-based billing. I attempted to scale back my usage, first with non-Opus models: output essentially became discardable as it continually hallucinated non-existent fields in the responses of APIs etc. Then by scoping the changes smaller and smaller, until I ultimately gave up and reduced usage to just generating tests.
https://aibenchy.com/compare/anthropic-claude-opus-4-8-mediu...
But, it produces solid results for a fraction of the price. Worth checking out if you have the time.
One of my go-to "tests" of new frontier models is having them rebuild a programming language from scratch. For GLM 5.2 I had it rebuild the old Rebol language in Rust:
https://github.com/mhs/rebol-clone-glm-5.2
It did a fairly good job roughing in the language for a low token cost.
So, 8000$, plus it's unavailable. 3 years of Codex/Opus subscription.
> API prices
Which are irrelevant for 200$ Codex/Opus plans that are several times cheaper.
The GLM game was completely broken. The Opus game was OK at first glance, but also had bugs.
Different models with different costs produced different, imperfect results. How is that “close”? :)
Also on costs: GLM burns more tokens on average vs Opus. GPT5.5 burns fewer, surprisingly.
"Build a 3D platformer game from scratch, in raw WebGL, with no game engine or 3D library" would be a zero-shot prompt.
If it builds a UI and can't look at it, it's asking ls whether the app looks right.
Capability per dollar is something I care about:
Opus API $5/$25
Sonnet API $5/$15
Haiku API $1/$5
GLM 5.2 API $1.4/$4.4
So you're really getting near Opus-level capability for the price of Haiku.
I'm not sure what exactly triggers it, but it seems to happen when it has to look at lists of countries. I suspect there must be at least one country name that triggers the safety guardrail.
You'd expect GLM to balk at something like Taiwan, but so far, it hasn't.
This implies Opus was potentially much (?) better value.
GLM cost a quarter but Opus was twice as fast. So we are already at GLM actually costing half when you compare on time, without even considering the extra effort and time it would take to get Opus-par results.
It's good to have cheaper options and very impressive to see the Chinese continue to set open standards in this field, but the article is maybe a little over-generous.
We have come a long way, and very clearly have a long way yet to go.
A better way would be to use https://github.com/openbmb/MiniCPM-V
I like it, but the lite plan ate 22% usage of my 5h reset window in a single session after 2 prompts on xhigh of GLM 5.2 [1m]
Result was satisfactory, I think stuff is decent, I'm happy to use either, wish there was a combined subscription plan where I could get both
You make a very strong claim at the end that the hype is mostly real, and making it clear to what extent your claim holds should help the reader.
GLM-5.2 just came out, and it's another step forward for what open models can do. The internet promptly freaked out, and it's hard to tell what's real and what's hype.
So we ran it head-to-head against Claude Opus 4.8: same one-shot prompt, build a 3D platformer in raw WebGL from scratch. Here's our take after running the test and digging through the benchmarks and the buzz.
We're not switching our main off Opus. In our test Opus was faster and shipped a cleaner, more correct game, and it can check its own visual output, which the text-only GLM-5.2 can't. But GLM-5.2 earns a permanent spot in the arsenal: it's a genuinely capable model at a fraction of the price, and because it's open weights, it'll always be available. A closed model can be retired or restricted with little warning (Fable was a recent reminder); weights you can download can't be taken away.
You can play both games right now, or grab the source:
Both are browser games written from scratch, with no game engine or 3D rendering library like Three.js. The 3D models are free CC0 assets from Kenney.
Here's how the two runs compared:
| Metric | GLM-5.2 (Pi/OpenRouter) | Opus (Claude Code) |
|---|---|---|
| Wall-clock build time | 1h 10m 40s | 33m 30s |
| Output tokens | 131,000 | 216,809 |
| Peak context window | 16% of 1M | 19% of 1M |
| Tool calls | 128 | 153 |
| Cost | $5.39 (real billed) | ~$21.92 (estimate, list pricing) |
GLM-5.2 cost a fraction as much. Opus finished in half the time and shipped a cleaner game.
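To put numbers on "a fraction" and "half the time", here is a minimal sketch that derives both ratios from the table above. The variable names are ours, and the Opus figure is the list-price estimate from the table, not a billed amount.

```python
# Figures copied from the results table above.
glm_seconds = 1 * 3600 + 10 * 60 + 40   # 1h 10m 40s
opus_seconds = 33 * 60 + 30             # 33m 30s
glm_cost = 5.39                          # USD, real billed
opus_cost = 21.92                        # USD, estimate at list pricing

time_ratio = glm_seconds / opus_seconds  # how much longer GLM-5.2 ran
cost_ratio = opus_cost / glm_cost        # how much more Opus cost

print(f"GLM-5.2 took {time_ratio:.2f}x as long as Opus")
print(f"Opus cost {cost_ratio:.2f}x as much as GLM-5.2")
```

On these figures GLM-5.2 ran a little over twice as long, and Opus cost roughly four times as much.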
On paper, the benchmarks put GLM-5.2 just behind the top closed models, and the online buzz is a mix of genuine signal and astroturf. We get into both below, after the game.
GLM-5.2 is Z.ai's latest flagship model. It's open weights under an MIT license, so you can download it, run it yourself, or call it through Z.ai's API.
It's built for long-horizon tasks, the kind of long, multi-step coding-agent work that runs for hours. It ships with a 1M-token context window and two thinking effort levels, High and Max, that trade speed for capability.
Note: GLM-5.2 is text-only, not multimodal. It can't read images, so workflows built around screenshots or diagrams still need a model like Claude Opus.
Z.ai positions it roughly between Claude Opus 4.7 and 4.8 at similar token usage. Here's their announcement, if you want to read more:
Because it's open weights, GLM-5.2 is cheap. Through an API it costs a fraction of Opus, and you can run it yourself for free if you have the hardware.
Pricing, per 1M tokens (vendor docs):
| Model | Input | Cache read | Output |
|---|---|---|---|
| Claude Opus 4.8 | $5 | $0.50 | $25 |
| GLM-5.2 | $1.4 | $0.26 | $4.4 |
On output tokens, GLM-5.2 is less than a fifth the price of Opus.
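The per-token gap is not the same in every column, so the saving on a whole run depends on the token mix. The sketch below is a rough calculator built only from the pricing table above; the `run_cost` helper and the example workload are hypothetical, chosen for illustration.

```python
# USD per 1M tokens, copied from the pricing table above.
PRICES = {
    "Claude Opus 4.8": {"input": 5.00, "cache_read": 0.50, "output": 25.00},
    "GLM-5.2": {"input": 1.40, "cache_read": 0.26, "output": 4.40},
}

def run_cost(model, input_tokens, cache_read_tokens, output_tokens):
    """Return the list-price cost in USD for one run (hypothetical helper)."""
    p = PRICES[model]
    total = (
        input_tokens * p["input"]
        + cache_read_tokens * p["cache_read"]
        + output_tokens * p["output"]
    )
    return total / 1_000_000

output_ratio = PRICES["GLM-5.2"]["output"] / PRICES["Claude Opus 4.8"]["output"]
cache_ratio = PRICES["GLM-5.2"]["cache_read"] / PRICES["Claude Opus 4.8"]["cache_read"]
print(f"output price ratio: {output_ratio:.1%}")
print(f"cache-read price ratio: {cache_ratio:.1%}")

# Hypothetical workload: 2M fresh input, 10M cache reads, 200k output tokens.
for model in PRICES:
    print(model, round(run_cost(model, 2_000_000, 10_000_000, 200_000), 2))
```

Cache reads are only about half the price of Opus's rather than a fifth, so a run that is heavy on cached context narrows the overall gap.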
The weights are on Hugging Face and ModelScope under an MIT license, with no regional restrictions. You can serve it locally with frameworks like vLLM, SGLang, or Transformers.
To cut through the vibes, we gave Opus 4.8 and GLM-5.2 the same one-shot prompt: build a 3D platformer game from scratch, in raw WebGL, with no game engine or 3D library.
A model can zero-shot a good-looking landing page, and the community already discounts that as a test of much. A 3D platformer in raw WebGL can't be faked in one pretty file. It has real structure: a GLB model parser, matrix and vector math, GLSL shaders, skinned skeletal animation, a fixed-timestep loop, collision, a follow camera.
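For readers who haven't built one, a fixed-timestep loop is the piece that keeps physics independent of frame rate. This is a generic sketch of the accumulator pattern, not code from either model's game; the 60 Hz step, the clamp value, and all names are our assumptions, and it is written in Python for brevity even though the games run in the browser.

```python
FIXED_DT = 1.0 / 60.0    # simulation step in seconds (assumed 60 Hz)
MAX_FRAME_TIME = 0.25    # clamp so one long stall cannot queue endless catch-up steps

def advance(accumulator, frame_time, step):
    """Consume frame_time in fixed steps; return (leftover time, steps run)."""
    accumulator += min(frame_time, MAX_FRAME_TIME)
    steps = 0
    while accumulator >= FIXED_DT:
        step(FIXED_DT)
        accumulator -= FIXED_DT
        steps += 1
    return accumulator, steps

# Example: a falling body always integrated with the same dt.
state = {"y": 10.0, "vy": 0.0}

def physics_step(dt):
    state["vy"] -= 9.8 * dt
    state["y"] += state["vy"] * dt

acc = 0.0
for frame_time in (0.016, 0.040, 0.020):   # uneven render frames
    acc, steps = advance(acc, frame_time, physics_step)
    print(f"frame of {frame_time:.3f}s ran {steps} physics step(s)")
```

The renderer can draw as often as it likes; only whole `FIXED_DT` slices ever reach the simulation, and the leftover time carries into the next frame.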
That structure tests both things people argue about at once. Holding a layered, multi-file build together over many steps is the agentic part, where GLM-5.2 is meant to be strong. Getting the engine internals right, the parts that look fine but quietly break, is the reasoning-and-taste part, where Opus is meant to pull ahead.
We bundled the 3D assets locally, so the test is the engine and the rendering, not whether the harness can fetch a model file. The art itself is a human-made asset pack, Kenney's CC0 Platformer Kit, and both agents were handed the identical files.
To finish, each model had to build:
Both did most of it by hand (by tool? by claw?): a GLB binary parser, the matrix and quaternion math, a WebGL2 renderer with GLSL skinning shaders, and substepped AABB collision to keep the character from tunneling through platforms.
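Substepping is what stops a fast-moving character from skipping past a thin platform between two frames. The sketch below shows the general idea under our own assumptions: the `AABB` class, `move_substepped`, and the `max_step` value are hypothetical names for illustration, not taken from either model's source.

```python
import math
from dataclasses import dataclass

@dataclass
class AABB:
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner

    def overlaps(self, other):
        # Boxes overlap only if their extents overlap on all three axes.
        return all(
            self.lo[i] < other.hi[i] and self.hi[i] > other.lo[i]
            for i in range(3)
        )

    def moved(self, delta):
        return AABB(
            tuple(a + d for a, d in zip(self.lo, delta)),
            tuple(a + d for a, d in zip(self.hi, delta)),
        )

def move_substepped(box, delta, solids, max_step=0.125):
    """Move box by delta in slices no longer than max_step; stop before the first hit."""
    length = math.sqrt(sum(d * d for d in delta))
    steps = max(1, math.ceil(length / max_step))
    piece = tuple(d / steps for d in delta)
    for _ in range(steps):
        candidate = box.moved(piece)
        if any(candidate.overlaps(s) for s in solids):
            return box, True   # blocked: keep the last free position
        box = candidate
    return box, False

platform = AABB((-2.0, 0.0, -2.0), (2.0, 0.2, 2.0))   # thin platform
player = AABB((-0.3, 1.0, -0.3), (0.3, 2.0, 0.3))     # feet at y = 1.0
fall = (0.0, -3.0, 0.0)                                # one very fast frame

naive = player.moved(fall)
print("single-step move ends inside platform:", naive.overlaps(platform))
landed, hit = move_substepped(player, fall, [platform])
print("substepped move hit the platform:", hit, "feet at y =", landed.lo[1])
```

The single-step move lands entirely below the platform and never registers a hit, while the sliced move is stopped just above it. A fuller controller would then resolve the remaining gap per axis rather than simply halting.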
Both got the same prompt, the same assets, and one attempt with no hints. We ran Opus 4.8 with extended thinking on high, and GLM-5.2 with thinking set to high (GLM-5.2 also has a higher Max tier we didn't use). You can dig into both runs yourself:
Opus 4.8 built in Claude Code; GLM-5.2 built in Pi over OpenRouter.

Side-by-side timelapse. Opus finishes at about 34 minutes, GLM-5.2 at about 1 hour 11 minutes.
The timelapse shows the whole build compressed: Opus working through it in roughly half the wall-clock time, GLM-5.2 grinding longer but for far less money. The full numbers are in the results table at the top.
We played both games start to finish. Here's how each one held up.
Both built the same kind of game: a third-person 3D platformer with the same controls. You move with WASD or the arrow keys, jump with space, sprint with shift, and orbit the camera by dragging the mouse, with the wheel to zoom. The goal is the same too: collect the coins across the platforms and reach the flag, avoiding a spike hazard, with a fall off the world sending you back to the start.
GLM-5.2's game looks kind of rough. From the playthrough:
So it's not that great. It did nail one thing, though: the spring.

GLM-5.2 spring launch.
You can jump on the spring and launch up to the next platform.
Opus's game is cleaner, and plays well. From the playthrough:
The animations look good and run smoothly, with textures applied properly.

Opus: animations, textures, controller working.
Both models were told to verify their work before stopping. One common way an agent does this is to take a screenshot of the finished product and look at it, to check that nothing is broken or missing. That is exactly what Opus did in its session.
GLM-5.2 hit a problem here, because it can't read images. It isn't multimodal. So instead of looking at a screenshot, it fell back on a hacky workaround: it wrote scripts to read the raw pixel data and check whether the colors came out roughly as expected.
Here's an excerpt from its final report, where it "analyzed" the saved image by sampling colors:
final_start/overview/flag.png analyzed for color: grass green, dirt brown, coin gold, flag red, character bluish, half-Lambert lit, no black
The colors it expected were there, so it confirmed the game was finished and stopped. But as you can see in its own final screenshot below, the character is a flat gray with its textures missing, and the debug overlay is still sitting over the scene. An agent that could actually look at the screenshot would likely have caught both, and gone back to fix them.
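We don't have the scripts GLM-5.2 wrote, so the following is a hypothetical reconstruction of the general technique rather than its actual code. It checks whether each expected color appears anywhere in a frame, using made-up reference colors and a tiny stand-in pixel list, and shows how such a check can pass on a frame that is visibly wrong.

```python
def near(pixel, target, tol=40):
    """True if every RGB channel is within tol of the target."""
    return all(abs(p - t) <= tol for p, t in zip(pixel, target))

def colors_present(pixels, expected, tol=40):
    """Map each expected color name to whether any pixel is close to it."""
    return {
        name: any(near(px, rgb, tol) for px in pixels)
        for name, rgb in expected.items()
    }

EXPECTED = {  # hypothetical reference colors
    "grass green": (90, 180, 70),
    "dirt brown": (140, 90, 50),
    "coin gold": (240, 200, 40),
    "flag red": (220, 40, 40),
}

# Stand-in "frame": scenery colors are present, but the character pixels
# are flat gray and a white debug overlay is drawn on top.
frame = [
    (92, 178, 72), (138, 92, 48), (238, 198, 44), (221, 38, 41),
    (128, 128, 128), (128, 128, 128),   # untextured character
    (255, 255, 255), (255, 255, 255),   # debug overlay text
]

report = colors_present(frame, EXPECTED)
print(report)
print("check passed:", all(report.values()))
```

A presence check like this answers "do these colors exist somewhere in the image?", not "does the image look right?". Nothing in it would flag unexpected gray or overlay pixels.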

GLM-5.2's final screenshot: textures missing on the character, debug overlay still on. It never saw the frame.
On a task with a visual result, being able to understand an image gives a model a real edge over one that can't.
Opus is multimodal, so it could read a screenshot directly. Its harness rendered the game and captured a frame, and Opus inspected that image as part of its verification. Here's an excerpt from its session, describing what it saw:
The final scene renders correctly: grass-topped blocks with brown dirt sides, the staircase climbing up, gold/silver coins and a jewel, the blue spike-block hazard on the right island, the red flag at the top goal, the character […] standing on the start plaza, and the score HUD. Lighting and shading are correct, geometry is clean.

Opus's screenshot: clean HUD, debug readouts removed.
Because it could see the frame, Opus noticed the debug readouts it had left on screen and cleared them before finishing.
Both games had bugs. Here's what broke in each.
GLM-5.2's bugs were frequent and visible, and several were fundamentals.
The character faces the wrong way. It walks in the right direction, but the model is turned backwards the whole time.

Missing textures and a disappearing head. The character renders flat gray instead of textured, and its head vanishes whenever the camera moves. The Kenney models point to a shared color palette in a separate file rather than embedding it, and GLM-5.2's renderer never loaded that file, so it fell back to flat colors. Opus loaded the palette, so its character came out textured.

The death spike doesn't kill. The character lands right on a spike hazard and nothing happens. No death, no reset.

Opus's were fewer and subtler, edge cases rather than broken basics.
Standing on thin air. The character can sit beside a platform, in mid-air, without falling. This is its coyote-time grace period, the brief window where you can still jump just after stepping off an edge, tuned a little too generously. A polish feature slightly overdone, not a broken fundamental.
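Coyote time is usually a single tunable number, which is why it is easy to overdo. Here is a minimal sketch of the jump-permission part of the mechanic; we have not inspected Opus's implementation, so the `JumpState` class, the window values, and the 1/64 s step are all our assumptions.

```python
from dataclasses import dataclass

@dataclass
class JumpState:
    coyote_window: float              # seconds of grace after leaving the ground
    time_since_grounded: float = 0.0

    def update(self, dt, grounded):
        # Reset the timer while on the ground, otherwise let it run.
        if grounded:
            self.time_since_grounded = 0.0
        else:
            self.time_since_grounded += dt

    def can_jump(self):
        return self.time_since_grounded <= self.coyote_window

def frames_of_grace(window, dt=1 / 64, frames=60):
    """Count airborne frames during which a jump is still allowed."""
    state = JumpState(coyote_window=window)
    state.update(dt, grounded=True)
    allowed = 0
    for _ in range(frames):
        state.update(dt, grounded=False)
        if state.can_jump():
            allowed += 1
    return allowed

print("tight window (0.1 s):", frames_of_grace(0.1), "frames")
print("generous window (0.5 s):", frames_of_grace(0.5), "frames")
```

The wider the window, the longer the game keeps treating an airborne character as if it had just left the ledge. If the same flag also gated gravity or the grounded state, the character would appear to hover, which is one plausible way to get the behavior described above.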

Winning from too far away. The win triggers while the character is still well short of the flag.

Both models built a complete, running 3D platformer from scratch, no engine and no 3D library, in a single pass. That is a high bar, and not long ago neither would have cleared it. Here is how they split.
GLM-5.2 took over twice as long and shipped a rough game: a gray untextured character, a spike that doesn't kill, no working win condition, and a debug overlay still on screen at the end. Most of its bugs were fundamentals. It cost about a quarter as much.
Opus finished in half the time and shipped the cleaner, more correct game. Its bugs were edge cases, not broken basics. It cost roughly four times as much.
Opus can read images, so its self-check looked at the rendered game and caught visual problems. GLM-5.2 is text-only: it verified through numbers and never saw that its character was gray or that its debug overlay was still up. On a visual task, that was the difference between catching the rough edges and shipping them.
One game is one data point. The benchmarks below test the same kinds of ability at scale.
Z.ai published these benchmark numbers alongside the release, on its model card. The best result in each row is in bold.
| Benchmark | GLM-5.2 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Reasoning | ||||
| HLE | 40.5 | 49.8* | 41.4* | 45 |
| HLE (w/ tools) | 54.7 | 57.9* | 52.2* | 51.4* |
| AIME 2026 | 99.2 | 95.7 | 98.3 | 98.2 |
| GPQA-Diamond | 91.2 | 93.6 | 93.6 | 94.3 |
| IMOAnswerBench | 91.0 | 83.5 | – | 81 |
| Coding | ||||
| SWE-bench Pro | 62.1 | 69.2 | 58.6 | 54.2 |
| NL2Repo | 48.9 | 69.7 | 50.7 | 33.4 |
| DeepSWE | 46.2 | 58 | 70 | 10 |
| ProgramBench | 63.7 | 71.9 | 70.8 | 39.5 |
| Terminal Bench 2.1 (Terminus-2) | 81.0 | 85 | 84 | 74 |
| Terminal Bench 2.1 (best harness) | 82.7 | 78.9 | 83.4 | 70.7 |
| SWE-Marathon | 13.0 | 26.0 | 12.0 | 4.0 |
| Agentic | ||||
| MCP-Atlas (public) | 76.8 | 77.8 | 75.3 | 69.2 |
| Tool-Decathlon | 48.2 | 59.9 | 55.6 | 48.8 |
An independent run by Artificial Analysis broadly agrees:
The numbers track our test: GLM-5.2 leads the open-weights pack and trades blows on reasoning, but Opus still takes most of the coding and agentic rows.
These benchmarks span three areas. Here's what each one tests, grouped the same way as the table.
Reasoning, hard math and science exams:
Coding, fixing bugs and building whole projects:
Agentic, calling and chaining real tools:
Benchmarks and our own test are one thing; the online reaction is another. A lot of it is hype from accounts with no track record, so we stuck to people and groups whose judgment has held up over time.
Simon Willison has written up nearly every notable model release for years. He called GLM-5.2 "probably the most powerful text-only open weights LLM."
His standard test is to ask a model for an SVG of a pelican riding a bicycle. GLM-5.2 returned a fully animated one with nothing broken, which he called "very impressive."

A second test, an opossum on a scooter, came out worse than GLM-5.1 had managed a version earlier. So it's strong, but not uniformly.
Artificial Analysis, an independent benchmarking group, ranked GLM-5.2 the leading open-weights model on their Intelligence Index. It scored 51, ahead of MiniMax-M3, DeepSeek V4 Pro, and Kimi K2.6, and sits on their cost-versus-intelligence frontier as the cheapest model at its level.
They flag the same thing we ran into: it's token-hungry. It uses about 43k output tokens per task, most of it reasoning, more than any other leading open model they measured.
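Token hunger eats into the price advantage, so it helps to see how much. This back-of-the-envelope sketch looks at output tokens only, ignoring input and cache reads, and uses the vendor list prices quoted earlier in this article; the variable names are ours.

```python
GLM_OUTPUT_PRICE = 4.4         # USD per 1M output tokens (vendor list price)
OPUS_OUTPUT_PRICE = 25.0       # USD per 1M output tokens (vendor list price)
GLM_TOKENS_PER_TASK = 43_000   # Artificial Analysis figure quoted above

glm_cost_per_task = GLM_TOKENS_PER_TASK * GLM_OUTPUT_PRICE / 1_000_000

# Output tokens Opus could spend per task for the same output cost.
breakeven_tokens = glm_cost_per_task / OPUS_OUTPUT_PRICE * 1_000_000

print(f"GLM-5.2 output cost per task: ${glm_cost_per_task:.4f}")
print(f"Opus matches that cost at about {breakeven_tokens:,.0f} output tokens")
```

On output price alone, GLM-5.2 stays cheaper per task unless the competing model needs fewer than roughly a sixth as many output tokens.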
Nathan Lambert tracks open-weight models for a living at the Allen Institute for AI. Looking at where GLM-5.2 lands on the LMArena leaderboards, he said that "you could argue they have a better agent than Gemini does," and called it "a serious accomplishment" for an MIT-licensed open model.
His wider point is that the Chinese labs are reaching these scores on far less compute, and shouldn't be discounted, even if the top US models still lead overall. That matches our test, where Opus came out ahead but GLM-5.2 was closer than its price and openness would suggest.
So, is the hype real? Mostly.
GLM-5.2 is a genuinely strong open model, at a fraction of Opus's price. For a lot of work, that combination is hard to beat. But it isn't Opus. In our test, Opus was faster, shipped a cleaner and more correct game, and could check its own work by looking at it. GLM-5.2 was far cheaper, but rougher, and it's text-only.
Use GLM-5.2 when cost and openness matter and the work is mostly text and logic. Use Opus when correctness, polish, and visual judgment matter, and you'll pay for it. And keep GLM-5.2 in the arsenal regardless: it's the rare frontier-adjacent model that no vendor can take away from you.
I also use MiniMax-M3 in utility roles like explore/library tasks.
I’ve had a z.ai subscription for several months so I’m on the older pricing. I’m really not sure it would make sense to do this at current rates - I could bump my Codex plan instead.
If I recall, that model had a couple issues. One was the issue of being monkeyed with, for which they gave everyone credits.
The other feature/bug, depending on your POV, was being Anthropic's least personable release, not papering over everything with self help guru therapy language.
Opus 4.6 didn't LARP. It was more direct, less fussy, less discussy, and very much less "wait, one more thing" within a couple edits after embarking on what should have been the spec, than 4.7 or 4.8 are.
When in engineer brain mode, working as you describe (good old fashioned XP-style staff engineer pair programming with a language-savvy mentee not yet full-stack or system wise), I found the clearer I was about my goal and the better I could express it, the more often I'd get an expanded clarified response I could then iterate to steer for ever tighter cleaner more specified responses, then let it go build the whole thing without it agonizing and waffling.
The next two releases regressed on that dimension, wanting to figuratively "sit with" every decision and re-validate spiritual alignment along the way, no matter how clearly expressed.
Curiously to me, Fable seemed to hit the best of both worlds, I had the highest commit per turn with Fable, approaching 73%, where I'm usually under 17% of LOC written being good enough to commit, usually taking 9 - 11 turns to get the code where I'm comfortable with it.
Thanks to this, Fable cost more, but actually cost less, if that makes sense.
Arguably, Fable, and 4.6, played more outcome-correctness oriented than journey-experience oriented. It's easy to see how this could happen with human reinforced learning if not all judges are staff or principal engineer level, or constitution values are more Portlandia than Finlandia.
ANTHROP\C needs to balance these at the constitution level:
“We will work in a humane and thoughtful way, but production is the final judge. We will listen to people, but we will not let discussion replace decision. We will value craft, but not at the expense of usefulness. We will move fast, but not by hiding risk. We will measure outcomes, but not pretend that everything important is easy to measure.”
GLM-5.2 actually has really good intent understanding though, on par with GPT-5.5 and Opus from my experience.
Part of me wants to believe they really do care about protecting the world from... something... I don't know quite what exactly tbh... but it must be costing them a small fortune to scan each input and output against N guardrails and they are a for-profit corporation who could easily turn a blind eye to all of this and simply say "what you do with this model is on you" like I would expect most corporations to.
Strange times.
They picked a task that heavily favors a model that can do multi-modal with images, and GLM still came within striking distance.
What I'm hearing from this article is that the next generation of open models that includes better multi-modal support are basically no-brainers for adoption.
Seems like a HUGE win for Z.ai and open models in general here.
you go to OpenRouter and pay
$0.98 / $3.08 per 1M for GLM 5.2 vs $5 / $25 per 1M for Opus.
GLM 5.2 gives you OPTIONALITY so you can run it locally, but you can still just pay somebody for it.
Coupled with a local Headroom (https://github.com/headroomlabs-ai/headroom) you'll be able to use a LOT without hitting your 5h window :)
Definitely the best $ value for me considering the reasonable performance of GLM5.2.
They provide a rolling window quota, so you're never really out of quota contrary to other providers, you can adjust day to day.
Check it out if interested : https://synthetic.new/?referral=kwjqga9QYoUgpZV
---
Docs & all models : https://dev.synthetic.new/docs/api/models
From there I collected the following US providers currently serving GLM 5.2:
- Together (https://www.together.ai/models)
- Fireworks (https://fireworks.ai/models)
- Featherless (https://featherless.ai/models)