GLM 5.2 beats Claude in our benchmarks

I have taken another look at these open models after the fiasco of Fable and GPT 5.6 this weekend and... GLM-5.2 truly is a good workhorse model for daily programming. I consider myself a heavy user of LLMs and a seasoned developer. A typical session for me with GPT is usually over a hundred dollars...

This weekend I programmed a matrix bot with encryption and a Rust agent with some tools. Because I need one and OpenClaw just felt... not what I wanted. Two days later and 20 dollars poorer, I have what I need: a multimodal agent written in Rust that has access to my homelab.

Nothing felt off with GLM. It did what I wanted, was fast, had a decent, not very annoying personality, and was much cheaper than Opus or GPT.

I used it unquantized through Fireworks, but there are multiple other providers too.

I added GLM 5.2 to my security bug hunting benchmark when it came out, and found it to be a good performer, but not the best open model. The benchmark tests whether models can find bugs Mythos found. The best open models in the initial benchmark were DeepSeek V4 Pro and MiMo 2.5 Pro. But it turned out MiMo got lucky: it has performed worse on almost every test I've done since, while DeepSeek has consistently been among the best performers, and its extreme caching performance makes it cheaper than just about anything, including much smaller models.

https://swelljoe.com/post/will-it-mythos/

Also of note: I found that giving models access to the open source semgrep as a tool makes some perform worse and none perform better. It's plausible there's a way to wire it up in a harness that presents useful information to the model without the model having to know how to use it. My theory is that semgrep isn't heavily represented in the training data, so you're asking the model to do two things at once, figure out how to use semgrep and find security bugs, and both tasks suffer for the lack of focus. Most small models, and some big models, can't do that well.
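To make the "wire it up in a harness" idea concrete, here is a minimal sketch, not something the benchmark above actually does. It assumes the `semgrep` CLI is installed and that its `--json` report has a `results` list with `check_id`, `path`, `start.line` and `extra.message`/`extra.severity` fields. The function names and the hint wording are made up for illustration. The harness runs semgrep itself and hands the model plain-text hints, so the model never has to drive the tool.

```python
import json
import subprocess


def run_semgrep(target_dir: str) -> dict:
    """Run the semgrep CLI on a directory and return its parsed JSON report."""
    # Assumption: the `semgrep` binary is installed and on PATH.
    proc = subprocess.run(
        ["semgrep", "scan", "--config", "auto", "--json", target_dir],
        capture_output=True,
        text=True,
        check=False,  # do not raise on a non-zero exit; just parse stdout
    )
    return json.loads(proc.stdout)


def summarize_findings(report: dict, limit: int = 20) -> str:
    """Turn a semgrep JSON report into short plain-text hints for a prompt."""
    lines = []
    for finding in report.get("results", [])[:limit]:
        extra = finding.get("extra", {})
        line_no = finding.get("start", {}).get("line", "?")
        lines.append(
            f"- {finding.get('path', '?')}:{line_no} "
            f"[{extra.get('severity', 'UNKNOWN')}] {finding.get('check_id', '?')}: "
            f"{extra.get('message', '').strip()}"
        )
    if not lines:
        return "Static analysis reported no findings."
    header = "Static analysis hints (unverified, may be false positives):"
    return header + "\n" + "\n".join(lines)
```

The output of `summarize_findings(run_semgrep("path/to/repo"))` would be prepended to the bug-hunting prompt. Whether that actually helps is exactly the open question; this only removes the "learn semgrep" half of the task.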

Edit: But, also, more testing is ongoing. I suspect GLM 5.2 will also be a consistently strong performer. It seems to excel at most things I've tested on it.

Has anyone compared the cost of maxing out a Claude Max x5 subscription (the one for €120 a month) with the same amount of work on GLM 5.2 via API at $4 per million output tokens?

I have a feeling Anthropic may still come out cheaper (mainly thanks to enterprises subsidising the Max subscriptions).

But I'm very excited with the possibility of using fully EU based inference rivalling Opus in quality.
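For a rough feel of where the break-even sits, here is a back-of-the-envelope sketch. It only counts output tokens at the $4 per million figure quoted above and ignores input tokens and caching, and the exchange rate is a parameter you have to supply (the default of 1.0 simply treats EUR and USD as equal, which is a simplification). The function name is made up for illustration.

```python
def break_even_output_tokens_millions(
    subscription_per_month: float,
    api_price_per_million_out: float,
    fx_rate: float = 1.0,
) -> float:
    """Millions of output tokens per month at which API spend equals the subscription.

    fx_rate converts the subscription currency into the API currency.
    Input tokens and cache discounts are deliberately ignored here.
    """
    return subscription_per_month * fx_rate / api_price_per_million_out


# EUR 120 subscription vs. $4 per million output tokens, EUR treated as USD.
print(break_even_output_tokens_millions(120, 4.0))  # 30.0
```

Under those assumptions the subscription only wins if you would otherwise generate more than about 30 million output tokens a month. Real usage is dominated by input and cached tokens, so treat this as a starting point, not an answer.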

Apparently GLM 5.2 is 753B parameters [1], what kind of hardware are people using to run this locally?

[1] https://huggingface.co/zai-org/GLM-5.2

Are open labs just loss leaders backed by Chinese govt? Is this like electric cars where the goal is to flood the market with good enough quality for free so they end up dominating the market?

Or is there a business model I’m missing?

There's no question to me, after trying both, that Fable is much better than GLM-5.2 when left alone in front of hard coding tasks. Now, maybe what plateaus is the efficiency of human collaboration, because at some point it will be bottlenecked by the human.

Thus companies who still try to have humans perform intertwined work with their AI won't see an improvement, while the ones who find the right conditions to give their AI more free rein will see it.

Kind of like it's no use having a workhorse pull a combine harvester: at some point, when machines reach sufficient efficiency, you just give wheels to the harvester and let it run.

These numbers seem pretty low compared to what I was able to achieve specifically around the Windows kernel, win32k<->win32u to be exact. It honestly wouldn't surprise me anymore if China started surpassing the models that the US makes public, at least in specific categories such as cyber.

GLM 5.2 is already capable enough to assist in self-training, which is similar to what we saw happen with frontier models, and they appear to be getting there at a significantly lower cost than OpenAI/Anthropic.

I hope someone is also building a Claude Design competitor. One that is similarly HTML based instead of the Figma/Magic Patterns approach.

I have more vendor lock-in with Design than I do with Code, and will switch over as soon as Claude loses the smallest technical advantage.

GLM export controls incoming? I predict Commerce will force OpenRouter, HuggingFace to take some open models down within the next few months.

Not that it would make any sense.

> [...] beating Claude Code (32%) at roughly $0.17 per vulnerability found

Claude Code is an agent harness, not an LLM.

Claude is a brand (or group of LLMs), not an LLM.

Title is misleading (and is editorialized from the actual article title). GLM 5.2 did better than Claude in one specific cybersecurity-related benchmark (finding vulnerabilities of one certain type). I don't think you can draw any general conclusions about the relative utility of the two models.

I use GLM 5.2 via Neuralwatt and it's gotten so cheap I wouldn't mind cancelling my personal Claude subscription if work gave me one. I've spent 374M tokens this month and it only cost me $18 on energy-based pricing.

It reads like an ad.

Secondly these are "just" IDORs, arguably the easiest class of vulnerabilities.

Thirdly it compares to GPT 5.5 and Opus 4.8.

No, we don't have Mythos at home.

Most interesting things to me from their benchmarks:

GPT does way worse than Opus without their harness, but better with it.

Opus 4.7 and 4.8 do way worse than 4.6. (Intentional nerfing?)

Would have been interesting to see GLM in the custom harness.

Would also be interesting to run GLM in Claude Code, which it has presumably been fine tuned on.

I used Claude a lot, but with Claude Code it takes a lot of context window, and it's very pricey, to be honest. Then I shifted towards Minimax. I used the coding plan because it's cheaper, but it still gets the job done. When M3 came out, I started using it, and it was actually really good. After that, I shifted towards OpenCode for my AI agent, and that's been really good as well. The best thing I realized is that it uses less context, works better, and gives me access to a lot of different models from one place. I never actually used GLM, but I recently found QuanCode, which is amazing. I used it to build a full-stack application. Now I'm shifting my focus more toward SaaS distribution. I'm still figuring out how to automate different workflows, and using QuanCode has been really fast and effective for building those automations.

GLM-5.2 suggests long-horizon agentic work is becoming open, cheap, and deployable.

What does that mean for the frontier?

https://lifeinthesingularity.com/p/glm-52-proves-ai-comes-fo...

They should also at least run Opus through the same Pydantic harness they used for GLM. As is, it's apples vs pears.

Where's the cost per vulnerability for all the other models than GLM?

Also, without code this isn't very trustworthy. Could all be made up as well.

Does a bit worse than Opus 4.8 in my tests[0], but it's 5x cheaper and 3x slower.

[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-8-mediu...

I find it astounding that ppl still comment “it’s still behind” or “it’s not the best model”. Everything is about the harness. Even the big AI labs are focusing on managing agents - sandboxes, memory, context, skills, loops. With the right harness GLM 5.2 can do no wrong.

About running models locally and why data centers win (for now): they can stream the model weights to many neural engines at the same time, so each of these only needs enough RAM to hold the KV cache. So each engine is cheaper to operate, plus they are time-shared, resulting in massive wins for data centers.

So one can see businesses owning their own such cluster, next to their database infra, in the near future.

> beats Claude in our Cyber Benchmarks

Beats which model in Claude? Whenever a "benchmark" doesn't put precise model numbers in their headlines I am immediately skeptical. Either they don't know the difference (bad) or they are benchmarking against weaker models (misleading, also bad).

It's like when studies say "AI is bad at X" and they used GPT-3.5 in current year.

If it’s not quite as good as the hype yet, I expect it probably will be in the near future. For a lot of the primary coding tasks needed in most situations, it’s probably gonna be good enough, if it isn’t already. The harness will be there as well.

I am using this with a workflow of Claude Code, Codex, Kimi and GLM and the results are pretty astounding: almost 90% of the time, Claude's findings and plans are overturned with Claude's agreement.

Chinese models are almost certainly cheating on benchmarks. I would bet that if you saw the training data, the benchmark canaries would be in there.

GLM may be a good model in general, but it's benchmaxxed and definitely not as good as Opus 4.8.

I don't feel the numbers without the harness are useful.

People will use the model with the harness. I know that harness may not be optimized to this model, but it's still more useful to see the numbers from an imperfect harness than from a no harness setup.

I think one thing people are missing about this article is that they are arguing that the harness can make a bigger difference than the model. They aren't merely hyping GLM 5.2.

> Constant: the IDOR dataset (the same real, open-source applications we've used in prior research) ...

What were they? Also, wouldn't one expect a more recently released coding agent (with a more recent knowledge cutoff) to perform better because it has access to more knowledge about vulns in these OSS projects, and possibly even knowledge of your own "prior research"?

The text twice quotes Claude Code's F1 score as 32%, but the table shows 37%. It's very likely that the actual score is 32% (because it is referenced twice directly, and a third time indirectly as the difference 'seven').

Oddly, this is a strong indication of the text being hand-written rather than LLM-assisted; it's very likely that a human made a mistake in creating the table.

  > ... beating Claude Code (32%) ...

  > ... GLM 5.2 ... beat Claude Code by seven points (39% vs. 32%).


  > Rank | Configuration           | Harness         | F1
  > ...
  > 4    | Claude Code (Opus 4.6)  | Claude Code SDK | 37%

It’s hard to argue against the open weight models if your only concern is coding. Which, for many of us hackers here in this forum, it is.

But I would like to point out that the overwhelming majority of people using LLMs aren’t programmers, don’t care about coding, and couldn’t even be bothered to “vibe code”.

So we should consider the bias of the output of these open weight models, and what that looks like, outside of the context of writing code.

Here, it appears they compare a single prompt "find IDOR", against a multi-agent system. However, one can also start far more sophisticated skills that spin up subagents and mostly do the same in Claude Code, Codex, OpenCode, Pi, etc.

Which I guess makes what semgrep sells obsolete. Unless they have built a pareto-optimal point in terms of capabilities and token usage maybe?

I switch from Codex to GLM 5.2 when I'm out of tokens. The main difference for me is time to completion.

GPT gets there in under 5 minutes; GLM 5.2 without context takes ~1 hour.

Though the harness makes a significant difference. On Pi, GLM 5.2 dreams for minutes; with OpenCode it's more to the point and gets to editing quicker.

> We ran a set of popular open-source models against our IDOR benchmark.

"our IDOR benchmark", there you go.

I tried GLM many times and it is bad. I have no clue what these people are talking about.

The title of the post on their blog is really misleading: "We have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks". Mythos (or Fable) isn't even benchmarked, and there's a giant caveat literally at the bottom: "We have a caveat: This is one task, one dataset, one run."

I think the post is still informative, but a little disingenuous and clickbaity.

But it's $160/month (unless you buy a one-year plan, which gets cheaper), not too far from the $200/month for Claude and Codex. Why should I switch?

But… what effort level? “Opus 4.8” is a massive capability range. If you just ran it on medium, that is a completely different result than max.

open-weight models routinely match or even outperform previous-generation proprietary APIs

Genuinely curious. Say GLM 5.2 is better than Opus. But how does one go about using it by themselves?

Definitely a +1 from me. I've really enjoyed using it via OpenCode/Zen. Not loving the pricing with OC so will probably switch to OpenRouter once my credits are done.

Feeling proud of these open models. It's just that they need to focus on efficiency as well, especially in terms of size.

Opus 4.8 is genuinely one of the most frustrating models in casual use. It has a tendency to completely lose context in the middle of a conversation. It’s also too pedantic and nitpicky, and relies on language that’s way too specific to get any work done. I always end up being frustrated with it and revert to Opus 4.6.

Which harness do you recommend for running coding tasks with GLM 5.2?

Any good resources about this (also for setup and recommended config)?

Every agent run writes an audit record. Not for compliance theater, but because when something breaks at 2am, you need to know exactly what happened and why.

You can launch GLM-5.2 in Opencode using Nemesis8: https://github.com/DeepBlueDynamics/nemesis8#nemesis-8

After installing, do a `n8 build` to build the image, then `n8 --danger --provider opencode interactive` to launch it in a container.

Sign up for GLM-5.2 here: https://z.ai

If only the "cybersecurity" crowd were focused on patching the vulnerabilities.

Instead of shilling for the LLM providers.

GLM 5.2: super clear. GPT-5.5: super smart. Auto/Composer (Cursor): super fast.

This is because of the safeguards and not the model capabilities. If these folks signed up for the proper cyber service offered by Anthropic where refusals are removed then the open weight model wouldn't look as capable.

How do you run this thing? What kind of hardware do you need?

GLM 5.2 is a great model, but if you only want to use the best model available, it isn't there yet. Every lab releases models that memorize benchmark answers, both intentionally and unintentionally. But we consistently find that models from Chinese labs have a wider gap between public benchmarks and our evaluations, which we designed to be less vulnerable to benchmaxxing.

In multi-agent coding environments, GLM 5.2 is just shy of Opus 4.6 on average. Data at https://gertlabs.com/rankings

But when factoring in performance/cost, GLM 5.2 is the frontier model.

I'm really curious about this. Why pay API pricing? According to Claude's usage stats I burn thousands of dollars a month at API rates, but I only pay for the $100 subscription.

If you're using Matrix, consider Hermes as a harness if you haven't already. Native gateway support. I've been primarily using mine through Element and it has largely been great.

> This weekend I programmed a matrix bot with encryption and a Rust agent with some tools.

Did you program it, or did you give the order to an agent to program it?

Nice. I'm working on an agent too. How are you handling tool calls?

I followed this example

https://minimal-agent.com/

but I'm running into issues with nested backticks so I'm thinking of making dedicated close tags per tool call.
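One way the "dedicated close tags per tool call" idea could look, as a minimal sketch: the `<tool:NAME> ... </tool:NAME>` format below is hypothetical and not taken from the linked example. Because the close tag must repeat the tool name, backticks (or another tool's tags) inside the body can't end the call early.

```python
import re
from typing import NamedTuple


class ToolCall(NamedTuple):
    name: str
    body: str


# Hypothetical wire format: <tool:NAME> ... </tool:NAME>
# The backreference (?P=name) forces the close tag to match the open tag.
TOOL_CALL_RE = re.compile(
    r"<tool:(?P<name>[A-Za-z_][A-Za-z0-9_]*)>\n?(?P<body>.*?)\n?</tool:(?P=name)>",
    re.DOTALL,
)


def parse_tool_calls(text: str) -> list[ToolCall]:
    """Extract every well-formed tool call from a model response, in order."""
    return [
        ToolCall(m.group("name"), m.group("body"))
        for m in TOOL_CALL_RE.finditer(text)
    ]


fence = "`" * 3  # built at runtime so this example stays free of literal fences
reply = (
    "<tool:write_file>\n"
    f"{fence}md\nUse `ls` here\n{fence}\n"
    "</tool:write_file>\n"
    "<tool:shell>\nls -la\n</tool:shell>"
)
for call in parse_tool_calls(reply):
    print(call.name, repr(call.body))
```

This still breaks if a body contains its own literal close tag. A per-call random suffix in the tag name would cover that case, at the cost of a more complicated prompt.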

I am seeing extremely positive results with Elixir too. Previously I was on DeepSeek (deepseek-v4-pro), and GLM5.2 outperforms it easily. It's been a month since I used any native Claude models (simply because of pricing), but GLM5.2 is running me $20/day in usage on OpenRouter. I am not sure if I've misconfigured Claude Code or if this is indeed normal usage pricing. But the output more than makes up for it. However, using DeepSeek V4 Pro directly from deepseek.com with their discounted pricing is insanely cost efficient. I topped up $10 a month and a half ago and I have yet to use up all the money in my account. Here's hoping that SOTA models become even cheaper!

Are you sure Fireworks is unquantized? It's not listing precision on OpenRouter like everyone else.

Have you tried using DeepSeek V4 Pro instead? It will be cheaper and faster than GLM.

> A typical session for me with GPT is usually over a hundred dollars.

I don't think a $100 session is "typical". I've used GPT for months, and the $20/month Plus plan is enough for my daily work.

$20 on API pricing or on subscription?

GLM 5.2 and DeepSeek v4 Pro seem to approach security research differently. This benchmark was with GLM 5.1, but the patterns are similar: https://dualuse.dev/posts/deepseek-v4-thinks-different

Overall, I still think GLM 5.2 is the much stronger performer. It's hard to tell the difference between GLM 5.2 and Opus at <120k tokens.

I believe it is because GLM 5.2 has extra anti-cyber training instilled in it. Similar to Kimi k2.7 code.

Deepseek v4 pro being in preview with less "safety" training makes it stronger for that reason. Thinking will be different and in the end, it will actually try to be useful. Just expect future Chinese LLMs to further push out "safety" guided LLMs. The future is bleak for open weight models. Prepare to have "guidelines" enforced unceremoniously to all.

Every time a new frontier model arrives I have it give one specific codebase of mine a once-over for bugs and other idiotic mistakes.

Fable found a couple of good ones, then we lost Fable, so I tried GLM5.2 and it found two critical bugs that Fable had missed, so it got my seal of approval.

We need a benchmark of independent community sourced benchmarks!

…probably already is one

Could MiMo have scraped the Mythos findings already? It's very recent.

Aren't you the Webmin guy?

I ran it on my laptop, which is a Lenovo Legion 5i (think 32 GB RAM, 4060 w/ 8 GB VRAM, you get the picture). It was a quantized model (otherwise it would not fit on my NVMe 1TB drive) at 4 bits per weight - UD_Q4_K_XL. It ran at about 12 seconds per token (not tokens per second). A fun project, but not worth it. I used 4096 tokens of context cache, and I ran it with llama.cpp - as it supports memory mapping. Because the whole thing could obviously not fit in RAM, I was curious how much it would need to stream from SSD. The answer? For a simple 4 sentence description of who it was, about 1.5 TiB was streamed from disk.
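The sizes involved are easy to sanity-check with a quick sketch. It uses the 753B parameter count quoted elsewhere in the thread and a flat bits-per-weight figure. Real quant formats such as UD_Q4_K_XL mix precisions, so actual file sizes will differ, and KV cache and runtime overhead are not counted.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# 753B parameters at a few nominal precisions (flat bits per weight assumed).
for label, bits in [("16-bit", 16), ("4-bit", 4), ("2.6-bit", 2.6)]:
    print(f"{label}: ~{weight_size_gb(753, bits):.1f} GB")
```

That works out to roughly 1.5 TB unquantized and a bit under 400 GB at a nominal 4 bits, which is why the full-precision weights don't fit on a 1 TB drive and why a 32 GB machine ends up streaming almost everything from SSD.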

short answer: they mostly aren't

A few people are running highly quantized models with limited context windows. It's still impressive, but not the benchmark level intelligence. Very few people could afford a rig for reasonable local performance at a reasonable quant, at full context size.

The antirez example is 2.6bit quant, 32k context, and few tokens per second... on a ~$7000 MacBook M5 (new RAM pricing).

8x RTX 6000. It will run you around 80-100k to get started with a model of this size at decent tps.

Don't worry though, open source evangelists will tell you that these will be running on your phone in the next 3 years.

For $100k you could run this model 24/7 through OpenRouter with 10 concurrent sessions at 50tps for a decade and have money left over for a vacation. There's no point in investing this type of money in local models unless you have a business where you're already paying for many employees' individual token usage.
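As a quick sketch of what that claim implies: the snippet below works out the blended price per million tokens at which $100k exactly covers 10 sessions at 50 tokens per second for ten years. It assumes every second is spent generating and ignores input tokens and caching. The function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def implied_price_per_million(
    budget_usd: float, sessions: int, tokens_per_second: float, years: float
) -> float:
    """Price per million tokens at which the budget exactly covers the workload."""
    total_tokens = sessions * tokens_per_second * SECONDS_PER_YEAR * years
    return budget_usd / (total_tokens / 1e6)


# 10 sessions x 50 tok/s x 10 years is about 157.7 billion generated tokens.
print(f"${implied_price_per_million(100_000, 10, 50, 10):.2f} per million tokens")
```

So the comparison holds as long as the effective rate you pay stays under roughly $0.63 per million generated tokens. Whether a given provider comes in under that depends on the input/output mix and cache discounts.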

I am finding Chinese models are introducing more guidelines against cyber. Especially Kimi k2.7 code seems to have extra training against cyber security capabilities. Last one, k2.6 was a lot stronger at cyber but obviously the Kimi team improved over time, so this is not the best they can do but no one will be able to get the best anymore.

I expect future Chinese models to introduce even more of this type of bogus "safety" training.

Looks like if you are a white hat, then you will be fighting an uphill battle. Black hats will be fine; they will not care, and they can just run a heretic model or a specially trained model.

> These numbers seem pretty low compared to what I was able to achieve specifically around the Windows kernel, win32k<->win32u to be exact.

Care to give more context to this? Seems interesting

It will almost for sure surpass the models which Trump will allow US "allies" (which he just considers client states) to use. This, together with China's growing dominance in PV, rechargeable batteries, EV, could really be the nail in the coffin for the post WWII economic world order.

If that happens it'll be an absolute disaster. Imagine a scenario where Anthropic and OpenAI prohibit most US companies from using their latest models because of safety. And meanwhile attackers use equivalent open source models to attack US companies.

Any prohibition on open source models will do nothing to fix the problem, since attackers will never feel bound by the law. All advanced models must be available for defensive purposes.

The Americans may ban the use of the Chinese models in America. But like the Chinese car ban, everyone else will use them.

> GLM export controls incoming? I predict Commerce will force OpenRouter, HuggingFace to take some open models down within the next few months.

I’m sceptical they could find the legal framework to do this even if they wanted to

They have legal authority to (a) prevent export of US goods/services; (b) ban imports of physical goods; (c) ban transactions (including purchasing services or license agreements) with foreign firms

But I’m not aware of any legal authority which lets them ban US firms from running a Chinese-developed open source AI model in the United States, if they are at arm's length from the vendor, and aren’t using it for government contracts or regulated applications.

Possibly they could order HuggingFace/etc to suspend Chinese accounts. But if someone in the US (or a third country) downloads the model from China then reuploads it to a US server, completely independently of the vendor - where is the legal hook to prohibit that?

>GLM export controls incoming?

US imposing export restrictions on a model from China?

I think state-of-the-art AI is going to be defense industry only from now on. We can have our toy drones but not the Predators and Reapers.

Obvious answer: build all your open source LLMs into firearms, get the SC to grant 2A protections.

Cool then everyone will just change their config to route through a provider overseas for an added 50-100ms latency. Who cares.

Yes, and the article author is fully aware of that. Thank you for pointing out this small mistake though.

The dollar amount is meaningless without comparison - and no other model has a price tag. Sloppy article.

It costs nothing to not be pedantic.

Claude Code is the only way to get access to the actual amortized cost of running a Claude-scale model. The consumer non-enterprise API is extremely expensive (with increasing marginal costs for the user and fat profit margins for Anthropic). If you want to approximate a state-level attacker's cost, where they can have the model on their own hardware, Claude Code is probably the best guess at the amortized cost.

1000% this, this was us internally testing if our harness worked, the motivation was never to test them in-depth 1v1. We were just really shocked at the results, there’s a lot more work to do here.