Although it doesn't really matter much. All of the open weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.
Why is GLM 5 more expensive than GLM 4.7 even when using sparse attention?
There is also a GLM-5-Coder model.
I am still waiting to see if they'll launch a GLM-5 Air series, which would run on consumer hardware.
> GLM-5 can turn text or source materials directly into .docx, .pdf, and .xlsx files—PRDs, lesson plans, exams, spreadsheets, financial reports, run sheets, menus, and more.
A new type of model has joined the series, GLM-5-Coder.
GLM-5 was trained on Huawei Ascend chips. The last time DeepSeek tried to use this chip it flopped, and they went back to Nvidia. This time it seems to have been a success.
Looks like they also released their own agentic IDE, https://zcode.z.ai
I don’t know if anyone else knows this, but Z.ai also released new tools besides the chat! There’s Zread (https://zread.ai), OCR (seems new? https://ocr.z.ai), GLM-Image gen (https://image.z.ai) and voice cloning (https://audio.z.ai).
If you go to chat.z.ai, there is a new toggle in the prompt field: you can now switch between chat and agentic mode. It is only visible when you switch to GLM-5.
Very fascinating stuff!
Certainly seems to remember things better and is more stable on long running tasks.
Claude Opus 4.6: 65.5%
GLM-5: 62.6%
GPT-5.2: 60.3%
Gemini 3 Pro: 59.1%
Betting on whether they can actually perform their sold behaviors.
Passing around code repositories for years without ever trying to run them, factory sealed.
Solid bird, not a great bicycle frame.
If it's anywhere close to those models, I couldn't possibly be happier. Going from GLM-4.7 to something comparable to 4.5 or 5.2 would be an absolutely crazy improvement.
Agreed. I think the problem is that while they can innovate at algorithms and training efficiency, the human part of RLHF just doesn't scale and they can't afford the massive amount of custom data created and purchased by the frontier labs.
IIRC it was the application of RLHF that solved a lot of the broken syntax generated by LLMs, like unbalanced braces, and I still see lots of these little problems in every open-source model I try. I don't think I've seen broken syntax from the frontier models, Codex or Claude, in over a year.
So far I haven't managed to get comparably good results out of any other local model including Devstral 2 Small and the more recent Qwen-Coder-Next.
US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallows imports of Nvidia chips as a direct result of past sanctions [3]. And this at a time when the Trump admin is trying to do whatever it can to reduce the US trade imbalance with China.
[1] https://tech.yahoo.com/ai/articles/chinas-ai-startup-zhipu-r...
[2] https://www.techradar.com/pro/chaos-at-deepseek-as-r2-launch...
[3] https://www.reuters.com/world/china/chinas-customs-agents-to...
In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.
For those who are curious, the benchmark is just the ability of the model to follow a custom tool-calling format. I give it coding tasks using chat.md [1] + MCPs. And so far it's just not able to follow the format at all.
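For the curious, here is a hypothetical illustration of what that kind of format check can look like. Everything below (the fence name, the fields) is made up for illustration; the real format is whatever chat.md defines:

    # Hypothetical illustration only: the model is instructed to emit tool
    # calls as fenced blocks like the one matched below, and the harness
    # checks that its output actually parses.
    import json
    import re

    TOOL_CALL_RE = re.compile(
        r"```tool_call\s*\nname:\s*(?P<name>\w+)\s*\nargs:\s*(?P<args>\{.*?\})\s*\n```",
        re.DOTALL,
    )

    def follows_format(model_output: str) -> bool:
        # Pass iff at least one well-formed tool call is present and its
        # args field is valid JSON.
        m = TOOL_CALL_RE.search(model_output)
        if m is None:
            return False
        try:
            json.loads(m.group("args"))
        except json.JSONDecodeError:
            return False
        return True

Models that ignore the custom format and fall back to their native tool-call syntax fail every task, which is exactly the failure mode being described.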
We already know that intelligence scales with the log of tokens used for reasoning, but Anthropic seems to have much more powerful non-reasoning models than its competitors.
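To put that scaling claim in symbols (the functional form is the commonly cited empirical one; a and b are illustrative constants, not from any particular paper):

    % illustrative: benchmark score as a function of reasoning-token budget n
    \mathrm{score}(n) \approx a + b \log n
    % each doubling of n then buys a roughly constant gain of b \log 2
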
I read somewhere that they have a policy of not advancing capabilities too much, so could it be that they are sandbagging and releasing models with artificially capped reasoning to be at a similar level to their competitors?
How do you read this?
Edit: Input tokens are twice as expensive. That might be a deal breaker.
Honestly, these companies are so hard to take seriously with these release details. If it's an open-source model and you're only comparing open-source models - cool.
If you're not top in your segment, maybe show how your token cost and output speed more than make up for that.
Purposely showing prior-gen models in your release comparison immediately discredits you in my eyes.
Particularly for tool use.
Full list of models provided: https://dev.synthetic.new/docs/api/models
Referral link if you're interested in trying it for free, plus a discount for the first month: https://synthetic.new/?referral=kwjqga9QYoUgpZV
Care to elaborate more?
something that is at parity with Opus 4.5 can ship everything you did in the last 8 weeks, ya know... when 4.5 came out
Just remember to put all of this in perspective: most of the engineers and people here haven't even noticed any of this stuff, and those who have are too stubborn or policy-constrained to use it. The open-source nature of the GLM series helps the policy-constrained organizations, since they can theoretically run it internally or on-prem.
Today's meme was this question: "The car wash is only 50 meters from my house. I want to get my car washed, should I drive there or walk?"
Here is Claude's answer just right now:
"Walk! At only 50 meters (about 150 feet), it would take you less than a minute to walk there. Driving such a short distance would actually be less convenient - by the time you get in the car, start it, drive, and park, you could already be there on foot. Plus, you'd save fuel and your car would be right there waiting for you after the wash is done."
Here is ChatGPT, also right now:
"Since it’s only *50 meters* (about 160 feet), walking makes a lot more sense.
Here’s why:
* *It’s extremely close* — you’ll get there in under a minute.
* *Driving would actually be less efficient* — by the time you start the car, move it, and park, you’re already there.
* *Cold starts aren’t great for your engine* — very short drives don’t let the engine warm up properly.
* *It saves fuel and emissions*, even if it’s a tiny amount.
* You avoid the hassle of maneuvering such a short distance.
The only time driving might make sense is if:
* The weather is terrible (heavy rain, snow, extreme cold).
* You have mobility issues.
* The car wash requires you to stay in the vehicle and you don’t want to leave it there.
Otherwise, this is a perfect “walk it over” situation."
Please save us all that talk about frontier and SOTA, and about how only the closedAI models are any good while the others are all so bad and benchmaxxed. For most purposes a Toyota is just as good as a BMW or a Merc or whatever luxury brand tickles your fancy. Even worse, the latest 80B Qwen Next is not far from Opus 4.6 but runs on my laptop.
But this here is excellent value if they offer it as part of their subscription coding plan. Paying by token could really add up: I did about 20 minutes of work and it cost me $1.50 USD, and it's more expensive than Kimi 2.5.
Still 1/10th the cost of Opus 4.5 or Opus 4.6 when paying by the token.
And we will have Deepseek 4 in a few days...
Since you presumably want the car to be clean, the answer depends entirely on how the car is getting washed.

The Breakdown

If it’s a Professional Car Wash: You’ll need to drive. Bringing the car to the wash is generally the prerequisite for them washing it. Plus, walking 50 meters just to ask them to come pick up a car that is visible from their front door might result in some very confused looks from the staff.

If it’s a Self-Service Pressure Washer: Again, drive. Dragging a 1,500kg vehicle 50 meters by hand is an incredible workout, but probably not the "quick chore" you had in mind.

If you are hiring a Mobile Detailer: Walk. Stroll over there, tell them where you're parked, and walk back home to watch TV while they do the work in your driveway.
You're assuming the conclusion
The previous GLM-4.7 was also supposed to be better than Sonnet and even match or beat Opus 4.5 in some benchmarks (https://www.cerebras.ai/blog/glm-4-7), but in real-world use it didn't perform at that level.
You can't read the benchmarks alone any more.
"Since the goal is to get your car washed, the car needs to be at the car wash. If you walk, you will arrive at the car wash, but your car will still be sitting at home."
Are you sure that question is from this year?
"" [...] Since you need to get your car washed, you have to bring the car to the car wash—walking there without the vehicle won't accomplish your goal [...] If it's a self-service wash, you could theoretically push the car 50 meters if it's safe and flat (unusual, but possible) [..] Consider whether you really need that specific car wash, or if a mobile detailing service might come to you [...] """
Which seems slightly (unintentionally) funny.
But to be fair, all the Gemini (including Flash) and GPT models I tried did understand the question.
You should definitely drive.
If you walk there, your car will still be dirty back at your house! Since the goal is to get the car washed, you have to take it with you.
PS fantastic question!
It's a big deal that open-source capability is less than a year behind frontier models.
And I'm very, very glad it is. A world in which LLM technology is exclusive and proprietary to three companies from the same country is not a good world.
Before you get too excited: GLM-4.7 outperformed Opus 4.5 on some benchmarks too (https://www.cerebras.ai/blog/glm-4-7 - see the LiveCodeBench comparison).
The benchmarks of the open weights models are always more impressive than the performance. Everyone is competing for attention and market share so the incentives to benchmaxx are out of control.
Obviously, for the average US taxpayer, getting along with China is in our interest - not so much for our economic elites.
I use both Chinese and US models, and Mistral in Proton’s private chat. I think it makes sense for us to be flexible and not get locked in.
OpenCode in particular has huge community support around it- possibly more than Claude Code.
Here is the response from Gpt-5.2 using my default custom instructions in the mac desktop app.
OBJECTIVE: Decide whether to drive or walk to a car wash ~50 meters from home, given typical constraints (car must be present for wash).
APPROACH: Use common car-wash workflows + short-distance driving considerations (warm engine, time, parking/queue). No web sources needed (stable, local decision).
OUTPUT:
• Default choice: Drive. A car wash generally requires the car on-site, and 50 meters of driving is trivial.
• When walking makes sense:
  • You’re going to drop the car off (hand-wash/detail) and pick it up later.
  • You want to avoid moving the car now (ice, tight parking, you’ve been drinking, etc.) and someone else will handle the car.
• Micro-guidance:
  • If it’s an automatic wash you stay in, drive.
  • If it’s detail/hand wash with a wait, drive there, then walk home (or just wait).
CONFIDENCE: 92/100 Assumptions: It’s a normal car wash that requires your car to be there; conditions are safe to drive 50 meters.
This is a silly test for the big coding models.
This is like saying "all calculators are the same, nobody needs a TI-89!" and then adding 1+2 on a pocket calculator to prove your point.
Context for the unaware: https://simonwillison.net/tags/pelican-riding-a-bicycle/
We need a new, authentic scenario.
This pattern of considering 90% accuracy (like the level we've seemingly stalled out on for the MMLU and AIME) to be 'solved' is really concerning for me.
AGI has to be 100% right 100% of the time to be AGI and we aren't being tough enough on these systems in our evaluations. We're moving on to new and impressive tasks toward some imagined AGI goal without even trying to find out if we can make true Artificial Niche Intelligence.
>China’s philosophy is different. They believe model capabilities do not matter as much as application. What matters is how you use AI.
2. Cost is only one input into price determination, and we really have absolutely zero idea what the margins on inference even are, so assuming the current pricing is actually connected to costs is suspect.
Have you had good results with the other frontier models?
I'm developing a personal text editor with vim keybindings and paused work because I couldn't think of a good interface that felt right. This could be it.
I think I'll update my editor to do something like this but with intelligent "collapsing" of extra text to reduce visual noise.
They're comparing against 5.2 xhigh, which is arguably better than 5.3. The latest from openai isn't smarter, it's slightly dumber, just much faster.
Have any of these outfits ever publicly stated they used Nvidia chips? As in the non-officially obtained ones. No.
> US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallow imports of Nvidia chips
Sort of. It's all a front, on both sides. China still ALWAYS had access to Nvidia chips - whether that's the "smuggled" ones or they run them in another country. It's not costing Nvidia much. The opening of China sales for Nvidia likewise isn't as much of a boon; it's already priced in.
> At a time when Trump admin is trying to do whatever it can to reduce the US trade imbalance with China
Again, it's a front. It's about news and headlines. Just like when China banned lobsters from a certain country, the only thing that happened was that they went to Hong Kong or elsewhere, got rebadged and still went in.
Intelligence per <consumable> feels closer. Per dollar, or per second, or per watt.
That being said, this model is MIT licensed, so it's a net benefit regardless of being benchmaxxed or not.
Let's have a serious discussion. Just because Claude's PR department coined the term benchmaxxing, we should not be using it unless they shell out some serious money.
As far as I can tell for AIME, pretty much every frontier model gets 100% https://llm-stats.com/benchmarks/aime-2025
1. Take the top ten searches on Google Trends (on the day of a new model release).
2. Concatenate them.
3. SHA-1 hash the result.
4. Use this as a seed to perform a random noun-verb lookup in an agreed-upon large dictionary.
5. Construct a sentence using an agreed-upon stable algorithm that generates reasonably coherent prompts from an immensely deep probability space.
That's the prompt. Every existing model is given that prompt and compared side-by-side. You can generate a few such sentences for more samples.
Alternatively, take the top ten F500 stock performers. Some easy signal that provides enough randomness but is easy to agree upon and doesn't provide enough time to game.
Teams can also pre-generate candidate problems in this style to attempt improvement across the board, but they won't have the exact questions on test day. A sketch of the scheme is below.
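A rough sketch of steps 1-5, where everything is illustrative: get_top_searches() stands in for the Google Trends fetch (or the F500-ticker alternative), and the sentence template stands in for the "agreed upon stable algorithm":

    import hashlib
    import random

    NOUNS = ["pelican", "lighthouse", "accordion", "glacier", "typewriter"]
    VERBS = ["riding", "juggling", "repairing", "painting", "orbiting"]

    def get_top_searches() -> list[str]:
        # Hypothetical stand-in: the day's top queries, fetched on release day.
        return ["glm-5 release", "weather tomorrow", "nba scores"]

    def daily_prompts(n_samples: int = 3) -> list[str]:
        # Steps 1-3: concatenate the searches and SHA-1 hash them.
        digest = hashlib.sha1("".join(get_top_searches()).encode()).hexdigest()
        # Step 4: seed a deterministic RNG with the hash, so every lab
        # derives exactly the same prompts independently.
        rng = random.Random(int(digest, 16))
        # Step 5: a placeholder "stable algorithm" -- one fixed template.
        return [
            f"Generate an SVG of a {rng.choice(NOUNS)} {rng.choice(VERBS)} a {rng.choice(NOUNS)}"
            for _ in range(n_samples)
        ]

The point of seeding from the hash is that anyone can reproduce the exact prompts after the fact, but nobody can know them before release day.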
The US bluff got called. A year back it looked like the US held all the cards and could squeeze others without negative consequences, i.e. have its cake and eat it too.

Since then: China has not backed down, Europe is talking de-dollarization, BRICS is starting to find a new gear on a separate financial system, merciless mocking across the board, zero progress on Ukraine, the Fed wobbled, focus on gold as an alternative to US fiat, NATO wobbled, endless scandals, a reputation for TACO, weak employment, tariff chaos, calls for withdrawal of gold from the US's safekeeping, chatter about dumping US bonds, multiple major countries being quite explicit about telling Trump to get fucked.

Not at all surprised there is a more modest tone... none of this is going the "without negative consequences" way.
>Mistral in Proton’s private chat
TIL
I'm not immediately discounting Z.ai's claims because they showed with GLM-4.7 that they can do quite a lot with very little. And Kimi K2.5 is genuinely a great model, so it's possible for Chinese open-weight models to compete with proprietary high-end American models.
I've noticed that whenever such a meme comes out, if you check immediately you can reproduce it yourself, but after a few hours it's already updated.
Last time there was hype about a GLM coding model, I tested it with some coding tasks and it wasn't usable compared with Sonnet or GPT-5.
I hope this one is different
Codex was super slow till 5.2 codex. Claude models were noticeably faster.
Dollar/watt are not public and time has confounders like hardware.
Uh, yes? DeepSeek explicitly said they used H800s [1]. Those were not banned at the time, btw. Then the US banned them too. Then the US was like 'uhh okay maybe you can have the H200', but then China said it was not interested.
> They believe model capabilities do not matter as much as application.
Let's see what their tone is when their hardware can match up.
It doesn't matter because they can't make it matter (yet).
And yes, the consequence is strengthening the actual enemies of the USA, their AI progress is just one symptom of this disastrous US administration and the incompetence of Donald Trump. He really is the worst President of the USA ever, even if you were to just judge him on his leadership regarding technology... and I'm saying this while he is giving a speech about his "clean beautiful coal" right now in the White House.
>The main flaw is that this idea treats intelligence as purely abstract and not grounded in physical reality. To improve any system, you need resources. And even if a superintelligence uses these resources more effectively than humans to improve itself, it is still bound by the scaling of improvements I mentioned before — linear improvements need exponential resources. Diminishing returns can be avoided by switching to more independent problems – like adding one-off features to GPUs – but these quickly hit their own diminishing returns.
Literally everyone already knows the problems with scaling compute and data. This is not a deep insight. His assertion that we can't keep scaling GPUs is apparently not being taken seriously by _anyone_ else.
Those of us who just want to get work done don't care about comparisons to old models, we just want to know what's good right now. Issuing a press release comparing to old models when they had enough time to re-run the benchmarks and update the imagery is a calculated move where they hope readers won't notice.
There's another type of discussion where some just want to talk about how impressive it is that a model came close to some other model. I think that's interesting, too, but less so when the models are so big that I can't run them locally anyway. It's useful for making purchasing decisions for someone trying to keep token costs as low as possible, but for actual coding work I've never found it useful to use anything other than the best available hosted models at the time.
I'm guessing both humans and LLMs would tend to get the "vibe" from the pelican task, that they're essentially being asked to create something like a child's crayon drawing. And that "vibe" then brings with it associations with all the types of things children might normally include in a drawing.
Do electric pelicans dream of touching electric grass?
Then they haven't. I said the non-officially obtained ones, the ones they can't / won't mention, i.e. those Blackwells etc...
Which is exactly how you're supposed to prompt an LLM. Is the fact that giving a vague prompt gives poor results really surprising?
It's just an experiment on how different models interpret a vague prompt. "Generate an SVG of a pelican riding a bicycle" is loaded with ambiguity. It's practically designed to generate 'interesting' results because the prompt is not specific.
It also happens to be an example of the least practical way to engage with an LLM. It's no more capable of reading your mind than anyone or anything else.
I argue that, in the service of AI, there is a lot of flexibility being created around the scientific method.
While I do understand your sentiment, it might be worth noting that the author is the author of bitsandbytes, which was one of the first libraries with quantization methods built in and was(?) one of the most used quantization backends. I'm pretty sure transformers from HF still uses it as the Python-to-CUDA bridge.
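For context, this is roughly what that integration looks like from the transformers side. The model id is just an example, and the settings are the common QLoRA-style 4-bit defaults:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # quantize weights to 4-bit on load
        bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
        bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    )

    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1",            # example model id
        quantization_config=quant_config,       # bitsandbytes kernels under the hood
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

All the actual quantized matmuls there are dispatched to bitsandbytes' CUDA kernels, which is the Python-to-CUDA role the comment describes.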
That you think corporations are anything close to quick enough to update their communications on public releases like this only shows that you've never worked in corporate
The whole idea of this question is to show that pretty often implicit assumptions are not discovered by the LLM.
I don't think there's a good description anywhere. https://youtube.com/@t3dotgg talks about it from time to time.
For the last generation of models, and for today's flash/mini models, I think there is still a not-unreasonable binary question ("is this a pelican on a bicycle?") that you can answer by just looking at the result: https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/