And, of course, it was burning 10 times more tokens for this output.
This implies that bigger models are more likely to hallucinate? That doesn't match my experience.
From how they measure it, a model that simply answers "I don't know." to any prompt would be the one that hallucinates the least. So it's not surprising at all that a smaller model can perform better.
"they say u hallucinate 3x more than GLM 5.2, whats your comeback to this? do i need to dump u? $article"
> Moving forward, the industry cannot continue to train bigger and bigger models since their intelligence not only plateaus but often will get worse
These are wild claims - why are we concluding that bigger models and more data = more hallucination? That’s actually the opposite of what’s been happening over the last couple years. Some models may still hallucinate more but they all hallucinate much less than the original 175B ChatGPT which was smaller and trained on (much) less data than anything current.
Edit: My mention of data comes from this quote:
> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling
My take on the current situation: it seems clear that the industry has seen that there is still a lot left to squeeze out of sub-1T models. But for that you do need more, high-quality data in the distribution which you want to unlock capabilities for.
Sam Altman himself had a blog post about this a while ago that seemed to suggest this thought, so I guess it's obvious to everyone. But if that is so I assume it's just not as easy in practice.
I'd also hesitate to attribute this difference in hallucination rates purely to model size. Yes, GLM-5.2 hallucinates much less frequently than DeepSeek-V4 Pro with twice as many parameters, but DeepSeek-V4 Flash is less than half the size of GLM-5.2 and tops the AA-Omniscience hallucination index. Opus 4.8, which is likely larger than DeepSeek-V4 Pro, has a 36% hallucination rate on the index, above GLM-5.2's 28%, but way below the DeepSeek numbers. Opus also has a 47% accuracy rate vs GLM-5.2's 25%. If you use these numbers to calculate the absolute hallucination rate (i.e., the number of hallucinated responses divided by the total number of responses), you get 19% for Opus and 21% for GLM-5.2.
So yes, all else equal larger models may be more prone to hallucination in scenarios where they don't know the answer, but there are a lot of other factors that affect hallucination rates, and it's not totally clear that this is the main metric that's worth tracking.
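For anyone who wants to check the arithmetic above, here is a minimal sketch. It assumes the reported hallucination rate applies only to the share of questions the model did not answer correctly, so the absolute rate is (1 - accuracy) × hallucination rate. The function name is made up for illustration.

```python
def absolute_hallucination_rate(accuracy: float, hallucination_rate: float) -> float:
    """Share of ALL responses that are hallucinated.

    Assumption: the hallucination rate is measured only over the questions
    the model did not answer correctly.
    """
    return (1.0 - accuracy) * hallucination_rate


# (accuracy, hallucination rate) as quoted in the comment above
models = {
    "Opus 4.8": (0.47, 0.36),
    "GLM-5.2": (0.25, 0.28),
}

for name, (accuracy, rate) in models.items():
    print(f"{name}: {absolute_hallucination_rate(accuracy, rate):.0%}")
```

This prints 19% for Opus 4.8 and 21% for GLM-5.2, matching the figures above.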
Wow! I already knew from previous research shared here that hallucinations are a fundamental problem for LLMs and likely to be unfixable, just like prompt injection, but I didn't realize the hallucination rates were so bad!
Everyone has been acting like the best models only hallucinate in edge cases, but even the best performing one mentioned here - GLM-5.2 - has a hallucination rate of 28% when it doesn't "know" the answer to something.
That said, I think the title on the blog - "Bigger models are not the way" is probably more fitting and touches on what should be even bigger news. If bigger models and bigger training sets have already stopped producing proportional returns, then it seems likely we are already near the top of the S-curve. That's huge news, considering the valuation of companies like OpenAI and xAI is largely based around the (absurd) idea of ever increasing scaling from these models.
I'm already hallucinating about how this could work and it involves catapults
In addition, I think that during RLHF, the labs have a bias for interesting questions that admit a solution and under-represent the "bad" questions that admit no good answer. They probably also put less effort into RLHF on questions where the model should admit it doesn't know.
As humans, we have been trained all our lives, in the real world, by being confronted with questions we don't know the answer to right away, and we have learned to assess very quickly that we don't know or that we are not sure about the answer.
Another thing we have and LLMs do not is fear. We have an amygdala in our brain, separate from the logical-thinking part, that can raise a fear signal so that we become much more careful about what we say. An LLM, on the other hand, has no fear organ like the amygdala and just learns to respond based on the patterns in its training corpus. It never "fears" looking bad or being fired because it gave a wrong answer, so it can merrily give perfectly wrong answers.
So, we see that hallucination rates can be improved with training, but currently the labs are not optimizing for that because there is a high-stakes race to get the most intelligent and capable model.
Alternatively, I can see creating a separate amygdala-like organ for an LLM that asynchronously fires signals, based on the user prompt and the LLM's thinking trace, to inject a fear signal into the LLM's reasoning so that it can steer its answer toward something safer.
GLM 5.2 tends to stray way more than 5.1. It also hallucinates things subtly: it morphs requirements and draws unfounded conclusions. This is not something I have experienced in any model I have seen so far.
In coding it's especially annoying because it steers the whole request. E.g. I give the instruction "make me a Rust-WASM-Canvas app" and GLM 5.2 goes like "Oh, the user surely doesn't mean that. I'd better build a Dioxus app instead".
The article uses the example of GLM being smaller than DeepSeek, yet better on hallucinations as "smaller can be good too"
But the GLM family itself is scaling up fast: GLM-5.x family is 754B, double the previous generation of GLM-4.x
> comes within just 4 points of GPT-5.5 and 9 points of Fable 5
9 percentage points IS a big difference
What about using two models, with a smaller model used for this kind of negative reasoning?
The task was simple: using the MS-MARCO[0] dataset, which contains queries, search results, and answers, I made a training set that has:
1. Questions paired with real results supporting them (mixed with some irrelevant results), and a correct answer
2. Questions paired only with irrelevant results, with the answer “No answer present”
The dataset was huge (close to 1M samples), and I trained using different techniques, from SFT (just mimicking the dataset) to DPO (a good answer contrasted with a bad answer for the same user query) to GRPO (a verifier that checks against my annotations whether an answer was present or not).
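To make the two sample types concrete, here is a rough sketch of how such a training set could be assembled. This is not the actual pipeline: the record keys (`query`, `relevant`, `answer`) are a hypothetical simplification rather than the real MS-MARCO schema, and it assumes passages belonging to other queries can stand in as irrelevant results.

```python
import random
from dataclasses import dataclass

NO_ANSWER = "No answer present"


@dataclass
class Example:
    query: str
    passages: list[str]
    answer: str


def build_examples(records, rng, n_distractors=2):
    """Build both sample types from simplified (hypothetical) records."""
    examples = []
    for i, rec in enumerate(records):
        # Passages from OTHER queries serve as irrelevant results (an assumption).
        pool = [p for j, other in enumerate(records) if j != i
                for p in other["relevant"]]
        distractors = rng.sample(pool, min(n_distractors, len(pool)))

        # Type 1: supporting results mixed with irrelevant ones, real answer.
        mixed = rec["relevant"] + distractors
        rng.shuffle(mixed)
        examples.append(Example(rec["query"], mixed, rec["answer"]))

        # Type 2: only irrelevant results, so the target is the refusal string.
        examples.append(Example(rec["query"], list(distractors), NO_ANSWER))
    return examples


records = [
    {"query": "capital of France?", "relevant": ["Paris is the capital of France."], "answer": "Paris"},
    {"query": "boiling point of water?", "relevant": ["Water boils at 100 C at sea level."], "answer": "100 C"},
    {"query": "author of Dune?", "relevant": ["Dune was written by Frank Herbert."], "answer": "Frank Herbert"},
]
examples = build_examples(records, random.Random(0))
```

Each query ends up in the set twice, once answerable and once not, which is the contrast the SFT/DPO/GRPO runs were trained on.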
Lo and behold, this didn’t reduce hallucination, but rather made it much worse. Now the model started claiming “No answer present” even when one is, or even when the question didn’t need search results in the first place (simple stuff like what is X+Y).
Now you could argue that my training was basic compared to what frontier labs could do. Yet I think it hints at a more profound limitation. LLMs are finicky and don’t have a neat understanding of things from first principles (list the search results, check the relevance of each result to the user query, if results are below a certain threshold of relevance then don’t consider them when answering …).
tl;dr: not as simple as one might think, perhaps not attainable at all.
Because that's what they measured in this case.
1) Has a certain standard of evidence been met?
2) Are the related arguments free of logical inconsistencies?
We can train the LLMs to do 2, and maybe even 1 to some extent (exactly what quality of evidence a computer can practically gather is limited). But that isn't going to get rid of hallucinations, for the same reason courts are hit-and-miss or the conclusions of studies often aren't very reliable. These techniques help, but sometimes they still get people to say things that, on close inspection, turn out to be nonsense. And those best-effort approaches are too much to expect for most questions an LLM will be handed which are informal, low stakes and don't need strong supporting evidence or logical rigour.
I think it is underestimated how many LLM-style hallucinations people themselves have. It just isn't obvious because most humans have a strategy of only repeating what the herd says after it has been socially vetted, which makes their individual eccentricities less obvious.
TL;DR: I don't think it looks like an easy problem for RLVR; it looks technically unsolvable. Even making progress requires a philosophical breakthrough on the nature of truth so that the objective function can be established.
I'm pretty sure it's mostly due to the training data quality. No idea why this never gets mentioned in those discussions.
It was obvious right from the get-go that the scaling law just enabled some abilities that were described by the underlying data, allowing the ANN to abstract them in the latent space.
a key method to help with hallucinations is to provide good sources when asking questions (context engineering / knowledge base)
That’s not what your quotes said. They said bigger models = plateau in intelligence, nothing about more data or increased hallucinations
The relevant quote for what you’re talking about would be:
> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer.
So there’s two separate claims: 1) bigger models have plateauing results 2) models trained on larger amounts of factual data have a higher hallucination rate
I’m pretty sure #1 is well known; I think OpenAI’s own research on scaling laws showed diminishing returns on parameter count and training data volume years ago. I don’t know what the support for #2 is besides the actual post contents.
AA-Omniscience is the only AI benchmark I know of where randomly guessing gets you a lower average score than answering all questions with "I don't know"
You can definitely tune a model to say "I don't know" more often, but it will cost you performance: the model will reject some questions that it could answer meaningfully. In the degenerate case the model could collapse into predicting that sequence always or almost always.
Yes, pretraining still exists. But for the past few years, pretraining by reading the internet is just the initial bootstrapping of LLM training. The RL training they get from bespoke training data, with very very different characteristics than what these armchair analyses claim, dominates these days.
For your scenario the always-confident strategy will give an average of -90. Saying "I don't know" to everything will give 0.
A lot of models have a negative AA-Omniscience Index.
They also do have AA-Omniscience Accuracy and AA-Omniscience Hallucination Rate that handle "I don't knows" differently.
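A minimal sketch of how these three numbers relate, under assumptions: it scores +1 per correct answer, -1 per incorrect answer and 0 per "I don't know", scaled to the range -100 to 100, and it takes the hallucination rate as incorrect answers over all non-correct responses. The 5%-accuracy scenario is only an illustrative guess that reproduces the -90 figure, since the parent comment's exact setup isn't quoted here.

```python
def omniscience_style_scores(correct: int, incorrect: int, abstained: int) -> dict:
    """Index, accuracy and hallucination rate under the assumed scoring rule."""
    total = correct + incorrect + abstained
    not_correct = incorrect + abstained
    return {
        # +1 for correct, -1 for incorrect, 0 for abstaining, scaled to -100..100
        "index": 100.0 * (correct - incorrect) / total,
        "accuracy": correct / total,
        # share of non-correct responses that were wrong answers, not refusals
        "hallucination_rate": incorrect / not_correct if not_correct else 0.0,
    }


# Illustrative assumption: always answer confidently, right 5% of the time.
always_confident = omniscience_style_scores(correct=5, incorrect=95, abstained=0)
# Say "I don't know" to every question.
always_abstain = omniscience_style_scores(correct=0, incorrect=0, abstained=100)

print(always_confident["index"], always_abstain["index"])
```

Under this rule the always-confident strategy lands at -90 with a 100% hallucination rate, while abstaining on everything scores 0 on the index and 0% on hallucination rate but also 0% on accuracy, which is why the separate metrics are worth reading together.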
But I guess my logic breaks down here a bit, because if there is such a thing as a validated answer, then the correct answer is in fact never uncertainty. The correct answer is to continue post training until the model gets it right. So perhaps the real answer is to create RLVR tasks where the valid answer is "I don't know" and nothing else, like this benchmark does. Or maybe that doesn't work either, no matter how many you create.
I feel as though there is some kind of philosophical lesson to be had from how hard hallucinations are to get rid of. Maybe, similarly to humans, successful models are often "arrogant" in a sense. Perhaps you just never solve an Erdős problem without some degree of self-deception that it's possible for you to do so. In this line of thinking, greatness in humans is actually not related to humility, but just being so good that you actually get things right when you try. Expressing humility is of course something great people tend to do, but I'm referring to what happens under the hood.
If you squint a bit, that's kinda the trend with models. The useful ones are not that much less likely to hallucinate, they are just good enough that they tend to get it right. This comparison is of course probably not even remotely correct, but at least it's fun to anthropomorphize a bit.
If Opus gets all but the hardest questions right, it might have a higher hallucination rate because the questions it gets wrong are the questions where verification or hallucination detection are the most difficult
Something about the cost model of US near frontier has the cattle prod out whenever a model is uncertain but thrashes on whether to search. Search flinch is roughly all hallucination.
I don't even wait for the model's turn, if there's a man page or Hoogle hit, stuff the last prefix cache cut point. You come out ahead.
I don't think anyone is trying to add "a coherent worldview" by reducing hallucinations; I'm not sure how that could even realistically be the aim.
What people want, is for the models to stop giving confident answers that are clearly incorrect. Yes, it won't lead to "a coherent worldview", but it'll at least stop wasting people's time if the model said "You know what, what you said doesn't make sense / isn't clear, is what you mean .... ?" or even "I'm not sure" or "I don't know".
Currently, if you have the wrong starting point, ask the model to do something, they more often than not just go ahead and do that, misunderstandings or not. They seem optimized to never push back, unless you prompt for that, and most seem to favor "I'm just gonna assume X" rather than taking a step back and figuring out how to not assume. Again, unless you prompt against that behaviour/steering it into a different workflow.
so, thats all.
I’m not sure how to explain it, but the more I see LLM-written code the more I feel it’s bad code doing a good job of masquerading as good code. I think this take will become less-hot in the next year or two when we see enterprise greenfield projects that were created entirely with LLM “assistance” go to prod. I think we’ll find that the code is difficult for humans to read, understand, debug, and extend- and I think the larger the codebase the harder it will be for LLMs to maintain. More opportunity for hallucination, larger context windows needed, more tokens bought and spent for smaller and smaller code changes. I think the more code an LLM writes for an app, the worse that codebase becomes.
Do you have a cite for this?
If a human makes up some bullshit lie, I wouldn't accuse them of making it up only if they actually knew the correct answer. If you don't know, the only correct answer is "I don't know". Any other answer is made-up bullshit. Why is it a hallucination if and only if the LLM contains the answer? If you make something up it's still wrong. It shouldn't matter if you could give the correct answer. You didn't, and invented some bullshit instead?
Follow up question, how can I apply this rule set to the next test I have to take? I'd love to be able to use "I didn't know" as the excuse for why I made something up.
edit:
> and it's not totally clear that this is the main metric that's worth tracking.
I don't know, the rate at which some model is willing to make up something feels useful. If the argument I see repeated on HN so much is that it's impossible to completely get rid of hallucinations; being able to choose a model that's less likely to invent some lie seems like a positive trait, no?
Either way, I'm happy to agree that a restrictive definition, where a lie doesn't count as a hallucination iff the model doesn't know the answer feels strictly, infinitely less useful than an exact error rate. What percentage of emitted tokens are misleading would be useful for me. Anyone know any group that's attempted to quantify the global error rate?
Hallucinations all the way down...
They are much better incentives. In real life a wrong answer is much more damaging than a don't know.
We really don't know what the actual reason is given the politics at play. I would bet more on the Trump administration looking for any excuse to punish Anthropic
the oss models are impressive but it's pretty clear how quickly they fall off when you try to use them outside of a narrow set of problems they benchmarked well on when compared to opus/5.5
"Confidently incorrect" has negative value. At best, a human realizes the answer is wrong; at worst, the incorrect information is not identified and can cause untold damage. By having the potential to be so severely wrong, it lessens the value of correct answers because there is a lower confidence value on their output.
We can rank them based on how much they know and people will gravitate towards those that do know more.
It's a market after all.
If someone sold you a "Solved all your problems" machine, and it suddenly doesn't solve all your problems, then probably no, you shouldn't pay.
But the way I'm being sold LLMs is basically "A text generator that gives you plausible-sounding human text that sometimes hallucinates and gets things wrong, based on your input", so regardless of what the outcome is, I still made use of the "Input > Output" part, which is what I bought into, so I should still pay for that.
Now of course a bunch of people will say they've been sold the former, but the companies themselves seem to be selling the latter. That's my perspective as a person who doesn't follow "influencers" and whatnot though, who seem to be selling the public on the former rather than the latter.
But as soon as you do minimal reviews and high-level corrections, applications turn out just fine.
Can there be bugs? Sure. That's the price of not reading or understanding every line. It should depend on the criticality of your software how much of these you tolerate and how much you don't (reviewing, understanding, testing everything 100% like you were used to if you had written it yourself will kill most if not all of your gained speed)
But I never got the impression of unmaintainability or unfixable bugs.
Actually the other way around: a really good cleanup pass, architectural changes, or bugfixes are seldom more than a few prompts and 2 hours away, provided your overall base is decent and you actually gave a fuck from the start.
They clearly are only assistants for the moment, you can use them to do work ... but only if you could do the said work yourself alone in the first place.
In other words, you shouldn't choose the model that hallucinates the least without detailed prompting, since a well-crafted agents.md clause should go a long way to improving output, and almost certainly the top scoring order will be different. To the point that I don't find this type of raw comparison useful beyond maybe 'make sure you test that one with more explicit prompts'.
I've had more success with creating a plan first and then implementing it in (short-lived) sub-agents.
Ironically, good software architecture patterns (small functions, single responsibility) heavily impact the performance of these models as well. They do surprisingly well in well-architected codebases.
They do very poorly in anything that's a mess where Opus and GPT 5.5 still get reasonable performance.
> Why is everyone expecting LLMs to be like the Star Trek computer?
Because they are often marketed as magic AIs, not as mere language models.
[0] https://bpspsychub.onlinelibrary.wiley.com/doi/10.1111/bjso....
My ask:
> In a couple sentences, explain to me the product I'm being sold with ChatGPT. What does it do for me?
The Reply from ChatGPT:
> ChatGPT is a conversational AI that helps you think, create, learn, analyze, and get things done faster. You can use it to answer questions, draft and edit writing, summarize information, brainstorm ideas, learn new topics, write code, plan projects, and increasingly act as an assistant that can search for information, work with documents, generate images, and help complete tasks.
> In simple terms: you're buying access to an AI that turns natural language into useful work—saving time, expanding your capabilities, and giving you an always-available collaborator for both everyday tasks and specialized knowledge work.
This sounds much more like the former, a "solve all your problems" machine.... not a plausible-sounding text generation machine.
Only two weeks ago Sam Altman said their new data center "could" be where cancer gets cured[0]. It is only the people who deeply understand AI who see it as a text generator of plausible-sounding text. That isn't what the marketing department, the CEO, or the product itself seem to be saying. I'm using OpenAI as the example here, but the others don't seem much different.
> If you can dream it, Claude can help you do it. Claude can process large amounts of information, brainstorm ideas, generate text and code, help you understand subjects, coach you through difficult situations, simplify your busywork so you can focus on what matters most, and so much more.
What marketing copy have you read for LLMs that is like you mentioned?
> But the way I'm being sold LLMs, is basically "A text generator that gives your plausible-sounding human text that sometimes hallucinates and gets things wrong, based on your input"
"You're prompting it wrong" is quickly becoming the new "you're holding it wrong".
It's wild how willing software engineers are to blame the user when the actual problem is their own defective design.
Ideally we all, as an industry, will stop accepting this as a reasonable excuse for the demonstrated incompetence.
I strongly suspect most closed source code developed under commercial or internal pressure is pretty awful after a few years of development.
All LLM code has to do is suck less than existing code. And that's presuming the code quality doesn't improve as the models, the harnesses and our ways of working with them improve.
My observation is that they are equally bad and hard to maintain or even more so than the new ones.
One thing I’ve noticed is that the LLM assisted ones have a lot more comments which is nice but take more time to read.
On a more serious note, I think the problem will be the inability to handle/maintain the systems once they are too big and nobody has any idea what's inside of them or what they do.
When pushed, I then start thinking and realise my mistake. System 1 vs 2?
Jun 18, 2026
A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling. The limits of this paradigm were put on the world’s stage when Claude Fable 5 was restricted by the US government just three days after its release, marking the first US AI ban stemming from national security. One of the biggest models in the world was banned because a single jailbreak was too much of a risk.
The above is true in almost all cases. The biggest models in the world clearly score the highest on the Artificial Analysis Intelligence Index. Yet, Z.ai’s newest, GLM-5.2 (753B parameters, roughly 40B active), comes within just 4 points of GPT-5.5 and 9 points of Fable 5. Opus 4.8 and GPT-5.5 are proprietary and estimated to be in the 1-2T parameter range conservatively. If an open weight (MIT licensed) LLM can come so close to a closed weight model estimated to be 1.5 to 2 times bigger, it is clear that actual intelligence has plateaued significantly.
It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer. DeepSeek V4 Pro (1.6T params, 49B active, 44 AA Intelligence Index score) has a ludicrous 94% hallucination score on the AA-Omniscience benchmark, meaning on questions that it couldn’t figure out, it only stated that it didn’t know around 6% of the time, and the rest it confidently hallucinated an answer. GLM-5.2 scored a 28% hallucination rate, Opus 4.8 was 36%, Fable 5 was 48%, and GPT-5.5 was 86%.
That seems incredibly rough for such a huge, popular model. Let’s test it with a relatively complex Python question with a clear architectural flaw.[1]
DeepSeek V4 Pro used almost 10 times the reasoning tokens yet produced a confidently incorrect response. On the other hand, it took GLM-5.2 just 12 seconds and about 800 reasoning tokens to recognize the technical impossibility of a single-threaded task executing multiplexed I/O without ever yielding or utilizing system polling. (For the non technical, this is like asking a delivery driver to drop off packages at 3 houses at the same time without ever stopping the truck.)
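The test prompt itself is not reproduced here, so the following is only an illustrative sketch of the underlying point, not the question that was asked. It shows one thread reading from several sockets using Python's standard `selectors` module; the helper name `multiplex_once` is invented for this example. The single thread can only serve all the sockets because it hands control to the operating system's polling call.

```python
import selectors
import socket


def multiplex_once(messages):
    """Read one small message from each of several sockets in a single thread."""
    sel = selectors.DefaultSelector()
    pairs = [socket.socketpair() for _ in messages]
    try:
        for i, ((reader, writer), msg) in enumerate(zip(pairs, messages)):
            reader.setblocking(False)
            sel.register(reader, selectors.EVENT_READ, data=i)
            writer.sendall(msg)

        received = {}
        while len(received) < len(messages):
            # select() is the system polling step: the thread yields to the OS
            # here and is told which sockets are ready to be read.
            ready = sel.select(timeout=1.0)
            if not ready:
                raise TimeoutError("no socket became readable")
            for key, _ in ready:
                # Small local messages are assumed to arrive in one recv() call.
                received[key.data] = key.fileobj.recv(1024)
                sel.unregister(key.fileobj)
        return [received[i] for i in range(len(messages))]
    finally:
        sel.close()
        for reader, writer in pairs:
            reader.close()
            writer.close()


print(multiplex_once([b"house 1", b"house 2", b"house 3"]))
```

Remove the `sel.select()` call and the thread has no way to learn which socket is ready without either blocking on one of them or spinning, which is the contradiction a requirement of "no yielding and no polling" runs into.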
GPT-5.5 and DeepSeek V4 Pro are two of the clearest hallucination leaders, despite being absolutely huge. Because of their immense size they simply did not learn how to say “I don’t know” or recognize intricate logical and technical fallacies. While it is true that a multi-trillion parameter model will always beat a lightweight consumer model on paper (today at least), the commoditization of these huge models is blurring the line between benchmark performance and actual real-world truthfulness and accuracy.
We should be very cautious about blindly increasing reasoning budget, corpus size, or parameter count. DeepSeek V4 Pro spent 3 minutes and 26 seconds wasting compute in a reasoning loop (raw reasoning here) just to generate a beautifully structured, confidently incorrect solution. Yet, a model half its size identified the paradox almost instantaneously. Even in today’s era as we near AGI, many of the biggest models will actively convince you that a solution is correct and that the problem was solvable as stated.
Moving forward, the industry cannot continue to train bigger and bigger models since their intelligence not only plateaus but often will get worse. This applies for the consumer too, since we cannot continue to select models based on size or theoretical performance alone. Training and selection of AI needs to be designed around the unsolved trilemma of modern LLMs: raw capability, uncertainty calibration/hallucination rate, and computational efficiency.
Copyright (c) 2026 Oliver Shrimpton. All rights reserved
However the fear has to arise in the first place, to raise the alert.
Being an LLM is easy!
Might be why we're already rarely seeing models output an "I don't know".
> Can I trust the output you give me?
And I assume it explains what to trust VS not.
I think at the bottom you should also see something like "Any text can contain mistakes" or similar too, which I know is a far cry from what some people push in the press in regards to capabilities, but I still don't see the platforms themselves as lying about this, while I do see a bunch of people constantly over-hyping the possibilities.
Idk if it was the harness (OpenCode), my AGENT or my prompts, but I was getting exactly what I wanted, and quickly.
With GPT-5.5 it tries to play smart, takes much more time and is often stuck on basic stuff that DeepSeek solves in one shot.
The question-tokens define the answer-tokens. That's it. The art lies in clustering the relevant weights together.
I've yet to come across a human developer whose output would meet this standard, despite writing every line.
In fact, having an LLM review our code is catching quite a few bugs before it reaches QA.
I'm an experienced developer, but I don't count myself as a web dev or a python dev; I can review the web and python stuff I get out of the AI (sometimes I need to ask the AI follow-up questions so I can find official documentation for what it did), but I can't write it.
LLMs don't have this benefit. If you forget to add the correction to the system prompt, the LLM will repeat the same mistake over and over, and worse than that, their mistakes aren't based on their understanding; they're basically random guesses.
Humans, even bad coders, still seem to have some sort of architecture in mind, even if it's spaghetti, whereas LLMs (obviously) don't think more than a few steps ahead, and never about the full scope of what they're contributing to, and that is on purpose too, because you want the context to be as small as possible when you work with LLMs.
With LLMs you need to tread carefully between "What does the LLM need to know?" and "Can I skip passing this to the LLM this time?", while with a human you can more or less dump everything you're sitting on onto them, let them sift through it, and they'll mostly make it out OK.
If your example had "Validate any details before sharing them with the user, with multiple sources" as the system prompt, used a model that is strong at following system prompts precisely, and had access to some basic tools, then it'd spend maybe minutes more, but the answer would have been way more accurate.
But no, Google wants "the new search results" (LLM hallucinations) to be on top, so we end up with "sounds plausible" answers instead of "Collection of evidence from reliable/semi-reliable sources" or similar, which sucks. We could have quality, but it's too expensive/slow, so we get slop instead, just to maximize for speed and convenience.
And they do it faster than any human developer.
You have any session logs or similar that show this? Never once, since I started using the codex TUI when it became available, have GPT models gotten stuck on something another model breezes through. I quite literally run every prompt I do through multiple providers, so this would be very visible very quickly for me.
I remember trying every -codex variant of the models and could never get them to be productive for tasks taking longer than 5-10 minutes, compared to GPT 5.5, which quite literally worked through the night (with the /goal feature) and actually had something valuable and useful in the end this morning that wasn't exploding in LOC and complexity. I don't think any of the -codex variants would have been able to do this at all, based on how they worked when I last used them.
Is this a good idea? Probably not—in the past we would only do that when the architecture was causing serious problems since it always has tons of behaviors that will accidentally not get carried forward, some of which are load bearing and will cause bugs.
Now we can do it in an afternoon and get the same long term bug behavior.
When someone asks a question, if I don't know the answer; I say I don't know.
System 1 vs 2 doesn't really matter... I won't use an LLM that's willing to make up random shit. Equally I also won't work with a human who does that. Trust and confidence a system will function correctly is an important quality, in both humans and genai
Now granted, if the boat salesmen were pushing hard on the idea that the boat would fly and even put little wings on the side and I bought the boat I might get really angry when I found out that it didn't fly. And I might angrily storm into the salesroom yelling about how the design is defective. But if someone pointed out 'hey, it's a boat perhaps you should stick to sailing around in it and stop getting your undies in a bundle about it not flying' the correct response is probably to take a closer look, ignore the salesmen, and cruise around the lake. LLM's are quite handy at some things and have some weird limits. Learn the limits, enjoy your time at sea.
Haven't heard about that law, but seems unlikely we can come up with ("discover") any sort of law that uses a concept ("truth") humans can't even agree what it means, and that's not for a lack of trying, we've been trying to figure it out for millenniums already with no end in sight.
Circuits which emerge in the layers during training are much more complicated than a simple Bayesian relation.
i dont see why software engineers are paid so well, and are so hard to hire?
just dump a bunch of requirements on a homeless person and itll just work out
But anyway, let the LLM verify the code to give advice on improvements but don't let it write code unverified. That's my opinion on it anyway.
There can be, you don't know if the closed source models aren't using something like DeepSeek's Engram.
The humans may skip unit tests and need reminding; the AI always writes unit tests once it's in AGENTS.md or whatever, but my experience* was that 5-10% of the time the LLM's attempt at a "test" would, instead of executing the code and examining the results, open the source code as a text file and run a regex to find/exclude certain substrings.
* At the start of this year, because Anthropic and OpenAI were both offering free trials. IDK how much things have changed since then, some things change fast in this domain, other things don't.
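To illustrate the difference, here is a toy reconstruction rather than anything an agent actually produced. It checks a deliberately buggy function twice: once by pattern-matching its source text, once by running it. All names are invented for the example.

```python
import re

# Source under test, with a deliberate bug: it subtracts instead of adding.
SOURCE = '''
def add(a, b):
    return a - b
'''


def regex_style_test(source: str) -> bool:
    """The fake 'test': only checks that the expected text appears in the file."""
    has_signature = re.search(r"def add\(a, b\):", source) is not None
    has_return = re.search(r"return\s+a", source) is not None
    return has_signature and has_return


def behavioural_test(source: str) -> bool:
    """A real test: executes the code and examines the result."""
    namespace = {}
    exec(source, namespace)
    return namespace["add"](2, 3) == 5


print("regex-style test passes:", regex_style_test(SOURCE))
print("behavioural test passes:", behavioural_test(SOURCE))
```

The regex version passes on the broken function and the behavioural one fails, which is why a suite of the first kind can be green while telling you nothing.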
Whilst I don't claim any true "understanding" as that is a very loaded term that doesn't mean it's just random guesses.
Anyone using recent LLM coding agents on a regular basis would probably agree that there's something going on that fits some non-anthropomorphizing, non-sentience-assigning definition of "understanding".
As for the point about improvement - I think that's an orthogonal issue to the overall code quality. With regard to human codebases - there's plenty of scenarios that negate the improvement of individuals. We're comparing organizations with LLMs - not individuals with LLMs and that makes a significant difference.
It's not that you're holding it wrong, you're just wrong for expecting it to work the way it's described (able to one shot most problems these days).
While DeepSeek describe this as "knowledge lookup", what Engram is really trying to do is separate dynamic reasoning from static pattern recall, with the static patterns just being word-level n-gram statistics, not declarative facts/knowledge.
Just because 2-3 words often appear together in a sequence doesn't mean they represent a fact or truth (or falsehood) - it is just an n-gram statistical regularity.
If Engram helps reduce LLM GPU memory and FLOP requirements then that is great, but it's not a solution for Hallucination.
But the difference I allude to here is more like how "book reviewer" is a different job than "book author": yes, if you can review a book, you can also write one. Eventually.
Riding the exponential means you have to update priors more often.
Maybe the pre-2024 users do, but I've seen plenty of those exact "frontier models never hallucinate" comments on HN as well