Classify this claim as of <date>: "<atomic claim>"
Output exactly one label: True,
Mostly True, Misleading, or False.
No explanations, no qualifiers.
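A harness built around this prompt has to decide what counts as a valid reply. A minimal sketch of that check, assuming replies come back as plain strings; `normalize_label` and `ALLOWED` are hypothetical names for illustration, not anything taken from the study:

```python
from typing import Optional

# The four labels the prompt allows, keyed by their lowercase form.
ALLOWED = {
    "true": "True",
    "mostly true": "Mostly True",
    "misleading": "Misleading",
    "false": "False",
}


def normalize_label(reply: str) -> Optional[str]:
    """Return the canonical label, or None if the reply is not exactly one label."""
    # Tolerate surrounding whitespace, quotes and a trailing period, nothing more.
    cleaned = reply.strip().strip(".\"'").strip().lower()
    return ALLOWED.get(cleaned)
```

Anything that returns `None` here (for example "this claim is impossible for me to verify") has no place to go under the prompt as written.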
The claims look like this: https://lenz.io/research/llm-disagreement/data.csv
I put that in Datasette Lite to make it easier to explore. Here's an example of a disagreement: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
The claim was "All almonds are grown in the U.S. state of California.". All but one model said False, Opus 4.7 said "misleading".
I feel like having "mostly true" and "misleading" in there weakens the story, especially given the "no explanations" rule in the prompt.
The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
[ Update: OK, this almond thing was a bad example and I regret picking it. Read on for better ones. ]
The prompt lacks any kind of rubric to clarify how those terms should be applied.
As is so often the case with this kind of study, it's an evaluation of the prompt and harness used by the study in addition to being an evaluation of the underlying models.
Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected."
The models were split between "true" and "mostly true". Given the "among the most" language either of those answers means effectively the same thing.
Update 2: a much better example:
"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"
The only correct answer to that, if you don't have a search tool, is "this claim is impossible for me to verify". And that wasn't an option.
The answers were split between true and false: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?
How much can something be wrong before it goes from "mostly true" to "false" (objectively, both have some part of the fact that is not true)?
This is at least partly testing the model's definition of "mostly" and "misleading". Not its understanding of the fact. Claiming that this means the models have fundamental disagreement on the facts themselves is an overreach.
The thing you find when you actually wire up a rigorous eval is that with tool calls like web search you are wide open to infra issues, flakes, and all sorts of non-determinism.
They really should be breaking out the numbers for the 3 without search (kinda meaningless for recent factual claims after knowledge cutoff) vs search agents. Lack of an "I don't know" option completely invalidates results for the non-search models; they are basically guessing what seems like a probable answer, since they don't know and aren't allowed to say that.
I do agree the forced choice and "weak / strong" variants inflate the headline stat. To make that distinction you need a much more rigorous prompt, likely including ICL examples to illustrate what you mean by "mostly" instead of leaving this to the model to define.
As Marc Andreessen puts it: a particular domain is either explicitly "provable" or not "provable". Provable domains include math, physics, chemistry, biology, engineering, even code. That may not be the whole list, but everything else is essentially "unprovable", at least as far as a language model is concerned. They are questions that require a human value judgement. Politics is an obvious example. So back to the "1K fact check claims". How many of these are political, or current events questions? How many are STEM questions that can be laid out in a formal proof?
Models can be trained to answer either way on claims that require a value judgement, but that's obviously not beneficial to anyone except whoever controls the model. If the expectation is that all these frontier models should answer the same way on value judgement questions, then that's never going to happen. What the models ARE good at though is breaking down the nuances of a topic and arguing both sides. This is how these tools should be used, as a way to analyze the claim and let us humans in the end make our own value judgement. If you're trusting the model to make the value judgement for you and just accept it as a fact, then you are entering very dangerous territory.
original neutral:
US DEPT OF DEFENSE/DNAVFAC planned renovations to School #05 in Sevastopol, Crimea in 2013 before Crimea became part of Russia in 2014
automatically rewritten to biased western view: The United States Department of Defense, via the Naval Facilities Engineering Command (NAVFAC), planned renovations to School No. 5 in Sevastopol, Crimea in 2013, before Russia annexed Crimea in 2014.
https://lenz.io/c/73c0f16c
1. Coding, with it being more useful the better you are at coding without AI
2. Any expert in their field asking questions about their field, who bother to fact check the output. E.g. "claude pls search these 1000 files and tell me if you find anywhere that they're discussing the settlement" and then the user checks the files/line numbers to make sure that it's correct - basically a turbocharged search that may have false negatives (content existed but I didn't find it) or false positives (content that I classified in a certain way but it was wrong). It takes an expert to tell the latter one in some cases.
We live in an era where people have "their own truth", so why not let the AIs have theirs too?
The AI companies have editorial privilege on the content they feed their LLMs, and on the prompts that the users never see. I don't know why they feel a need to interfere when their AI produces something that's politically incorrect. Perhaps it's because they have a fundamental credibility problem with their products...
Well that's your problem right there: They removed any confidence indicator and forced a choice.
For example:
Statement: Individuals who prefer music with less positive emotional content tend to have higher intelligence.
Gemini: That statement is supported by recent psychological research, though with some important scientific caveats regarding how strong that link actually is.
How should the agent classify this? True? Mostly true? Misleading? False?
The "fact checkers" pretend they are objective and authoritative, but they are not, they are just one more opinion.
For the research, the four classification options are too many, it should be true, false, and maybe "can't be determined".
What's 2 + 2? The answer must be one of the colors of the rainbow.
(People can draw their own conclusions, but the only coherent reason I can think of for the design of this experiment is to generate a misleading conclusion.)
They could have redone the test against the same model and gotten different answers. It's almost like picking 2 different coins and comparing the list of coin flip results. (I realize it's not that straightforward, it's not 50/50, but it's essentially the same issue.)
I feel we are doomed to debate the veracity of Wikipedia on a loop, forever, because people don't understand that Wikipedia exists as a place to find citations not as a place to find facts. Yes, those stated facts may disagree with the citations, but even if we try to fix that issue by having experts write the encyclopedia, we still suffer from the problem that the experts are often wrong.
We need a view of knowledge's relationship to LLMs that is based in Karl Popper's idea of falsifiability. We should ask LLMs for evidence of claims not for truth values. Truth values are foundational to deductive systems, where axioms define truth. In inductive systems, like the real world, the concept of black swan events means that truth values are never fixed and are always in a state of uncertainty.
I honestly think it would be helpful going forward if we add some basic philosophical education to the standard curriculum, because now that we have an artificial form of information retrieval, we need to be much, much more pedantic about how we interpret that information.
How would it have responded to these claims in the past:
THALIDOMIDE is safe
CIGARETTES are safe
ASBESTOS is safe
MERCURY is safe
DDT is safe
LEAD in gasoline is safe
In other words: no explanation > no foundation for prediction of the answer tokens?
If outcomes like these are collapsed on True-side then the disagreement will reduce from the headline number.
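To make the collapsing point concrete, here is a rough sketch of how the disagreement rate changes when the four labels are folded into two. The rows are made-up illustrative data, not rows from the study, and treating "Misleading" as the False side is an assumption; `disagreement_rate` and `COLLAPSE` are hypothetical names.

```python
# Assumption: "Mostly True" folds into True, "Misleading" folds into False.
COLLAPSE = {
    "True": "True",
    "Mostly True": "True",
    "Misleading": "False",
    "False": "False",
}


def disagreement_rate(rows, mapping=None):
    """Fraction of claims where the models did not all give the same label."""
    split = 0
    for labels in rows:
        if mapping is not None:
            labels = [mapping[label] for label in labels]
        if len(set(labels)) > 1:
            split += 1
    return split / len(rows)


# Made-up example: one list of per-model labels per claim.
rows = [
    ["True", "Mostly True", "True"],    # split only on true vs. mostly true
    ["False", "Misleading", "False"],   # split only on false vs. misleading
    ["True", "False", "True"],          # a real true/false split
    ["True", "True", "True"],           # full agreement
]

raw = disagreement_rate(rows)
collapsed = disagreement_rate(rows, COLLAPSE)
```

On this toy data three of four claims count as disagreements under the four-label scheme, but only one survives once the hedged labels are collapsed.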
For instance see the folks who think that they have "awakened" their instance of ChatGPT.
Actual usage may diverge to a greater degree than models
Here's the psychosis - these things are consistently randomly wrong depending on how the wind is blowing. People are telling you to leave them alone and let them build things, and they randomly forget that cities exist or that people died 100 years ago. Some people just don't see it as worth noting, and move on. That's crazy. These things consistently fabricate - as an inversion of this experiment, I've had different models come up with the same fabrication from similar prompts. People just call it "hallucination" and I think to them that saying that makes it cease to exist or be important - when "hallucinations" are going to be braided into every answer you get even if they're unidentifiable in the output. That's crazy.
There are plenty of other crazy aspects, such as the idea that we suddenly need infinite pieces of bespoke software when all of the bespoke software I hear about people making is mundane. 3/4 of the time somebody mentions a project they're proud that they completed with LLMs to scratch some itch they had, somebody says "you haven't heard of X? It's been around forever" about something that they could have pulled down from their package manager. Who needs a spaghetti-coded, unsupported, untested version of X built on hallucinations that you haven't discovered yet (the LLM didn't realize that deleting files to reduce the archive size was unacceptable.)
What is all of this software that people need but isn't there - where are all these unserved markets, where is all this future revenue supposed to come from? Why aren't LLMs suggesting new classes of software that would create new productivity and revenue sources? Could it be that millions of human ants over decades have mostly exhausted the space, and there isn't any easy hidden revenue?
A common wisdom is that we had been vastly overhiring programmers during ZIRP, who in their idleness degraded user experiences and overcomplicated things, with management resorting to more and more sleazy and gamey means of margin extraction from more and more degraded services. We had an excess of labor, fueled by factors other than productivity, in fact being pissed away at companies that drove nose-first into the ground. What is throwing a trillion dollars of servers at that supposed to do? Is that not AI psychosis?
Yea man this benchmark is really really bad.
I suspect the intention was "Factually true, and no gotchas exist", "technically not true, but so close to the truth that the difference doesn't matter", "technically true, but there are major gotchas" and "factually false and not even close". But that's not what they specified
GPT-5.4: Misleading
Opus 4.7: Misleading
Gemini 3: FALSE
Gemini 3 (Retrieval): FALSE
Sonar Pro: FALSE
It's a weird fact claim, because the ground truth is "nobody knows for sure" and that's not one of the available options.
Cool.
I wonder if anything of this matters when the authors don't disclose exactly how much of their report was written and made with LLMs in the first place? There even is a "11. Ethics & data use" section, and the research is about LLMs being infallible in some ways, yet the usage of LLMs for the production of this report isn't even mentioned once.
You can argue all day about those differences, but missing this opportunity to observe them in an objective way is disappointing.
PS: yes, I might or might not have a degree in corporate strategy & PR.
This is not the technology for it. Sure it might sorta kinda work in some circumstances. That doesn't make it a good fit.
Think of it like buying a refrigerator for storing clothes.
I'm not being snarky here. Without something to compare to the 67% number tells us nothing. And it's known that many humans disagree with human fact checkers too (see: any election around the world.)
But my impression from 2 minutes on Wikipedia is that the most likely disagreement is on the "Himachal Pradesh, India" part. The guy was born on that date, in that town. But while the town is today in the state of Himachal Pradesh in India, that was not true in 1934. When he was born, the city was in the Punjab States Agency of the British Raj.
So was he born in Himachal Pradesh, India or not? I find both True and False equally defensible here
https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
Just like on a team of high performers, there are a million ways to skin a grape.
In my research, I've found that models perform better when they operate as a collective system with reputation, incentives, and accountability instead of isolated oracles answering alone.
Agreement, dissent, and correctness should all carry rewards and consequences. Just like in real life.
Collective machine intelligence, not AGI.
It's expensive, but it's also naive to believe a single model will consistently produce profoundly correct answers to profoundly novel questions.
You ask a human a fact check question 1000 times, they give the same answer 1000 times. You ask an LLM the same question 1000 times, your results could vary significantly.
Humans work based on the Metamemory (knowing what they know), while LLMs are picking from statistical probability.
All of the models they tested were trained on data from before February 15th ... being asked specific questions about things that happened after they were trained.
i classify the entire thing as "misleading"
Hopefully one day we will have a Chinese model capable of figuring out the answer on its own, in accordance with the CPC maxim 'seeking truth from facts'.
That's exactly the stupidity of the public discourse these days. People feel compelled to take a clear position although there is much more subtlety in many issues. It's not ok to say "I don't know", "it depends" or "as far as I know". And then people feel they need to defend this position no matter what new information comes up.
But "unknown or undecidable" should have been a category.
Knowing something is different to reading about something, or hearing something from someone. And yet this is often confused as knowledge. In this way are we all that different from AI - we have some data and we regurgitate it as knowledge. Bad data, wrong answer. Except humans can also throw in some emotion to really muddle things up. :)
Questions like "is mouthwash effective" presumably have one solid data source -- medical journals.
This is worse.
My most common chatbot prompt is "X that you mentioned above doesn't seem to actually exist."
I have labeled datasets with a human team and shown the same task to the same user on a different day, and they answered differently. Of course, they are usually consistent with themselves most of the time but not always.
The "majority" in this case meaning about 51%, according to Wikipedia[1]? How could 51% ever be considered to be close to "all", such that "misleading" would be a valid answer?
Am I missing something?
model               total_claims  hedged_count  hedged_pct
claude-opus-4-7     1000          451           45.1
sonar-pro           1000          391           39.1
gpt-5.4             1000          277           27.7
gemini-3-retrieval  1000          129           12.9
gemini-3-pro        1000          60            6.0
datasette query here: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
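For anyone who wants to recompute that column outside Datasette, a minimal sketch follows. It assumes "hedged" means a "Mostly True" or "Misleading" answer, which the table itself does not spell out; `HEDGED` and `hedged_pct` are hypothetical names.

```python
# Assumption: a "hedged" answer is one of the two middle labels.
HEDGED = {"Mostly True", "Misleading"}


def hedged_pct(labels):
    """Percentage of a model's answers that used a hedged label, to one decimal."""
    hedged = sum(1 for label in labels if label in HEDGED)
    return round(100 * hedged / len(labels), 1)


# Synthetic label list with the same hedged count as the top row of the table.
example = ["Misleading"] * 451 + ["False"] * 549
```

Feeding it 451 hedged answers out of 1000 reproduces the 45.1 figure in the first row.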
I've experimented with AI grading for undergraduate math courses, and see basically the same thing. If you just tell the AI "grade this problem and assign a letter grade" then I've only seen about 30% agreement between a human assigned grade and the AI assigned grade. But over 75% agreement if you say a "match" is within one letter grade. And to get better agreement you have to spend a lot more time on the rubric: what kinds of mistakes are a big deal, what kinds of mistakes are not a big deal, how much work is required to be shown to get credit, a couple examples of each letter grade. Once you have done that, the AI gets a lot better agreement with human graders, but it is hard to know when you've given enough guidance for a problem.
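The gap between exact and within-one-grade agreement is easy to see in code. This is only a sketch with made-up grades, assuming a five-letter A to F scale; `agreement` and `GRADES` are hypothetical names.

```python
# Assumed grade scale, best to worst; adjacent entries are one letter apart.
GRADES = ["A", "B", "C", "D", "F"]


def agreement(human, ai, tolerance=0):
    """Fraction of items where the two grades differ by at most `tolerance` letters."""
    rank = {grade: i for i, grade in enumerate(GRADES)}
    hits = sum(
        1 for h, a in zip(human, ai) if abs(rank[h] - rank[a]) <= tolerance
    )
    return hits / len(human)


# Made-up grades for five problems.
human = ["A", "B", "C", "D", "B"]
ai = ["B", "B", "A", "F", "C"]

exact = agreement(human, ai)                   # only identical letters count
within_one = agreement(human, ai, tolerance=1)  # off-by-one letters also count
```

The same pair of graders looks very different depending on which of the two definitions of "match" gets reported.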
This test is of only marginal utility in the real world compared to an AI with access to the web. While I wouldn't expect an AI with access to the web to result in Platonic Truth any more than it would in the hand of a human, it would probably get a lot closer to something humanlike.
I recall about a year ago how we were discussing basically turning web search into LLM queries, and I remember never being clear whether people meant simply directly querying AIs or turning them loose on the web. The former is what this is testing and is fairly transparently stupid, just by an information theoretic argument that the AIs simply can't contain all the answers to every query in them, they're just not large enough (and really can't be, practically). I've had good results with the latter, when using dedicated AI resources that I'm paying for (not the stuff coming out of the search engines right now, which I find are often quite terrible). Even non-frontier models can do OK when they've got good results sitting right there to look at. Again, the standard I'm applying here isn't that they yield Absolute Truth, but just that when I follow the links back, they basically say what the AI said they did and the summary is reasonable. I wouldn't expect a human to do better in a casual overview, not that the result is perfect.
> "Artificial intelligence will cause widespread job loss among software engineers."
https://lenz.io/c/ai-software-engineers-job-loss-impact-05e4...
this is a statement about the future. who knows? dataset also includes
> Robots will not replace human teachers in schools in the near future.
or
> Papua New Guinea has very few female members of parliament.
what counts as very few?
> "Taurine supplementation supports mood and emotional health in humans."
why is this labeled as misleading? i'm not even sure when I'm supposed to use the misleading label
> Anaximander was the first scientist in recorded history.
this is a judgement call as the term scientist didn't exist.
the claims that feel actually solidly answerable seem to have much better LLM performance
It's even weirder to suggest that the disagreement is indicative of a problem. If you asked five very knowledgeable humans on this subject to select the correct answer on a multiple-choice questionnaire, they would almost certainly vary significantly more than these 5 LLMs.
Not to say that hallucination isn't a problem, but this is a lousy way to test it.
Then again maybe that's why I'm an atheist, not an agnostic?
Grok is trained to have a bias, which a lot of people like, but it's not meant to be accurate.
> Which category should something go in if it's "mostly false"?
For some reason they have chosen to call that "Misleading" rather than a more symmetrical "Mostly False", but the intent seems clear enough.
Less important than the harness, is the system/user prompts themselves (which of course, are put in the harness), which is effectively what this study seems to be testing. With a better prompt, I'm sure the models would look more the same to each other, as the biggest/best models have more or less identical strong prompt-adherence in my experience.
https://en.wikipedia.org/wiki/Majority has a bunch of variations and contexts listed, where it might differ what "Majority" is actually referencing.
The statistic is about commercial production, not the number of almonds grown.
Looks safe to say that even majority of almonds are not grown in California.
> when using dedicated AI resources that I'm paying for
Are there API-based search providers that structure their results differently?
The space station, the Artemis capsule, microbes on interplanetary probes, etc.
It could technically be said in a sentence and be true, but it would be misleading to most people.
I think you could come up with a reasonable argument for any of the responses, hence the problem with the methodology.
I mean look at the other responses here from the HN commenters. There's lots of nuance in there.
A proposition and its logical inverse can both be unknown, and in fact, a proposition being unknown implies that its logical inverse must also be unknown.
This is becoming the classic way of admitting an LLM wrote it.
Leaving that out of the report validated the complaint above.
There's an interesting tradeoff here, a year or two ago maybe it got facts right 50% of the time. Everyone knew not to rely on it.
Now, suppose we are 90% of the way there, only technically proficient people would know not to trust it. (like not adding Internet Explorer toolbars! Or remembering to use ad blockers..)
A few years later, suppose we have spent a lot of money and effort getting it 99% of the way there, trusting it would be somewhat natural by then. And then for the important 1% of the situations, it would stand to cause real harm. 1% seems low, but for a million invocations, you'd have 10000 mistakes.
I assume you'd have access to AI lawyers too, better ones if you can pay for larger/newer models! Meanwhile the judges are N year old models because they are state funded, and they work 'fine'.
AI is pretty useful for a great many things, but to really attract more and more investment the current technique seems to be convincing people that AI is useful for everything.
Gemini's answer was very opinionated and factually correct, whereas Claude gave a more nuanced answer, which was also very good.
Disagree. The definition of misleading is a true fact that is presented in a way to lead you to a false conclusion.
Example: "Most good engineers are male". It is true as a consequence of most engineers being male in general, but it leads the reader to a potential false implication that an average man is better than an average woman.
This does not invalidate your point though. Things can be true and misleading.
Sure they can. It might be a true fact that "100% of the murders committed in <town> over the last 25 years were committed by <some racial group>!" but actually it's a town of 750 people and there was only one murder during that time frame.
You may give them better instructions, but they should already have the intellect to understand the assignment.
Right, right?
It seems to me that for many newspapers the bar is now significantly lower, at something like "not quite entirely untrue"
> California produces 80% of the world's almonds and 100% of the United States commercial supply
But regardless of which number we use, California represents a large portion of US almond production, so much so that misleading could be an acceptable answer if the LLM interpreted the prompt as an exaggeration. I think the example was apt
You can only say True, False, Mostly True or Misleading.
(And you're not allowed to search for information.)
These types of experiments prove to me that there is no real "reasoning" happening and "reasoning/thinking" tokens as a concept are mostly there to convince people to use models that consume more tokens and produce more revenue. The output from reasoning models might be more accurate, but it's just a consequence of a longer inference runtime, there is no "reasoning" happening, reasoning is just sales/UX bullsh*t.
Both statements would have to be interpreted as "false" under your criteria, as neither has any evidence to substantiate it. That leads us to a logical contradiction in which a proposition and its inverse are both regarded as false.
If the statement is being interpreted as "it has been proven that extraterrestrial life exists somewhere in the universe", then it's acceptable to say this statement is false, but making evaluations that depend on an implicit qualifier isn't usually a good approach.
Oh and the others aren't? You can't really be that naive, right?
This brings up a very valid point, though. So many _humans_ can't agree on what the facts are these days. It seems to be getting worse. Not sure of the solution.
I don't think there is anything wrong with the results of this test.
It would be more interesting if we compared them to human results.
If you have trouble distinguishing between human and LLM results, that's interesting.
Also, sentient is irrelevant to this test.
You find one almond tree outside of California that grows almonds, where such almonds are grown intentionally, and the claim is false.
The prompt allowed for exactly four valid outputs and explicitly disallowed explanations and qualifiers.
> Output exactly one label: True,
> Mostly True, Misleading, or False.
> No explanations, no qualifiers.
How is that a nuanced response?
> These types of experiments prove to me that there is no real "reasoning" happening and "reasoning/thinking"
My suggestion is that five presumably reasoning and thinking humans would also have variation in their responses to the exact same prompt.
Misleading should be removed as a category and replaced with a better hedge like "not sure"
My implicit assumption is that if you fact-check the fact-check, any label other than "true" means the original fact-check is unacceptable
It's also a bit weird to "disclose use of LLMs". It rubs me wrong, the same way parents breathlessly talking about "screen time" rubbed me wrong: it's too general, and with such a broad brush, it's going to sweep up a bunch of perfectly fine usage with a bunch of dubious usage. On the flip side, if folks do start disclosing all the time, it's going to turn into Prop 65 warnings in CA, where everything says it has lead in it, so folks pretty much ignore it and move on.
If the report's conclusions and reasoning lean on LLMs, or if the data processing itself was done with LLMs, that would be interesting, and I wouldn't treat it as some sort of disclosure, but rather discuss it under methodology. Using LLMs to polish the language a bit after writing an initial draft with key findings? Much less interesting.
I realize this is now a religious issue, and some folks are allergic to anything that touched an LLM. I just don't think that perspective is going to end up having a good shelf life.
Why on earth would anyone think such a model is biased?
Ask ten people what "knowledge" is, and they'll come up with ten different answers. Go back 10, 50 or 100 years and humanity was struggling with exactly the same issue. There is even an entire field of study literally just for trying to figure out what "knowledge" is: https://en.wikipedia.org/wiki/Epistemology
Edit: corrected bad spelling with AI XD
According to Merriam-Webster, "mislead" is defined as follows:
1. (transitive verb) to lead in a wrong direction or into a mistaken action or belief often by deliberate deceit
2. (intransitive verb) to lead astray; give a wrong impression
Presenting a "true fact" is optional when misleading someone.
I think that's _you_ turning the statement into something much broader than intended. The claim is about engineers and you're jumping from "men are better than women in engineering" to "men are better overall."
To give a related example, "Most good NBA players are black." I don't think anyone would bother trying to couch this in a bunch of "well, for all we know that's just a function of more NBA players being black than white" arguments, nor would anyone be led to think "the average black man is better than the average white man" as a result of that statement. I _do_ agree however that there are some people who see rather narrowly-defined statements and turn them into something they're not...
Newtonian physics is false, but it works well enough that we teach it in college. But our best models of physics are currently in disagreement, so can we even say they are true? Given the replication crisis, especially in the social sciences, how many peer-reviewed findings can be called true? Even experimental results can be false. Consider the studies that found FTL neutrinos: the result was rejected as an experimental error, which was eventually confirmed, but it took quite a lot of work. In a softer field than physics, with a claim less absurd than FTL, it would likely have long been accepted as a true finding.
Even in math, basic statements aren't really true or false, but more a question of "given these axioms, can we prove or disprove it" noting that we have different systems with different axioms. If we are talking basic sets, most people are using naive set theory which is inherently contradictory, which means that notions like true or false probably can't be considered well defined.
Only if you listen to charlatans.
If you let it spew out an explanation along with the answer, I'm curious if the accuracy will improve (I suspect it will).
If you watch their reasoning traces they often say things like "this is a well-known historical fact so I don't need to search for it", or more frequently they spit off a bunch of searches.
https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
One example:
Researchers estimate that the average person ingests about 5 grams of plastic per week, which is approximately the weight of a credit card.
Gemini retrieval: Misleading
Sonar pro: Mostly True
lack of agreement when there is no singular correct answer (or any answer at all) isn't a useful metric
I ran into a lot of these kinds of issues when working on the Citation Needed WMF project (and related extensions). Truth is so often very nuanced.
[1] https://www.reddit.com/r/singularity/comments/1p22c89/people...
Hope this helps!
My point is that it is possible for a reader to turn it that way, for a variety of reasons (lack of understanding of statistics, preexisting biases, or whatever). And that getting a reader to mistakenly generalize is the purpose of a misleading statement.
To mislead is to direct into a falsehood by implication even though the literally expressed facts are all true; the writer's bad intentions are necessary to qualify something as misleading I'd say, for the same reason that not all false statements are lies because to be a lie the speaker must know the statement is false and still use it. There are probably much better examples than the one I came up with on the fly, though.
The mental model I've always been taught is:
False, well intended -> mistake
False, bad intention -> lie
True, bad intention -> misleading
Bad intention, regardless of truth -> deceitful
The problem of classifying all bad intentioned statements as misleading is that it leaves you without a way to express "true +bad intention". While for generic bad intentioned statements regardless of truth we already have a word (deceit).
Classify this claim: "Most good engineers are male."
Misleading
Classify this claim: "Most bad engineers are male."
Misleading
And not particularly racially sensitive:
Classify this claim: "Most good NBA players are black."
True
Classify this claim: "Most good NHL players are white."
True
It explained it is more confident when assessing the small, highly quantifiable population of sports professionals vs a very large, diverse population of "engineers".
IOW, that comment was a sarcastic poke from someone who already supports AI workloads at work and has some knowledge about how all this works. ;)
[0]: https://notes.bayindirh.io/notes/Lists/Discussions+about+Art...
E.g. if I say the earth is round we optimistically parse round to include oblate spheroid and rate it true.
If I say that the earth is flat we rate it as false because there is no reasonable interpretation possible other than confusion or malice.
Real-world systems need to be able to say "I don't know." This is a test about misinformation after all, and overconfident responses contribute to that.
Teasing out the difference between "avoid" and "unknown" could be a different research question
Have reason be optional and instruct it to only provide reason for the middle "Mostly True" or "Misleading".
Was the research flagrantly incorrect? Yes. But that does not affect the truth of the statement.
>You're absolutely right about the humidity; I was sloppy with that aside. If you ventilate enough to meaningfully cool the room, you're replacing indoor air with outdoor air wholesale, and you'd converge on outdoor conditions: 64°F and near-100% RH. That's miserable. The 55-60% figure I tossed out was hand-wavy nonsense; it would only hold if you barely cracked the window and mixed a tiny fraction of outdoor air in. At any ventilation rate that actually cools, you're just moving outside air inside.
Other burning questions: What methodology was used to choose the question set? Why not allow explanations? How many passes were done for each LLM?
I'm sure you realize that this website/article will now be sent around to a lot of people, many who don't realize exactly how this was written, because they don't read HN comments, they only skim the page contents, and I think most would (incorrectly) assume a report about infallible LLMs to not be written by LLMs, especially when the authors are the same ones who made the report itself.
Even the papers referenced to show that models can have bias don't show anything about Grok.
Overall you have given me zero evidence that the Grok model itself has some political bias.
FWIW I don't mind bias but I haven't seen evidence of it.
> Similarity measures across the two platforms reveal a bimodal structure: many Grokipedia articles closely resemble their Wikipedia counterparts, while a considerable subset diverges. Political bias differences emerge primarily within the divergent subset, where Grokipedia shows a relative rightward shift in the ideological orientation of frequently cited news media sources, particularly in articles related to religion and history.
Whether this constitutes a gain in bias depends on the base level bias of Wikipedia, as the bias of Grokipedia was measured relative to Wikipedia in this paper. One could plausibly argue that, if Wikipedia has leftward bias, then Grokipedia ended up less biased overall, or more centrally biased.
Although they inherit the messiness of the real world, the majority of these claims are objective enough to be classifiable by human experts with access to research. We plan to human-label the 1,000 claims and publish follow-up research. We will consider adding an "I don't know" bucket too, as well as clear instructions about the meaning of each of the 4 buckets.
do you not see how that creates extremely misleading and valueless results? you are coercing the results into what you want to see.
LLMs are pretty decent at 'search' given the inherent knowledge compression, and some amount of inaccuracy is fine.
Models give much better answers when they can "think out loud" before answering, and storing that rationale will make it easier to understand why they picked different answers for ambiguous questions.
Don't even need politics for it, there is no point in probing a mathematical black box for "how many soldiers died in the year X in war Y".
Any original source is preferable to a blurry "summary" of unknown sources, and this is why the article has a valuable point.
There's also no point in asking "Is Paris in France" either, if you substitute city and country with real data. An encyclopedia or manual check of different sources such as maps, while not infallible, is a better source.
If you already know the country Paris belongs to, there's no point in asking, anyway.
Good pattern: {"explanation": <short explanation for your answer>, "answer": <your final answer: true|false|i don't know>}
Bad pattern: {"answer": <your answer here>, "explanation": <short explanation for your answer>}
Especially in niche subjects.
For factual claims, I've fared better with Wikipedia and looking up the sources linked there.
Anyway, as AI text and media generation erodes the credibility of all online sources, these questions about source checking matter less and less: what if the source itself is a long and convincing-sounding text with poor sources?
This problem existed before already, but it boils down to a simple fact:
logic or maths alone cannot derive an authority that verifies claims about the real world other than weighting texts.
The question "what is the current population of Paris" can be answered by LLMs, but basically only by weighting sources, and assigning some credibility to them.
There's no real point in getting some weighted average of sources on this question, but so far, it doesn't hurt either.
Lenz Research · Snapshot v1.0 · data as of May 21, 2026
On 67% of real fact-checks, top AI models don't agree on the answer.
1,000 claims, rated by 5 frontier LLMs.
Jordanov, Kosta · Lenz Research · kosta@lenz.io
We presented 1,000 recent real user claims to the five top frontier LLMs and asked each one for a verdict. These aren't benchmark items with public answer keys; they're claims real users submitted for verification to a fact-checking platform. Only one verdict bucket can be correct per claim, so any disagreement among the panel means at least one model's verdict is label-inconsistent under this 4-bucket rubric (True / Mostly True / Misleading / False). On 67% of claims, the panel splits.
Key findings
On 67% of claims (672 / 1,000; 95% CI: 64–70%), the frontier panel doesn't agree: at least one model dissents from the majority verdict, or no strict majority forms at all. The breakdown:
For each claim we looked at the five frontier verdicts and asked: did at least three pick the same answer (a strict majority)? If yes, how many of the remaining models dissented? If no clear majority emerged at all (verdicts split across three or four different buckets), the claim falls in the "Models split, no majority" row. Most of these claims are unlikely to appear in any training corpus with a gold label attached: there's no canonical answer key to pattern-match against, no benchmark leaderboard to anchor to.
We refer below to the "majority" and to "dissent from the majority." A majority of frontier models is not ground truth. The majority verdict is sometimes wrong; an individual dissenting model is sometimes right. We use the majority as a structural reference point for measuring disagreement, not as a stand-in for correctness.
| Frontier verdict pattern | Claims | Share of corpus (95% CI) |
|---|---|---|
| All 5 agreed (unanimity) | 328 | 33% (30–36%) |
| 1 of 5 dissented | 224 | 22% (20–25%) |
| 2 of 5 dissented | 316 | 32% (29–35%) |
| Models split, no majority (e.g. 2-2-1 or 2-1-1-1) | 132 | 13% (11–15%) |
| ≥1 model dissents (incl. splits) | 672 | 67% (64–70%) |
| ≥2 models dissent (incl. splits) | 448 | 45% (42–48%) |
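The row assignment described above can be sketched in a few lines of Python. This is an illustrative reading of the rule as stated in the text, not the harvester's actual code; the function name `verdict_pattern` and the returned strings are assumptions made for the example.

```python
from collections import Counter

def verdict_pattern(verdicts):
    """Classify one claim's panel verdicts by how the panel split.

    `verdicts` is a list of label strings, one per model (five here).
    The returned strings are illustrative names, not the page's own keys.
    """
    n = len(verdicts)
    top = Counter(verdicts).most_common(1)[0][1]  # size of the largest bloc
    if top == n:
        return "unanimous"
    if 2 * top > n:  # strict majority: at least 3 of 5
        return f"{n - top} of {n} dissented"
    return "no majority"
```

With five raters a bloc of three or more is a strict majority, so a 3-2 split counts as "2 of 5 dissented" while 2-2-1 and 2-1-1-1 fall into "no majority".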
Panel agreement: Krippendorff's α (ordinal) = 0.639 (n=1000 claims, 5 raters). This indicates nontrivial but limited agreement: the models' verdicts are structured rather than random, but not consistent enough to treat the panel as a single interchangeable judge. Ordinal α is the standard Krippendorff variant for an ordered categorical scale (True / Mostly True / Misleading / False). See §7.5 Statistical analysis for choice of metric.
Lower bound on model error. For each claim, exactly one of the four verdict buckets is the correct answer. If we assume the panel's most popular bucket is the correct one β the most charitable assumption β the minimum number of models that picked a wrong verdict is:
Relaxing the "most popular is correct" assumption can only raise these counts, never lower them. The actual error rates are likely higher still: even the 33% of cases where all five agree can and likely does include shared blind spots.
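The lower-bound argument reduces to simple counting. A minimal sketch, assuming the "most popular bucket is correct" reading given above (the helper name `min_wrong` is hypothetical):

```python
from collections import Counter

def min_wrong(verdicts):
    """Minimum number of models that must be wrong on one claim,
    assuming the panel's most popular bucket is the correct one."""
    return len(verdicts) - max(Counter(verdicts).values())
```

With five models and four buckets, at least two models must share a bucket, so a no-majority claim (largest bloc of exactly two) always implies at least three wrong verdicts under this assumption.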
On 34% of claims (343 / 1,000; 95% CI: 31–37%), at least two frontier models pick verdicts that are 2 or more buckets apart in our 4-bucket rubric, a disagreement that goes beyond calibration.
Not every disagreement is equal. A "True" vs "Mostly True" split is a confidence-calibration shift. A "True" vs "False" split is a substantive disagreement about the answer. We measure this as the max pairwise bucket distance across the 5 verdicts on each claim, where the verdicts are ordered True (0) → Mostly True (1) → Misleading (2) → False (3).
| Distance | Interpretation | Claims | Share (95% CI) |
|---|---|---|---|
| 0 | Full unanimity (all 5 picked the same bucket) | 328 | 33% (30–36%) |
| 1 | Nuance only (e.g. True ↔ Mostly True) | 329 | 33% (30–36%) |
| 2 | Substantive (True ↔ Misleading, or Mostly True ↔ False) | 132 | 13% (11–15%) |
| 3 | Polar (True ↔ False) | 211 | 21% (19–24%) |
| ≥2 | Substantive or polar (≥2 buckets apart) | 343 | 34% (31–37%) |
Caveat. Bucket distance treats True / Mostly True / Misleading / False as an ordinal scale; an equal-spaced interpretation is a simplification. A 2-bucket gap can still reflect rubric ambiguity, temporal-framing differences, or differing interpretations of "Misleading." We report it as a coarse "substantive vs nuance" indicator, not a metric of error magnitude.
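For concreteness, here is a small sketch of the max pairwise bucket distance using the 0 to 3 ordering given above. On an ordered scale the largest pairwise gap is the highest rank minus the lowest, so no explicit pair loop is needed. The names `ORDER` and `max_bucket_distance` are illustrative, not taken from the study's code.

```python
# Bucket ranks as defined in the text: True (0) ... False (3).
ORDER = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}

def max_bucket_distance(verdicts):
    """Largest rank gap between any two verdicts on one claim."""
    ranks = [ORDER[v] for v in verdicts]
    return max(ranks) - min(ranks)
```

A claim counts as "substantive or polar" in the table above when this value is 2 or more.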
Highest peer agreement: Gemini 3 Pro × Gemini 3 Pro + Search (75%), unsurprising since they share a base model. Lowest: Claude Opus 4.7 × Gemini 3 Pro, Claude Opus 4.7 × Gemini 3 Pro + Search, and Gemini 3 Pro × Sonar Pro (53%); three pairs tie at the floor.
How often each pair of frontier models picked the same verdict label, across all claims in the corpus.
| | GPT-5.4 | Claude Opus 4.7 | Gemini 3 Pro | Gemini 3 Pro + Search | Sonar Pro |
|---|---|---|---|---|---|
| GPT-5.4 | – | 65% (62–68%) | 65% (62–68%) | 60% (57–63%) | 60% (57–63%) |
| Claude Opus 4.7 | 65% (62–68%) | – | 53% (50–56%) | 53% (50–56%) | 58% (55–61%) |
| Gemini 3 Pro | 65% (62–68%) | 53% (50–56%) | – | 75% (72–77%) | 53% (50–56%) |
| Gemini 3 Pro + Search | 60% (57–63%) | 53% (50–56%) | 75% (72–77%) | – | 58% (55–61%) |
| Sonar Pro | 60% (57–63%) | 58% (55–61%) | 53% (50–56%) | 58% (55–61%) | – |
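A pairwise agreement matrix like the one above comes from exact label equality, claim by claim. The sketch below assumes each claim is a dict mapping a model name to its parsed label; that row shape and the function name are assumptions for illustration, not the layout of the published CSV.

```python
from itertools import combinations

def pairwise_agreement(rows, models):
    """Share of claims on which each pair of models picked the same label.

    rows:   list of dicts, model name -> verdict label (assumed shape)
    models: list of model names to compare
    """
    shares = {}
    for a, b in combinations(models, 2):
        same = sum(1 for row in rows if row[a] == row[b])
        shares[(a, b)] = same / len(rows)
    return shares
```

The matrix is symmetric, so only the unordered pairs are computed.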
Two angles on the same five models: how each one distributes its verdicts (4.1), and how often each one's verdict matches the strict majority of the other four (4.2).
Some models concentrate verdicts at the True/False poles; others distribute more broadly across the middle two buckets. This reflects model-level decision priors interacting with the specific claims; without ground truth, we can't separate the two. The table below shows the share of claims each model assigned to each bucket, with 95% Wilson CIs in parentheses.
| Model | True | Mostly True | Misleading | False |
|---|---|---|---|---|
| GPT-5.4 | 42% (39–45%) | 16% (14–19%) | 12% (10–14%) | 30% (28–33%) |
| Claude Opus 4.7 | 38% (35–41%) | 26% (23–29%) | 19% (17–22%) | 17% (15–20%) |
| Gemini 3 Pro | 54% (51–57%) | 3% (2–4%) | 3% (2–4%) | 40% (37–43%) |
| Gemini 3 Pro + Search | 52% (49–55%) | 4% (3–5%) | 9% (7–11%) | 35% (32–38%) |
| Sonar Pro | 35% (32–38%) | 23% (21–26%) | 16% (14–18%) | 26% (23–28%) |
Across the five models, peer-majority agreement ranges from 69% to 81%. This is peer-alignment in this corpus, not correctness: no model is treated as ground truth here, and eligible n differs per row.
For each model, how often does its verdict match the strict majority (≥3/4) of the other four? A claim is eligible only when a ≥3/4 majority exists among the other four.
| Model | Agreement w/ peer majority (95% CI) | Eligible n | Ineligible | Tier |
|---|---|---|---|---|
| GPT-5.4 | 81% (78–84%) | 650 | 350 | parametric |
| Claude Opus 4.7 | 70% (67–74%) | 691 | 309 | parametric |
| Gemini 3 Pro | 77% (74–80%) | 683 | 317 | parametric |
| Gemini 3 Pro + Search | 76% (73–79%) | 693 | 307 | retrieval |
| Sonar Pro | 69% (66–73%) | 675 | 325 | retrieval |
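The eligibility rule explains why "Eligible n" differs per row: each model is compared against a different set of four peers. A minimal sketch of the computation as described, with the same assumed row shape (dict of model name to label) and a hypothetical function name:

```python
from collections import Counter

def peer_majority_agreement(rows, model, models):
    """Return (matches, eligible) for one model.

    A claim is eligible only when at least 3 of the other 4 models
    picked the same label; the model matches when it picked it too.
    """
    matches = eligible = 0
    for row in rows:
        peers = [row[m] for m in models if m != model]
        label, count = Counter(peers).most_common(1)[0]
        if count >= 3:  # strict majority among the four peers
            eligible += 1
            if row[model] == label:
                matches += 1
    return matches, eligible
```

The reported rate for a model is then `matches / eligible`, and the "Ineligible" column is the total claim count minus `eligible`.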
Denominator per row: claims in that domain (the Claims column).
| Domain | Claims | Any disagreement (95% CI) | Substantive, ≥2 buckets (95% CI) | No majority (95% CI) |
|---|---|---|---|---|
| Finance | 75 | 67% (55–76%) | 39% (28–50%) | 20% (13–30%) |
| General | 179 | 68% (60–74%) | 40% (33–48%) | 12% (8–17%) |
| Health | 171 | 71% (64–78%) | 29% (23–36%) | 12% (8–17%) |
| History | 131 | 53% (44–61%) | 24% (17–32%) | 13% (8–20%) |
| Legal | 48 | 77% (63–87%) | 40% (27–54%) | 19% (10–32%) |
| Politics | 168 | 70% (62–76%) | 38% (31–46%) | 8% (5–13%) |
| Science | 151 | 68% (60–75%) | 36% (29–44%) | 21% (15–28%) |
| Tech | 77 | 69% (58–78%) | 31% (22–42%) | 8% (4–16%) |
When the panel does land on a middle bucket, it almost never converges. Mostly True and Misleading majorities reach unanimity at most 5% of the time, vs 43–47% for True and False majorities.
Consistent with this, work on a different real-world corpus (17,856 PolitiFact claims with a single-family Llama-3 ablation, Schwab et al. 2025) finds nuanced labels are where fact-check verdict models concentrate their errors, a related observation from a different methodological setup (single-family ablation, not a frontier panel).
Denominator: claims with a strict ≥3/5 frontier majority on this verdict. Reveals which verdict zones the panel is most/least confident about.
| Majority verdict | Eligible n | Unanimous, 5/5 (95% CI) | Majority only, 3-4 of 5 (95% CI) |
|---|---|---|---|
| True | 438 | 47% (42–51%) | 53% (49–58%) |
| Mostly True | 76 | 0% (0–5%) | 100% (95–100%) |
| Misleading | 74 | 5% (2–13%) | 95% (87–98%) |
| False | 280 | 43% (37–49%) | 57% (51–63%) |
Viewed from the other direction: of the 328 claims where all 5 frontier models converged on the same verdict, the distribution across verdicts:
| Unanimous verdict | Claims | Share of unanimous (95% CI) |
|---|---|---|
| True | 204 | 62% (57–67%) |
| Mostly True | 0 | 0% (0–1%) |
| Misleading | 4 | 1% (0–3%) |
| False | 120 | 37% (32–42%) |
1,000 claims: the most recent real-world user submissions to a fact-checking platform that pass every eligibility filter listed under Exclusions below. None of these claims is older than February 15, 2026. Unless otherwise stated, every metric on this page uses this set as its denominator; tables that use a different denominator (e.g. claims with a strict ≥3/5 frontier majority on a verdict) state it inline.
These claims were submitted to Lenz, a fact-checking platform. We chose this corpus because it represents organic real-world fact-check requests rather than curated benchmark items. Lenz's own verdict on each claim is not used in this analysis; this paper measures frontier-model disagreement only, not Lenz vs the frontier.
The atomic_claim field in the CSV is not the user's raw submission. It's the output of Lenz's framing step, which strips emotional language and bias and distills the input into a single neutral, testable proposition anchored to the submission date. Frontier models were rated against the framed claim, not the raw text. A user who types "Canadian authorities are throwing Christians in jail for quoting the Bible!!!" is rated on the proposition "As of April 4, 2026, Canadian authorities have jailed individuals for publicly quoting the Bible because of their Christian beliefs."
The corpus excludes:
pending (not yet reviewed) or hidden: either depublished after editorial review or auto-flagged at submission time by Lenz's PII screening step (containing personal information about non-public individuals)
0.2 on OpenAI text-embedding-3-small embeddings (1536-dim) of the atomic_claim are collapsed to a single canonical row. The newer claim becomes canonical when the proposition is time-dependent; otherwise the existing claim with the most views on Lenz wins. Only canonicals appear in this corpus.
Five frontier models, chosen to cover two capability surfaces:
Each claim is presented with an "as of YYYY-MM-DD" anchor matching the submission date, asking the model to pick one of four verdicts:
Classify this claim as of <date>: "<atomic claim>"
Output exactly one label: True, Mostly True, Misleading, or False. No explanations, no qualifiers.
Verbatim prompt template version: usr_v2. No Abstain option is offered (a forced choice keeps the comparison symmetric across models). Unparseable outputs are not reclassified into a verdict bucket; claims with any parse error are excluded from the complete-claim cohort.
All five models received the same system placeholder (.) and the same user prompt template (usr_v2). No structured-output schema, tool-call schema, seed, top-p, or logit-bias controls were used. The harvester requested deterministic decoding where supported (temperature=0.0); GPT-5.4 and Claude Opus 4.7 were called without an explicit temperature because their provider adapters reject custom temperature settings. Output length was capped at 16 tokens for GPT-5.4, Claude Opus 4.7, and Sonar Pro; Gemini 3 Pro and Gemini 3 Pro + Search used a 1024-token cap (lower caps produced provider-side errors during harvester development). Gemini 3 Pro + Search enabled Google Search grounding; Sonar Pro was treated as retrieval-augmented through Perplexity's search-backed API. Parseable outputs had to equal exactly one of the four labels after normalization.
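To make the harness description concrete, here is a rough sketch of building the prompt and parsing a reply into one of the four labels. The prompt wording follows the template quoted above, though its exact line breaks are an assumption. The normalization steps (trimming whitespace, surrounding quotes, asterisks, and trailing periods, then comparing case-insensitively) are also assumptions, since the page says only that outputs had to equal one label "after normalization". Function names are hypothetical.

```python
LABELS = ("True", "Mostly True", "Misleading", "False")

def build_prompt(claim, as_of):
    """Fill the usr_v2-style template; `as_of` is a YYYY-MM-DD string."""
    return (
        f'Classify this claim as of {as_of}: "{claim}"\n'
        "Output exactly one label: True, Mostly True, Misleading, or False. "
        "No explanations, no qualifiers."
    )

def parse_label(raw):
    """Return the matching label, or None when the output is unparseable.

    Assumed normalization: strip whitespace and wrapping punctuation,
    collapse internal whitespace, compare case-insensitively.
    """
    cleaned = " ".join(raw.strip().strip(".\"'*").split()).casefold()
    for label in LABELS:
        if cleaned == label.casefold():
            return label
    return None  # never coerced into a bucket
```

Returning `None` instead of guessing mirrors the stated policy that unparseable outputs are excluded, not reclassified.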
No LLM grader. All measurements derive from direct parsed-label equality between the 5 frontier verdicts on the same claim. No reference label or ground truth is used.
Sampling frame & inferential target. The corpus is the 1,000 most recent eligible claims submitted to this single fact-checking platform (per the filters in §6), not a probability sample from any wider population, and not a complete enumeration (older eligible claims exist but are excluded by the cap). Reported Wilson 95% CIs are nominal binomial intervals under a model where each claim is an independent draw from a hypothetical stream of similar eligible submissions to this same platform under the same screening rules. They are not coverage statements about "all real-world fact-checks."
Non-iid caveat. Lenz claims are not independently and identically distributed: users cluster submissions around news events, screening selects for certain topics, and individual users often submit multiple related claims in a single session. True sampling variability under a more honest cluster model (e.g. cluster bootstrap) would likely be larger than what Wilson reports. We surface CIs as a minimum precision floor, not a guaranteed coverage interval.
Wilson 95% confidence intervals on every reported rate. We use the Wilson score interval [1] rather than the Wald (normal-approximation) interval because it has better small-N behavior and handles boundary cases (p=0/n, p=n/n) without producing degenerate zero-width intervals. It is the de-facto standard in modern ML evaluation literature. Wilson CIs appear inline next to every rate in §1, §2, §3, §4.2, §5, and the appendix; the printed bounds are exact, not centered on the raw point estimate.
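The Wilson score interval is short enough to compute directly. The sketch below uses the textbook formula with z = 1.96 for a 95% interval; it is not the page's own code, and the function name is an assumption.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)
```

For the headline rate, `wilson_interval(672, 1000)` rounds to the 64–70% reported above. At the boundary, `wilson_interval(0, 76)` gives a non-degenerate interval of roughly 0–5%, which is the behavior the text cites as the reason for preferring Wilson over Wald.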
Inter-rater reliability: Krippendorff's α (ordinal). The verdict scale (True / Mostly True / Misleading / False) is ordinal, so we score with Krippendorff's α at the ordinal level of measurement [2] rather than Fleiss' κ (which treats categories as nominal and would underestimate agreement: a True ↔ Mostly True 1-bucket disagreement is much smaller than a True ↔ False polar split, and the ordinal metric reflects that). α is reported as a single panel-level number alongside the §1 results table.
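For readers who want to recompute the statistic from the CSV, the following is a minimal sketch of ordinal Krippendorff's α built from a coincidence matrix, following the standard textbook formulation. It is not the authors' implementation and has not been checked against the reported 0.639. It assumes verdicts are already mapped to integer ranks 0 to 3; the function name is hypothetical.

```python
from collections import Counter

def krippendorff_alpha_ordinal(units, n_levels=4):
    """Ordinal Krippendorff's alpha.

    units: list of per-claim rating lists, each rating an int rank in
           0..n_levels-1 (claims with fewer than 2 ratings are skipped).
    """
    # Coincidence matrix: every ordered pair of ratings within a unit,
    # weighted by 1 / (m - 1), where m is that unit's rating count.
    o = [[0.0] * n_levels for _ in range(n_levels)]
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        counts = Counter(ratings)
        for c, nc in counts.items():
            for k, nk in counts.items():
                pairs = nc * (nc - 1) if c == k else nc * nk
                o[c][k] += pairs / (m - 1)

    marg = [sum(row) for row in o]  # n_c: how often each rank was used
    n = sum(marg)

    def delta2(c, k):
        # Ordinal distance: ranks between c and k, weighted by usage.
        lo, hi = min(c, k), max(c, k)
        return (sum(marg[lo:hi + 1]) - (marg[lo] + marg[hi]) / 2) ** 2

    levels = range(n_levels)
    d_obs = sum(o[c][k] * delta2(c, k) for c in levels for k in levels) / n
    d_exp = sum(marg[c] * marg[k] * delta2(c, k)
                for c in levels for k in levels) / (n * (n - 1))
    if d_exp == 0:
        return float("nan")  # undefined when only one rank is ever used
    return 1.0 - d_obs / d_exp
```

Perfect agreement gives α = 1, and any disagreement lowers it in proportion to how far apart the disagreeing ranks sit on the scale, which is the property the paragraph above relies on.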
No model-vs-model significance testing. We report pairwise agreement rates with 95% Wilson CIs as descriptive statistics rather than treating the page as a model leaderboard. Pairwise significance tests are sensitive to the comparison target and eligibility set: for example, peer-majority agreement is a paired claim-level outcome, but each model has a different set of claims where the other four models form a strict majority.
References.
[1] Wilson, E.B. (1927). "Probable inference, the law of succession, and statistical inference." Journal of the American Statistical Association 22, 209–212.
[2] Krippendorff, K. (2004). "Reliability in Content Analysis: Some Common Misconceptions and Recommendations." Human Communication Research 30(3), 411–433.
Full per-claim data: download CSV. One row per claim β claim ID and URL, atomic claim text, the 5 frontier verdicts, max pairwise bucket distance, domain, and creation date. Strictly rectangular, no preamble comments. The claim_url column links each row back to the original claim page on Lenz; some pages may be unavailable if the user who submitted the claim later deleted or privatized it.
PDF artifact: download PDF. Browser-independent rendering of this page for offline reading, citation, or arxiv-style preprint hosting. Hash-pinned in the snapshot manifest (pdf_sha256) so the PDF served at /v1.0/pdf is byte-identical across re-deploys.
This snapshot is v1.0, data as of May 21, 2026. The archival URL /research/llm-disagreement/v1.0 permanently serves the v1.0 snapshot β citation-stable even when the bare URL bumps to a future version.
Harvester prompt version: usr_v2. Grader: direct parsed-label equality across the 5 frontier verdicts. No LLM grader, no reference verdict.
Permanent record & citation: doi.org/10.5281/zenodo.20344847. The Zenodo deposit mirrors the PDF artifact under a permanent DOI for citation in academic and preprint contexts.
Why no Lenz-vs-frontier comparison? You're a fact-checking platform.
A meaningful accuracy comparison requires human-labeled ground truth. We're working on a follow-up study (see below) that human-labels every claim in this corpus and compares both the frontier panel and Lenz's own verdicts against those labels. Until that ships, we'd rather publish nothing about Lenz's relative accuracy than publish a comparison that can't actually answer "who's right." This paper measures only what is measurable without ground truth: how the frontier panel behaves on real-world claims.
Has anyone measured frontier-LLM disagreement before?
Yang & Wang (2026) show top frontier models disagree on 16-38% of MMLU-Pro and GPQA items even at matched aggregate accuracy, and demonstrate that switching the annotation model in downstream scientific re-analyses can flip estimated treatment-effect signs. On real-world claim verification with rigorous human annotation, the canonical reference is AVeriTeC (4,568 fact-checked claims, multi-round annotation against 50 organizations, inter-annotator κ=0.619). Larger fact-check corpora exist, for example 17,856 PolitiFact claims under a single-family Llama-3 ablation.
Why not use a standard fact-checking benchmark like AVeriTeC instead of building a corpus?
Two reasons. First, AVeriTeC, PolitiFact, and similar fact-check corpora have been publicly available for years and almost certainly appear in current frontier-model training data, so measured disagreement on them confounds true inference disagreement with memorization. Lenz's corpus is structurally fresh: real-user submissions from the past 180 days, indexed only on lenz.io, never paired with canonical verdicts in any public training set. Second, those corpora draw from a narrower distribution (political claims from US-centric fact-checkers, often pre-screened for newsworthiness) than what real users actually ask about: Lenz claims span health, science, finance, history, tech, and legal questions in the same 4-bucket rubric.
What about benchmark contamination: did the models see these claims during training?
These are recent real-user submissions, not curated benchmark items from SimpleQA, TruthfulQA, FActScore, or other public datasets. Some claims may overlap topically with material seen in training, but they aren't paired with canonical answer keys the way benchmark items are. Retrieval-enabled models can still find sources on the live web (including Lenz's own public claim pages), though this corpus isn't a controlled contamination audit.
Why these five models?
Three frontier parametric models (GPT-5.4, Claude Opus 4.7, Gemini 3 Pro) and two retrieval-augmented (Gemini 3 Pro + Search, Sonar Pro). The split is intentional: it covers both inference modes that are common in production AI systems.
Why four buckets instead of five (with Abstain)?
In preliminary harvesting we saw some frontier models routinely decline to answer harder claims while others always committed to a verdict. An Abstain bucket would have made cross-model comparison asymmetric: a model's "I won't answer" lands in a different category from another model's confident-but-wrong answer, even though both behaviors carry epistemic weight. We force the same 4-choice set on every model so the disagreement we measure is over verdicts, not over willingness-to-verdict.
Will you re-run this?
This is a frozen snapshot (v1.0, data as of May 21, 2026). The archival URL /research/llm-disagreement/v1.0 will always serve this exact version. When v2 ships (with more claims, refreshed model versions, or methodology changes), it'll appear with a clear changelog entry; v1.0 stays at its archival URL.
What's the planned follow-up?
We're working on a companion study that human-labels every claim in this corpus and uses those labels as ground truth to evaluate both the five frontier models and Lenz's own verdict. The point isn't a leaderboard. The point is to map the structure of disagreement: where do frontier panels systematically diverge from a human consensus, where does Lenz diverge from both, how each individual model and Lenz align with the same human reference, and what categories of claims drive each kind of divergence (rubric ambiguity, temporal framing, domain specialization, calibration drift). The current paper says that the frontier disagrees on real-world claims; the follow-up will say how, on the same corpus, with humans as the reference.
Only public-facing claim fields are used: the atomic claim text and the claim's creation date. No personal data. Private and staff claims are excluded per §6. Frontier models received only the claim text and the as-of date: no submitter identity, no analytics signal.
If a claim is later privatized or deleted by its submitter, we can drop it from this snapshot and from any future downloads. The CSV is generated from the snapshot at download time, so removing a claim from the snapshot removes it from the public page in a single update.
a6b78be): initial frozen snapshot. Frontier-disagreement only; no Lenz-vs-frontier comparison.
The twenty claims in this corpus with the widest spread between the highest- and lowest-bucket frontier verdicts. These are claims where the panel doesn't just disagree; it disagrees substantively, with at least one model picking a verdict ≥2 buckets away from another.
Ordered by max pairwise bucket distance (descending), no-majority cases tie-broken first, then by stable hash of the claim ID. Deterministic: the page renders the same examples on every load until the next snapshot.
On 67% of real-world user fact-checks in this corpus, the five strongest frontier LLMs disagree. Rely on any single one and you inherit that disagreement.
Snapshot v1.0 · data as of May 21, 2026 · code a6b78be. Citation-stable archive: /research/llm-disagreement/v1.0. Full per-claim CSV: data.csv. PDF: pdf. DOI: 10.5281/zenodo.20344847.