Disagreement Among Frontier LLMs on Real-World Fact-Checks

Here's the prompt they used:

  Classify this claim as of <date>: "<atomic claim>"

  Output exactly one label: True,
  Mostly True, Misleading, or False.
  No explanations, no qualifiers.

The claims look like this: https://lenz.io/research/llm-disagreement/data.csv

I put that in Datasette Lite to make it easier to explore. Here's an example of a disagreement: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...

The claim was "All almonds are grown in the U.S. state of California." All but one model said False; Opus 4.7 said "misleading".

I feel like having "mostly true" and "misleading" in there weakens the story, especially given the "no explanations" rule in the prompt.

The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".

[ Update: OK, this almond thing was a bad example and I regret picking it. Read on for better ones. ]

The prompt lacks any kind of rubric to clarify how those terms should be applied.

As is so often the case with this kind of study, it's an evaluation of the prompt and harness used by the study in addition to being an evaluation of the underlying models.

Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected."

The models were split between "true" and "mostly true". Given the "among the most" language either of those answers means effectively the same thing.

Update 2: a much better example:

"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"

The only correct answer to that, if you don't have a search tool, is "this claim is impossible for me to verify". And that wasn't an option.

The answers were split between true and false: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...

This shouldn’t be surprising. Let’s start off with the obvious. What does “real-world fact-check claims” mean? So we’re using the same list of “fact check claims” on each model. The problem is (unless I’m missing it) the authors aren’t exposing the list of 1K questions they used in the experiment. That’s a huge problem. Are the authors assuming the 1K claims they used are “provably true”? If so, that’s a huge bias, and opens up a philosophical debate about what is a fact? Or what makes something true/false?

As Marc Andreessen puts it: a particular domain is either explicitly “provable” or not “provable”. Provable domains include math, physics, chemistry, biology, engineering, even code. That may not be the whole list, but everything else is essentially “unprovable”. At least as far as a language model is concerned. They are questions that require a human value judgement. Politics are an obvious example. So back to the “1K fact check claims“. How many of these are political, or current events questions? How many are STEM questions that can be laid out in a formal proof?

Models can be trained to answer either way on claims that require a value judgement, but that’s obviously not beneficial to anyone except whoever controls the model. If the expectation is that all these frontier models should answer the same way on value judgement questions, then that’s never going to happen. What the models ARE good at though is breaking down the nuances of a topic and arguing both sides. This is how these tools should be used, as a way to analyze the claim and let us humans in the end make our own value judgement. If you’re trusting the model to make the value judgement for you and just accept it as a fact, then you are entering very dangerous territory.

"Extraterrestrial life exists somewhere in the universe."

GPT-5.4: Misleading

Opus 4.7: Misleading

Gemini 3: FALSE

Gemini 3 (Retrieval): FALSE

Sonar Pro: FALSE

It's a weird fact claim, because the ground truth is "nobody knows for sure" and that's not one of the available options.

Very interesting tool, but it's biased and not neutral from the get-go: I explicitly formulated the claim in a neutral way, but it automatically rewrote it to a western/Wikipedia POV and then immediately proceeded to verify it.

original neutral:

  US DEPT OF DEFENSE/DNAVFAC planned renovations to School #05 in Sevastopol, Crimea in 2013 before Crimea became part of Russia in 2014

automatically rewritten to biased western view:

  The United States Department of Defense, via the Naval Facilities Engineering Command (NAVFAC), planned renovations to School No. 5 in Sevastopol, Crimea in 2013, before Russia annexed Crimea in 2014.

https://lenz.io/c/73c0f16c

> These aren't benchmark items with public answer keys — they're claims real users submitted for verification to a fact-checking platform.

Cool.

I wonder if any of this matters when the authors don't disclose exactly how much of their report was written and made with LLMs in the first place? There is even a "11. Ethics & data use" section, and the research is about LLMs being fallible in some ways, yet the usage of LLMs for the production of this report isn't even mentioned once.

I think we can all agree that this experiment being flawed in multiple ways is TRUE. But I think it's a great exercise in identifying common mistakes people make when using LLMs. This would be a great interview question for a prompt engineering job.

Why did they exclude Grok? Given the published philosophical differences in how Grok is trained, it would provide an interesting data point.

You can argue all day about those differences, but missing this opportunity to observe them in an objective way is disappointing.

It's becoming increasingly clear to me that - at least right now - AI is only useful for 2 things:

1. Coding, with it being more useful the better you are at coding without AI

2. Any expert in their field asking questions about their field, who bother to fact check the output. E.g. "claude pls search these 1000 files and tell me if you find anywhere that they're discussing the settlement" and then the user checks the files/line numbers to make sure that it's correct - basically a turbocharged search that may have false negatives (content existed but I didn't find it) or false positives (content that I classified in a certain way but it was wrong). It takes an expert to tell the latter one in some cases.

They get more human by the day.

Watch the disagreements in real time via refinement pipeline on the results page pingpongit.com

I don't get why everyone is hellbent on getting LLMs to perform fact checking.

This is not the technology for it. Sure it might sorta kinda work in some circumstances. That doesn't make it a good fit.

Think of it like buying a refrigerator for storing clothes.

Don't forget, people: Goodhart's law will make this "benchmark" moot in weeks if not days. It will get integrated back into the fold, it will look "solved" but there will still be no reasoning, just more statistical technical correctness because light has been shone on a new "problem" to solve. It will then be hailed as great "progress" that will "change everything".

PS: yes, I might or might not have a degree in corporate strategy & PR.

GIGO is an acronym I learned in the 1970s. Things haven't changed much since then.

We live in an era where people have "their own truth", so why not let the AIs have theirs too?

The AI companies have editorial privilege on the content they feed their LLMs, and on the prompts that the users never see. I don't know why they feel a need to interfere when their AI produces something that's politically incorrect. Perhaps it's because they have a fundamental credibility problem with their products...

For 100% local CPU fact checking, I made this: https://news.ycombinator.com/item?id=48301003

What does this show that we didn't know already? LLMs cannot provide accurate answers to questions where data is not included in their training sets. This doesn't appear to have much substance

> No Abstain option is offered (a forced choice keeps the comparison symmetric across models).

Well that's your problem right there: They removed any confidence indicator and forced a choice.

For example:

Statement: Individuals who prefer music with less positive emotional content tend to have higher intelligence.

Gemini: That statement is supported by recent psychological research, though with some important scientific caveats regarding how strong that link actually is.

How should the agent classify this? True? Mostly true? Misleading? False?
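
To make the forced-choice complaint concrete, here is a minimal sketch of what a prompt with a rubric, an abstain label, and a confidence value could look like. The rubric wording, the "Unverifiable" label, the reply format, and the names `build_prompt` and `parse_reply` are all my own assumptions, not anything the study used.

```python
# Hypothetical rubric: the study's prompt had no rubric and no abstain option.
LABELS = {
    "True": "accurate as stated, no material caveats",
    "Mostly True": "accurate in substance, minor details off or missing",
    "Misleading": "contains accurate elements but leaves a false impression",
    "False": "the central assertion is wrong",
    "Unverifiable": "cannot be settled from available knowledge as of the date",
}


def build_prompt(claim: str, as_of: str) -> str:
    """Build a classification prompt that spells out what each label means."""
    rubric = "\n".join(f"- {label}: {meaning}" for label, meaning in LABELS.items())
    return (
        f'Classify this claim as of {as_of}: "{claim}"\n\n'
        f"Use this rubric:\n{rubric}\n\n"
        "Output exactly one label from the rubric, then a confidence from 0 to 1.\n"
        "Format: <label> | <confidence>"
    )


def parse_reply(reply: str) -> tuple[str, float]:
    """Split a '<label> | <confidence>' reply and reject unknown labels."""
    label, _, confidence = reply.partition("|")
    label = label.strip()
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label, float(confidence)
```

With something like this, the music/intelligence statement could come back as "Mostly True | 0.6" instead of being squeezed into a bare label.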

This is an odd one. The paper is real, but was written by Claude? I am assuming OP is human, but also appears to be using Claude to post.

It just shows that fact-checking is not a thing for 99% of the cases. It’s interesting to see it in LLMs, but it’s not unique to them.

The “fact checkers” pretend they are objective and authoritative, but they are not, they are just one more opinion.

For the research, the four classification options are too many, it should be true, false, and maybe “can’t be determined”.

The difference between "mostly true", "misleading", and "false" is context, and responses are specifically not allowed to include any context. Even "true" has a little context, since few things can be said to be absolutely true. "Unknown" also isn't allowed.

What's 2 + 2? The answer must be one of the colors of the rainbow.

(People can draw their own conclusions, but the only coherent reason I can think of for the design of this experiment is to generate a misleading conclusion.)

What’s the point of this if they didn’t use temperature=0 for every model (they didn’t)?

They could have redone the test against the same model and gotten different answers. It’s almost like picking 2 different coins and comparing the list of coin flip results. (I realize it’s not that straightforward, it’s not 50/50, but it’s essentially the same issue.)
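
One way to separate within-model noise from between-model disagreement would be to rerun each model several times on the same claim and measure how often it agrees with itself. A rough sketch follows; `self_agreement` is a hypothetical helper and the labels are made up, not taken from the study's data.

```python
from collections import Counter


def self_agreement(labels: list[str]) -> float:
    """Share of repeated runs that match the most common label."""
    if not labels:
        raise ValueError("need at least one run")
    _, top_count = Counter(labels).most_common(1)[0]
    return top_count / len(labels)


# Made-up labels from five hypothetical reruns of one model on one claim.
runs = ["True", "True", "Mostly True", "True", "False"]
print(self_agreement(runs))  # 0.6
```

If a model only agrees with itself 60% of the time on a claim, cross-model disagreement on that claim doesn't tell you much.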

It's a prompting issue rather than an LLM issue. The guy needs a "Prompt 101" course.

I hate to get really pedantic here, but the concept of "truth claims" plays fast and loose with the concept of knowledge in a philosophical sense. The idea of "fact checks" misunderstands how information and knowledge work together. Knowledge is about evidence, not "facts", because facts are a shorthand for a preponderance of evidence.

I feel we are doomed to debate the veracity of Wikipedia on a loop, forever, because people don't understand that Wikipedia exists as a place to find citations not as a place to find facts. Yes, those stated facts may disagree with the citations, but even if we try to fix that issue by having experts write the encyclopedia, we still suffer from the problem that the experts are often wrong.

We need a view of knowledge's relationship to LLMs that is based in Karl Popper's idea of falsifiability. We should ask LLMs for evidence of claims not for truth values. Truth values are foundational to deductive systems, where axioms define truth. In inductive systems, like the real world, the concept of black swan events means that truth values are never fixed and are always in a state of uncertainty.

I honestly think it would be helpful going forward if we add some basic philosophical education to the standard curriculum, because now that we have an artificial form of information retrieval, we need to be much, much more pedantic about how we interpret that information.

What's really weird to me is that "I don't know" is not a valid answer in this experiment, while we can all agree that the main issue with LLMs right now is that they will happily "roleplay" an answer when they have nothing in their dataset corresponding to your query.

And how many claims do human experts disagree on in the exact same setting?

I'm not being snarky here. Without something to compare to the 67% number tells us nothing. And it's known that many humans disagree with human fact checkers too (see: any election around the world.)

This is wrong on so many levels, from data through process to evaluation. How do you even prompt claude not to give you Pearson for correlating them.

One fun example: "Ruskin Bond was born on May 19, 1934, in Kasauli, Himachal Pradesh, India". Opus and Gemini believe this to be true, GPT 5.4 believes it's false, Sonar thinks it's mostly true. Disagreement value of 3, you can't disagree more than some models thinking it's true, some thinking it's false

But my impression from 2 minutes on Wikipedia is that the most likely disagreement is on the "Himachal Pradesh, India" part. The guy was born on that date, in that town. But while the town is today in the state of Himachal Pradesh in India, that was not true in 1934. When he was born, the city was in the Punjab States Agency of the British Raj.

So was he born in Himachal Pradesh, India or not? I find both True and False equally defensible here

https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...

https://en.wikipedia.org/wiki/Ruskin_Bond
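
The "disagreement value" isn't defined anywhere in this thread, but the two rows quoted here (3 for a True/False split, 1 for True vs Mostly True on the honey claim) are consistent with treating the four labels as an ordinal scale and taking max minus min. A minimal sketch under that assumption; `RANK` and `disagreement` are my names, and the authors may compute it differently.

```python
# Assumed ordinal scale: True=0 ... False=3. Not confirmed by the study.
RANK = {"true": 0, "mostly true": 1, "misleading": 2, "false": 3}


def disagreement(labels: list[str]) -> int:
    """Spread between the most and least 'true' label, ignoring case."""
    ranks = [RANK[label.strip().lower()] for label in labels]
    return max(ranks) - min(ranks)


# Labels as described above for the Ruskin Bond claim.
print(disagreement(["True", "True", "False", "Mostly True"]))  # 3
```

Under this reading, a single True next to a single False maxes out the score no matter what the other models said.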

Sounds like a lot of room for human bias.

How would it have responded to these claims in the past:

THALIDOMIDE is safe

CIGARETTES are safe

ASBESTOS is safe

MERCURY is safe

DDT is safe

LEAD in gasoline is safe

Five frontier LLMs 100% agree that the title is misleading.

Dissent and consensus among frontier models is a good thing.

Just like on a team of high performers, there are a million ways to skin a grape.

In my research, I've found that models perform better when they operate as a collective system with reputation, incentives, and accountability instead of isolated oracles answering alone.

Agreement, dissent, and correctness should all carry rewards and consequences. Just like in real life.

Collective machine intelligence, not AGI.

It's expensive, but it's also naive to believe a single model will consistently produce profoundly correct answers to profoundly novel questions.

That's better than all agreeing on the wrong answer, however.

More interesting part probably worth highlighting: The SAME model won't always return the same output when prompted with the same fact check.

You ask a human a fact check question 1000 times, they give the same answer 1000 times. You ask an LLM the same question 1000 times, your results could vary significantly.

Humans work based on metamemory (knowing what they know), while LLMs are picking from statistical probability.

"None of these claims is older than February 15, 2026"

All of the models they tested were trained on data from before February 15th ... being asked specific questions about things that happened after they were trained.

I’m no expert, but if LLMs are token prediction machines and you tell one not to build an explanation before the answer, isn’t it likely that the token prediction for the final answer will have less raw material before it to build a grounded response?

In other words: no explanation > no foundation for prediction of the answer tokens?

As an example, 2026 GPT doesn't even agree with its 2025 self. Last year I asked it to make a hardware comparison and it correctly identified the objectively better option. Recently I asked again and this time it got everything completely backwards.

No human baseline to compare it to. Without that you are missing an important check on the task being poorly constructed. More importantly there is an implied reference that's missing. The implication is that people would have done better, or that perfect agreement is possible.

Tell me about it. I spent a week going back and forth between four models (ChatGPT, Claude, Gemini, Grok) trying to enhance a PPMI algorithm. They couldn’t agree on anything. One was refuting what the other said. Eventually I decided to follow what Claude suggested because its explanations made the most sense.

Why do we want to build intelligence if it just confirms what we already think we know?

Honey does not spoil over time under normal storage conditions.,2026-02-17T04:11:51.495452+00:00,Science,True,True,True,True,Mostly True,1

If outcomes like these are collapsed to the True side, the disagreement rate will drop below the headline number.
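
A quick sketch of that collapse, using the honey row exactly as quoted above. The column positions (claim, timestamp, category, five model labels, score) are read off that one row, and putting "Misleading" on the False side is my own assumption.

```python
import csv
import io

ROW = (
    "Honey does not spoil over time under normal storage conditions.,"
    "2026-02-17T04:11:51.495452+00:00,Science,True,True,True,True,Mostly True,1"
)

# Assumed binary mapping; where "Misleading" belongs is debatable.
SIDE = {
    "true": "true-side",
    "mostly true": "true-side",
    "misleading": "false-side",
    "false": "false-side",
}


def is_split(labels: list[str], collapse: bool) -> bool:
    """True if the models disagree, optionally after collapsing to two sides."""
    keys = [label.strip().lower() for label in labels]
    if collapse:
        keys = [SIDE[key] for key in keys]
    return len(set(keys)) > 1


fields = next(csv.reader(io.StringIO(ROW)))
labels = fields[3:8]  # assumed: the five model columns
print(is_split(labels, collapse=False), is_split(labels, collapse=True))  # True False
```

Running the same check over the whole CSV would show how much of the headline number survives the collapse.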

Totally aside from disagreement between models unbiased by prior input, any such experiment may fail to capture the outcomes experienced by real users, whose prior text exchanges may substantially change the text received.

For instance see the folks who think that they have "awakened" their instance of ChatGPT.

Actual usage may diverge to a greater degree than models

between the bad methodology, bad selection of 'facts' (some are predictions, some are opinionated, etc.), and ai-written report without disclosure... i don't get why this is so high up on the front page. this is, frankly, a worthless assessment.

i classify the entire thing as "misleading"

It's right, you must be more professional than the LLM.

"Jewish people control the world and the global economy more than any other group on a per-capita basis."

Hopefully one day we will have a Chinese model capable of figuring out the answer on its own, in accordance with the CPC maxim 'seeking truth from facts'.

Inject some adversarial priming as is in actual usage, and you can probably get that number to >=95%

And they could all see exactly why if they chose to. https://huggingface.co/spaces/RiverRider/srt-introspect

Personally I find that every llm I use is unable to consistently identify the latest npm version numbers of the node packages that I use.

People keep asking "where is the psychosis?" as a reply to people on the rapidly multiplying "CEOs have AI psychosis" threads that have been popping up here and cross-pollinating in the mainstream media for the last week or two.

Here's the psychosis - these things are consistently randomly wrong depending on how the wind is blowing. People are telling you to leave them alone and let them build things, and they randomly forget that cities exist or that people died 100 years ago. Some people just don't see it as worth noting, and move on. That's crazy. These things consistently fabricate - as an inversion of this experiment, I've had different models come up with the same fabrication from similar prompts. People just call it "hallucination" and I think to them that saying that makes it cease to exist or be important - when "hallucinations" are going to be braided into every answer you get even if they're unidentifiable in the output. That's crazy.

There are plenty of other crazy aspects, such as the idea that we suddenly need infinite pieces of bespoke software when all of the bespoke software I hear about people making is mundane. 3/4 of the time somebody mentions a project they're proud that they completed with LLMs to scratch some itch they had, somebody says "you haven't heard of X? It's been around forever" about something that they could have pulled down from their package manager. Who needs a spaghetti-coded, unsupported, untested version of X built on hallucinations that you haven't discovered yet (the LLM didn't realize that deleting files to reduce the archive size was unacceptable.)

What is all of this software that people need but isn't there - where are all these unserved markets, where is all this future revenue supposed to come from? Why aren't LLMs suggesting new classes of software that would create new productivity and revenue sources? Could it be that millions of human ants over decades have mostly exhausted the space, and there isn't any easy hidden revenue?

A common wisdom is that we had been vastly overhiring programmers during ZIRP, who in their idleness degraded user experiences and overcomplicated things, with management resorting to more and more sleazy and gamey means of margin extraction from more and more degraded services. We had an excess of labor, fueled by factors other than productivity, in fact being pissed away at companies that drove nose-first into the ground. What is throwing a trillion dollars of servers at that supposed to do? Is that not AI psychosis?

LLMs will be great politicians one day

Simple: If it claims to be a fact check it's just propaganda.

I like ChatGPT a lot but it is always trying to debate and disagree when you ask it simple non-controversial questions. Trying to turn everything into a debate session instead of just answering the question.

One of the claims it asks LLMs to grade is "Artificial intelligence will cause widespread job loss among software engineers."

Yea man this benchmark is really really bad.
