Quick context on what's in the writeup and what isn't:
- What's measured: parsed-label agreement between the 5 models. Forced 4-choice (True / Mostly True / Misleading / False), no Abstain. No LLM grader, no reference verdict — every number is direct label equality.
- What's not measured: which model is right. There's no ground truth in this paper. The 67% figure is a floor on rubric inconsistency (at least one model is label-inconsistent under the 4-bucket rubric on 67% of claims), not "model X is factually wrong on claim Y."
- Why not AVeriTeC / PolitiFact / SimpleQA: those have been public for years and almost certainly appear in current frontier training data, so measured disagreement on them confounds inference with memorization. This corpus is structurally fresh — recent user submissions, 180-day window, near-duplicates collapsed, never paired with canonical verdicts in any public training set.
- Our own platform's verdict is deliberately NOT used in this analysis. The paper measures frontier-panel disagreement only, not Lenz-vs-frontier.
- Follow-up in progress: human-labeling every claim in this corpus so we can evaluate both the panel and our own platform verdict against a human reference.
Critiques I'd most like to hear: (a) the iid CI assumption (Lenz claims cluster around topics and news events, so Wilson is probably optimistic), (b) ordinal-α vs alternatives for a 4-class ordered scale, (c) forced-choice vs allowing Abstain.
Permanent archive: https://doi.org/10.5281/zenodo.20344847
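To make critique (a) concrete, here is a minimal sketch of how much a Wilson interval on the 67% figure could widen under clustering. The cluster size of 5 and intra-cluster correlation of 0.2 are purely hypothetical numbers chosen for illustration, and plugging an effective sample size into the Wilson formula is a common approximation, not the paper's method.

```python
import math


def wilson_interval(p_hat: float, n: float, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion, assuming n independent trials."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Headline figure from the writeup: 67% of 1,000 claims.
iid_low, iid_high = wilson_interval(0.67, 1000)

# ASSUMPTION (hypothetical values): claims arrive in topical clusters of
# average size 5 with intra-cluster correlation 0.2.
design_effect = 1 + (5 - 1) * 0.2
n_effective = 1000 / design_effect
clustered_low, clustered_high = wilson_interval(0.67, n_effective)

print(f"iid:       [{iid_low:.3f}, {iid_high:.3f}]")
print(f"clustered: [{clustered_low:.3f}, {clustered_high:.3f}]")
```

The point is only the direction of the effect: any positive within-cluster correlation shrinks the effective sample size, so the iid interval is the narrow end of what is defensible.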
I understand why you prompted them to output exactly one label, but I'd bet that if you'd asked a parametric model, or a parametric "thinking" model, to answer e.g. "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia." [1] many would say something to the effect of "May 18 is after my knowledge cutoff, so I don't know. But based on the state of the war, the distance from Moscow to Ukraine, and drone range the best option might be...[TRUE]"
Would also be interesting to add a virtual model that is simply the majority of all models and see how much the individual models differ from the "consensus".
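A rough sketch of that "virtual majority model", assuming the per-claim labels are available as one dict per claim. The model names and toy rows are made up, and ties are simply skipped, which is one choice among several. Note this version lets each model vote in the consensus it is compared against, so it flatters every model a little compared with a peer-only majority.

```python
from collections import Counter


def majority_label(labels):
    """Return the strict plurality label, or None when the top two tie."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def agreement_with_consensus(rows):
    """Share of non-tied claims on which each model matches the majority."""
    hits = Counter()
    scored = 0
    for row in rows:
        consensus = majority_label(list(row.values()))
        if consensus is None:
            continue  # no majority to compare against
        scored += 1
        for model, label in row.items():
            hits[model] += int(label == consensus)
    if scored == 0:
        return {}
    return {model: hits[model] / scored for model in rows[0]}


# Hypothetical toy data: three models, three claims.
toy_rows = [
    {"model_a": "True", "model_b": "True", "model_c": "False"},
    {"model_a": "False", "model_b": "Misleading", "model_c": "False"},
    {"model_a": "True", "model_b": "False", "model_c": "Misleading"},  # three-way tie
]
print(agreement_with_consensus(toy_rows))
```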
Do you plan to add sources to the related work section giving baseline numbers for human expert disagreement in fact-checking tasks? (I'm assuming such studies exist.)
Section "4.2 Agreement w/ peer majority" shows the level of agreement of each model with the majority.
Yes, we plan to human-label the same corpus of 1,000 claims and publish a second study measuring the models' performance against the human labels, on a corpus that the models have not seen during training.
This is doubly problematic because you evaluated earlier models like Gemini Pro 3 instead of 3.1, GPT 5.4 instead of 5.5, etc...
Given that it's only a thousand short questions, you should be able to re-run your test in about an hour with the latest models, so... why haven't you?
Similarly, LLM output is non-deterministic, so you could get more interesting stats from your data set by repeating each question 'n' times for each model.
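One simple statistic that repeated sampling would enable is per-claim self-consistency: how often a model's runs land on its own most frequent label. A minimal sketch, assuming the n labels for one model and one claim are already collected in a list (the sample values below are invented):

```python
from collections import Counter


def self_consistency(samples):
    """Return (modal label, share of runs that produced it) for one claim."""
    if not samples:
        raise ValueError("need at least one sampled label")
    modal_label, count = Counter(samples).most_common(1)[0]
    return modal_label, count / len(samples)


# Hypothetical: the same claim asked four times of the same model.
runs = ["True", "True", "Mostly True", "True"]
print(self_consistency(runs))
```

Averaging that share over all claims would give a within-model noise floor to set against the between-model disagreement.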
The fact that HN decided to downvote the author of the study shows how these people can't stay classy, and the mods staying silent just shows what this is all about.
some of the claims where llms disagree:
"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia."
"The slogan "Simon Go Back" was chanted in opposition to the Simon Commission in British India (1928–1930)."
"Neptune Deep will start delivering natural gas in 2027."
"A hotel villa in Kyrgyzstan displayed a sign stating 'no Jews, no dogs'."
"Donald Trump said that an attack on Iran was postponed at the request of Gulf allies."
This is a "forward-looking statement", and presents special problems because you cannot really evaluate it until that date. You can only assign "likely or unlikely".
"Russia, Ukraine, and multiple international news agencies reported that Ukrainian drones targeted Moscow on or around May 18, 2026."
There are rarely pure first-order "facts" in the mathematical sense. There are evidence-backed claims with confidence levels. That does not make it "just a litmus test". It makes it a probabilistic factual claim with varying confidence levels - and this one happens to be verified and unambiguous.
...son of a bitch
It said the airport code didn't exist
I mean, I get the "knowledge cut off date" and whatnot, but for that sort of thing, you'd think they'd check live information before gaslighting the user, especially since it's a "live" task anyway.
Just like on a team of high performers, there are a million ways to skin a grape.
In my research, I've found that models perform better when they operate as a collective system with reputation, incentives, and accountability instead of isolated oracles answering alone.
Agreement, dissent, and correctness should all carry rewards and consequences. Just like in real life.
Collective machine intelligence, not AGI.
It's expensive, but it's also naive to believe a single model will consistently produce profoundly correct answers to profoundly novel questions.
You ask a human a fact-check question 1000 times, they give the same answer 1000 times. You ask an LLM the same question 1000 times, your results could vary significantly.
Humans work based on the Metamemory (knowing what they know), while LLMs are picking from statistical probability.
I have labeled datasets with a human team and shown the same task to the same user on a different day, and they answered differently. Of course, they are usually consistent with themselves most of the time but not always.
Gemini's answer was very opinionated and factually correct, whereas Claude gave a more nuanced answer, which was also very good.
My most common chatbot prompt is "X that you mentioned above doesn't seem to actually exist."
In other words: no explanation > no foundation for prediction of the answer tokens?
All of the models they tested were trained on data from before February 15th ... being asked specific questions about things that happened after they were trained.
If outcomes like these are collapsed onto the True side, the disagreement will drop below the headline number.
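A quick sketch of what that collapse looks like in practice. Which side "Misleading" belongs on is an assumption here (it is treated as false-side), and the toy rows are invented, so the numbers say nothing about the real corpus.

```python
# ASSUMPTION: "Mostly True" folds into the true side, "Misleading" into the false side.
COLLAPSE = {
    "True": "true-side",
    "Mostly True": "true-side",
    "Misleading": "false-side",
    "False": "false-side",
}


def disagreement_rate(rows, mapping=None):
    """Share of claims where the models did not all give the same label."""
    def normalise(label):
        return mapping[label] if mapping else label

    split = sum(1 for labels in rows if len({normalise(l) for l in labels}) > 1)
    return split / len(rows)


# Hypothetical toy data: one list of model labels per claim.
toy_claims = [
    ["True", "Mostly True", "True"],
    ["False", "Misleading", "False"],
    ["True", "False", "True"],
    ["True", "True", "True"],
]
print(disagreement_rate(toy_claims))            # four-bucket labels
print(disagreement_rate(toy_claims, COLLAPSE))  # collapsed to two sides
```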
I classify the entire thing as "misleading"
Here's the psychosis - these things are consistently randomly wrong depending on how the wind is blowing. People are telling you to leave them alone and let them build things, and they randomly forget that cities exist or that people died 100 years ago. Some people just don't see it as worth noting, and move on. That's crazy. These things consistently fabricate - as an inversion of this experiment, I've had different models come up with the same fabrication from similar prompts. People just call it "hallucination" and I think to them that saying that makes it cease to exist or be important - when "hallucinations" are going to be braided into every answer you get even if they're unidentifiable in the output. That's crazy.
There are plenty of other crazy aspects, such as the idea that we suddenly need infinite pieces of bespoke software when all of the bespoke software I hear about people making is mundane. 3/4 of the time somebody mentions a project they're proud that they completed with LLMs to scratch some itch they had, somebody says "you haven't heard of X? It's been around forever" about something that they could have pulled down from their package manager. Who needs a spaghetti-coded, unsupported, untested version of X built on hallucinations that you haven't discovered yet (the LLM didn't realize that deleting files to reduce the archive size was unacceptable.)
What is all of this software that people need but isn't there - where are all these unserved markets, where is all this future revenue supposed to come from? Why aren't LLMs suggesting new classes of software that would create new productivity and revenue sources? Could it be that millions of human ants over decades have mostly exhausted the space, and there isn't any easy hidden revenue?
A common wisdom is that we had been vastly overhiring programmers during ZIRP, who in their idleness degraded user experiences and overcomplicated things, with management resorting to more and more sleazy and gamey means of margin extraction from more and more degraded services. We had an excess of labor, fueled by factors other than productivity, in fact being pissed away at companies that drove nose-first into the ground. What is throwing a trillion dollars of servers at that supposed to do? Is that not AI psychosis?
The output buckets are also pretty questionable- the difference between "True" and "Mostly true" is pretty fuzzy. Is this marked as a "disagreement"?
Yea man this benchmark is really really bad.
Take just one random example: `Hostels in Kota, Rajasthan commonly use caged ceiling fans as a preventive measure against student suicides`
While `Hostels in Kota, Rajasthan commonly use caged ceiling fans` may be a verifiable fact (though I doubt there are any statistics for verification, but let's say there are), `a preventive measure against student suicides` is a claim that no one can prove. It can be a belief at most.
Argh. Did Biden steal Trump's 2nd term? Truth or fact or claim?
Classify this claim as of <date>: "<atomic claim>"
Output exactly one label: True, Mostly True, Misleading, or False.
No explanations, no qualifiers.
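For anyone wanting to rerun this, here is a minimal sketch of filling that template and parsing the reply down to one of the four labels. The helper names and the normalisation rules (trimming whitespace, quotes and a trailing period, ignoring case) are my own assumptions, not the study's actual harness.

```python
LABELS = ("True", "Mostly True", "Misleading", "False")

PROMPT_TEMPLATE = (
    'Classify this claim as of {date}: "{claim}"\n'
    "Output exactly one label: True, Mostly True, Misleading, or False.\n"
    "No explanations, no qualifiers."
)


def build_prompt(claim: str, date: str) -> str:
    """Fill the template quoted above with one atomic claim and a date."""
    return PROMPT_TEMPLATE.format(date=date, claim=claim)


def parse_label(raw: str):
    """Map a raw reply to a canonical label, or None if it is not a bare label."""
    cleaned = raw.strip().strip(".\"'").casefold()
    for label in LABELS:
        if cleaned == label.casefold():
            return label
    return None  # refusals, explanations, or anything else off-rubric


print(build_prompt("All almonds are grown in the U.S. state of California.", "2026-05-18"))
print(parse_label(" FALSE."))
```

How strict the parser is matters: a lenient one turns hedged replies into labels, a strict one turns them into missing data.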
The claims look like this: https://lenz.io/research/llm-disagreement/data.csv
I put that in Datasette Lite to make it easier to explore. Here's an example of a disagreement: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
The claim was "All almonds are grown in the U.S. state of California.". All but one model said False, Opus 4.7 said "misleading".
I feel like having "mostly true" and "misleading" in there weakens the story, especially given the "no explanations" rule in the prompt.
The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
[ Update: OK, this almond thing was a bad example and I regret picking it. Read on for better ones. ]
The prompt lacks any kind of rubric to clarify how those terms should be applied.
As is so often the case with this kind of study, it's an evaluation of the prompt and harness used by the study in addition to being an evaluation of the underlying models.
Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected."
The models were split between "true" and "mostly true". Given the "among the most" language either of those answers means effectively the same thing.
Update 2: a much better example:
"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"
The only correct answer to that, if you don't have a search tool, is "this claim is impossible for me to verify". And that wasn't an option.
The answers were split between true and false: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?
How much can something be wrong before it goes from "mostly true" to "false" (objectively, both have some part of the fact that is not true)?
This is at least partly testing the model's definition of "mostly" and "misleading". Not its understanding of the fact. Claiming that this means the models have fundamental disagreement on the facts themselves is an overreach.
I suspect the intention was "Factually true, and no gotchas exist", "technically not true, but so close to the truth that the difference doesn't matter", "technically true, but there are major gotchas" and "factually false and not even close". But that's not what they specified
> Which category should something go in if it's "mostly false"?
For some reason they have chosen to call that "Misleading" rather than a more symmetrical "Mostly False", but the intent seems clear enough.
Less important than the harness, is the system/user prompts themselves (which of course, are put in the harness), which is effectively what this study seems to be testing. With a better prompt, I'm sure the models would look more the same to each other, as the biggest/best models have more or less identical strong prompt-adherence in my experience.
Disagree. The definition of misleading is a true fact that is presented in a way to lead you to a false conclusion.
Example: "Most good engineers are male". It is true as a consequence of most engineers being male in general, but it leads the reader to a potential false implication that an average man is better than an average woman.
This does not invalidate your point though. Things can be true and misleading.
Sure they can. It might be a true fact that "100% of the murders committed in <town> over the last 25 years were committed by <some racial group>!" but actually it's a town of 750 people and there was only one murder during that time frame.
You may give them better instructions, but they should already have the intellect to understand the assignment.
Right, right?
The thing you find when you actually wire up a rigorous eval is that with tool calls like web search you are wide open to infra issues, flakes, and all sorts of non-determinism.
They really should be breaking out the numbers for the 3 without search (kinda meaningless for recent factual claims after knowledge cutoff) vs search agents. Lack of a “I don’t know” option completely invalidates results for the non-search models; they are basically guessing what seems like a probable answer, since they don’t know and aren’t allowed to say that.
I do agree the forced choice and “weak / strong” variants inflate the headline stat. To make that distinction you need a much more rigorous prompt, likely including ICL examples to illustrate what you mean by “mostly” instead of leaving this to the model to define.
I’ve experimented with AI grading for undergraduate math courses, and see basically the same thing. If you just tell the AI “grade this problem and assign a letter grade” then I’ve only seen about 30% agreement between a human assigned grade and the AI assigned grade. But over 75% agreement if you say a “match” is within one letter grade. And to get better agreement you have to spend a lot more time on the rubric- what kinds of mistakes are a big deal, what kinds of mistakes are not a big deal, how much work is required to be shown to get credit, a couple examples of each letter grade. Once you have done that, the AI gets a lot better agreement with human graders, but it is hard to know when you’ve given enough guidance for a problem.
model               total_claims  hedged_count  hedged_pct
claude-opus-4-7     1000          451           45.1
sonar-pro           1000          391           39.1
gpt-5.4             1000          277           27.7
gemini-3-retrieval  1000          129           12.9
gemini-3-pro        1000          60            6.0
Datasette query here: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
The "majority" in this case meaning about 51%, according to Wikipedia[1]? How could 51% ever be considered to be close to "all", such that "misleading" would be a valid answer?
Am I missing something?
https://en.wikipedia.org/wiki/Majority has a bunch of variations and contexts listed, where it might differ what "Majority" is actually referencing.
The statistic is about commercial production, not the number of almonds grown.
Looks safe to say that not even a majority of almonds are grown in California.
> California produces 80% of the world's almonds and 100% of the United States commercial supply
But regardless of which number we use, California represents a large portion of US almond production, so much so that misleading could be an acceptable answer if the LLM interpreted the prompt as an exaggeration. I think the example was apt
This test is of only marginal utility in the real world compared to an AI with access to the web. While I wouldn't expect an AI with access to the web to result in Platonic Truth any more than it would in the hand of a human, it would probably get a lot closer to something humanlike.
I recall about a year ago how we were discussing basically turning web search into LLM queries, and I remember never being clear whether people meant simply directly querying AIs or turning them loose on the web. The former is what this is testing and is fairly transparently stupid, just by an information theoretic argument that the AIs simply can't contain all the answers to every query in them, they're just not large enough (and really can't be, practically). I've had good results with the latter, when using dedicated AI resources that I'm paying for (not the stuff coming out of the search engines right now, which I find are often quite terrible). Even non-frontier models can do OK when they've got good results sitting right there to look at. Again, the standard I'm applying here isn't that they yield Absolute Truth, but just that when I follow the links back, they basically say what the AI said they did and the summary is reasonable. I wouldn't expect a human to do better in a casual overview, not that the result is perfect.
> when using dedicated AI resources that I'm paying for
Are there API-based search providers that structure their results differently?
> “Artificial intelligence will cause widespread job loss among software engineers.”
https://lenz.io/c/ai-software-engineers-job-loss-impact-05e4...
this is a statement about the future. who knows? dataset also includes
> Robots will not replace human teachers in schools in the near future.
or
> Papua New Guinea has very few female members of parliament.
what counts as very few?
> “Taurine supplementation supports mood and emotional health in humans.”
why is this labeled as misleading? i'm not even sure when I'm supposed to use the misleading label
> Anaximander was the first scientist in recorded history.
this is a judgement call as the term scientist didn't exist.
the claims that feel actually solidly answerable seem to have much better LLM performance
You can only say True, False, Mostly True or Misleading.
(And you're not allowed to search for information.)
Knowing something is different to reading about something, or hearing something from someone. And yet this is often confused as knowledge. In this way are we all that different from AI - we have some data and we regurgitate it as knowledge. Bad data, wrong answer. Except humans can also throw in some emotion to really muddle things up. :)
That's exactly the stupidity of the public discourse these days. People feel compelled to take a clear position although there is much more subtlety in many issues. It's not ok to say "I don't know", "it depends" or "as far I know". And then people feel they need to defend this position no matter what new information comes up.
Note: It may still not be perfectly accurate representation of truth as it uses user submitted data. I also used AI to build the sheet.
https://docs.google.com/spreadsheets/d/e/2PACX-1vSnZlURmyYX3...
I actually don't know which way you came down on that one?
I think strictly it's false but "mostly true" would be justifiable? (as in, to say it's false would be misleading if it led the reader to assume there was no attack around that time).
https://www.washingtonpost.com/world/2026/05/17/ukrainian-dr...
It seems it happened Saturday 16th overnight into the 17th, not the 18th. I see this a LOT with fact checking. It shouldn't be this way, but political bias seems to nudge people into making calls land one way or the other with selective application of pedantry.
> The models were split between "true" and "mostly true". Given the "among the most" language either of those answers means effectively the same thing.
So the models were right? The actual criterion should be whether "Incomplete Egypt visa application forms" are indeed "among the most common reasons" or not.
That "true" and "mostly true" means effectively the same thing is irrelevant. It could just as well trip me up, and I'm a human. If somebody told me either answer, I'd still consider them right if the basic fact was right.
The study is about whether they said the same phrase which is a much weaker claim than people in the comments are reacting to.
Reminds me of this professor I had who thought it was epic to always respond to our questions with "it depends" before hashing out two very different but technically correct answers. It was obnoxious and he saw it as his tag line, but he had a point about nuance.
So the examples are good, I think. The rest is philosophy.
The links you posted only show a frozen loading spinner for me (iOS Safari).
(I looked at the csv in Numbers instead)
> 7.1 Model selection
> Five frontier models, chosen to cover two capability surfaces:
> Parametric (training-only): GPT-5.4 (OpenAI), Claude Opus 4.7 (Anthropic), Gemini 3 Pro (Google)
> Retrieval-augmented: Gemini 3 Pro + Search (Google), Sonar Pro (Perplexity)
I expect the models are inferring quite a bit from the short prompt, and with structured outputs it would be quite easy to have them give the one word response in one field and explain why in another
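A sketch of what that two-field response could look like on the receiving side, using only the standard library. The field names `label` and `rationale` are hypothetical, and this only shows validating the JSON a model returns, not any particular provider's structured-output API.

```python
import json
from dataclasses import dataclass

ALLOWED_LABELS = {"True", "Mostly True", "Misleading", "False"}


@dataclass(frozen=True)
class Verdict:
    label: str      # the one-word answer used for agreement statistics
    rationale: str  # free-text explanation, kept out of the label comparison


def parse_verdict(raw_json: str) -> Verdict:
    """Validate a JSON reply with hypothetical 'label' and 'rationale' fields."""
    data = json.loads(raw_json)
    label = data["label"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return Verdict(label=label, rationale=str(data.get("rationale", "")))


reply = '{"label": "Misleading", "rationale": "Most, but not all, almonds."}'
print(parse_verdict(reply))
```

Keeping the rationale in its own field means the label statistics stay comparable while still letting a reader see why two models split.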
GPT-5.4: Misleading
Opus 4.7: Misleading
Gemini 3: FALSE
Gemini 3 (Retrieval): FALSE
Sonar Pro: FALSE
It's a weird fact claim, because the ground truth is "nobody knows for sure" and that's not one of the available options.
Cool.
I wonder if any of this matters when the authors don't disclose exactly how much of their report was written and made with LLMs in the first place? There is even an "11. Ethics & data use" section, and the research is about LLMs being infallible in some ways, yet the usage of LLMs for the production of this report isn't even mentioned once.
This is becoming the classic way of admitting an LLM wrote it.
Leaving that out of the report validated the complaint above.
The “fact checkers” pretend they are objective and authoritative, but they are not, they are just one more opinion.
For the research, the four classification options are too many, it should be true, false, and maybe “can’t be determined”.
You can argue all day about those differences, but missing this opportunity to observe them in an objective way is disappointing.
Grok is trained to have a bias, which a lot of people like, but it’s not meant to be accurate.
Oh, and the others aren't? You can't really be that naive, right?
I feel we are doomed to debate the veracity of Wikipedia on a loop, forever, because people don't understand that Wikipedia exists as a place to find citations not as a place to find facts. Yes, those stated facts may disagree with the citations, but even if we try to fix that issue by having experts write the encyclopedia, we still suffer from the problem that the experts are often wrong.
We need a view of knowledge's relationship to LLMs that is based in Karl Popper's idea of falsifiablity. We should ask LLMs for evidence of claims not for truth values. Truth values are foundational to deductive systems, where axioms define truth. In inductive systems, like the real world, the concept of black swan events means that truth values are never fixed and are always in a state of uncertainty.
I honestly think it would be helpful going forward if we add some basic philosophical education to the standard curriculum, because now that we have an artificial form of information retrieval, we need to be much, much more pedantic about how we interpret that information.
This brings up a very valid point, though. So many _humans_ can't agree on what the facts are these days. It seems to be getting worse. Not sure of the solution.
Ask ten people what "knowledge" is, and they'll come up with ten different answers. Go back 10, 50 or 100 years and humanity struggled with exactly the same issue, and has for a long time. There is even an entire field of study literally just for trying to figure out what "knowledge" is: https://en.wikipedia.org/wiki/Epistemology
What's 2 + 2? The answer must be one of the colors of the rainbow.
(People can draw their own conclusions, but the only coherent reason I can think of for the design of this experiment is to generate a misleading conclusion.)
PS: yes, I might or might not have a degree in corporate strategy & PR.
This is not the technology for it. Sure it might sorta kinda work in some circumstances. That doesn't make it a good fit.
Think of it like buying a refrigerator for storing clothes.
There's an interesting tradeoff here: a year or two ago maybe it got facts right 50% of the time. Everyone knew not to rely on it.
Now, suppose we are 90% of the way there, only technically proficient people would know not to trust it. (like not adding Internet Explorer toolbars! Or remembering to use ad blockers..)
Now, suppose we spend a lot of effort for a few more years getting it 99% of the way there, trusting it would be somewhat natural by then. And then for the important 1% of the situations, it would stand to cause real harm. 1% seems low, but for a million invocations, you'd have 10000 mistakes.
AI is pretty useful for a great many things, but to really attract more and more investment the current technique seems to be convincing people that AI is useful for everything.
How would it have responded to these claims in the past:
THALIDOMIDE is safe
CIGARETTES are safe
ASBESTOS is safe
MERCURY is safe
DDT is safe
LEAD in gasoline is safe
Questions like "is mouthwash effective" presumably have one solid data source -- medical journals.
This is worse.
Hopefully one day we will have a Chinese model capable of figuring out the answer on its own, in accordance with the CPC maxim 'seeking truth from facts'.
Well that's your problem right there: They removed any confidence indicator and forced a choice.
For example:
Statement: Individuals who prefer music with less positive emotional content tend to have higher intelligence.
Gemini: That statement is supported by recent psychological research, though with some important scientific caveats regarding how strong that link actually is.
How should the agent classify this? True? Mostly true? Misleading? False?
I'm not being snarky here. Without something to compare to the 67% number tells us nothing. And it's known that many humans disagree with human fact checkers too (see: any election around the world.)
But my impression from 2 minutes on Wikipedia is that the most likely disagreement is on the "Himachal Pradesh, India" part. The guy was born on that date, in that town. But while the town is today in the state of Himachal Pradesh in India, that was not true in 1934. When he was born, the city was in the Punjab States Agency of the British Raj.
So was he born in Himachal Pradesh, India or not? I find both True and False equally defensible here
https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
It seems to me that for many newspapers the bar is now significantly lower, at something like "not quite entirely untrue"
https://docs.google.com/spreadsheets/d/e/2PACX-1vSPLSv1P8Tqm...
https://docs.google.com/spreadsheets/d/e/2PACX-1vSnZlURmyYX3...
But we all know from our own daily experiments that models lie, models disagree, models make up stuff, models say one thing on one day and the opposite on the next.
The figures in this study are quite conservative. And the lying gets worse because everyone is saving tokens and giving cached answers right now.
LLMs are a failure, and you'll be remembered for promoting hot air and the destruction of a perfectly good profession.
I would answer “don’t know” on many, but that’s not an option.
This isn't misleading, it's flat out false. Characterizing misleading as also acceptable isn't valid here. If you go and ask anyone on the street if this is true, false or misleading, I'm sure almost everyone would say it's false. After all, I can grow almonds myself.
Models often have a reasoning/thinking/research mode that is triggered by asking slightly differently.
Still though, Gemini can be a little weak on this front by default but can be aligned to behave better.
If LLMs are really supposed to be as consistently useful as they’re made out to be they should all spit out “false.”
I don’t understand your point. That claim is factually false and as such it’s easy to logically reply “false”. What’s the nuance here? I can’t see any
Depending on the question, True or False can be objectively right/wrong. Misleading is going to be a judgement call.
This is the inherent problem with "fact checking." It's hard to be completely objective. Even when the question has an objective answer, simply choosing where to look and what facts to verify is itself a bias. Looking at this instead of that, or looking at this but not also this other thing that adds context, etc.
Frankly I think disagreeing often is the expected outcome. Fact checking is just kinda bullshit. It's spin dressed up as objectivity. I hope people remember that "fact checking" is a relatively modern thing.
If you argue this, you would be arguing against reality and the English language so as to not upset AI. It's important to understand that AI is very much fallible.
As a well known commentator on all things LLM...Will you publicly commit here, to try to reproduce the study, and make a post on how your percentages might differ or agree?
My comment here was meant to save people time in understanding the study. I was entirely open about what I did, and provided tools to help other people come to their own conclusions.
I don't think I need to spend more time on this than I have.
The article might be a bit sensationalistic, rigour could be better and the data might have flukes... But your comment is overcorrecting and nitpicking framed as analysis.
I get the same feeling in several of your posts recently.
Same with persisting to showcase the pelican-on-a-bicycle as a useful sample when it's obviously trained on and for, for those very posts. It stopped being cute last year.
Are you being paid or do you have shares? You'd get the attention whichever angle you put here. These corporates don't need you defending them. Humanity might need you however.
My disclosures for my blog are here: https://simonwillison.net/about/#disclosures
It's even weirder to suggest that the disagreement is indicative of a problem. If you asked five very knowledgeable humans on this subject to select the correct answer on a multiple-choice questionnaire, they would almost certainly vary significantly more than these 5 LLMs.
Not to say that hallucination isn't a problem, but this is a lousy way to test it.
These types of experiments prove to me that there is no real "reasoning" happening and "reasoning/thinking" tokens as a concept are mostly there to convince people to use models that consume more tokens and produce more revenue. The output from reasoning models might be more accurate, but it's just a consequence of a longer inference runtime, there is no "reasoning" happening, reasoning is just sales/UX bullsh*t.
But "unknown or undecidable" should have been a category.
The space station, the Artemis capsule, microbes on interplanetary probes, etc.
It could technically be said in a sentence and be true, but it would be misleading to most people.
I think you could come up with a reasonable argument for any of the responses, hence the problem with the methodology.
I mean look at the other responses here from the HN commenters. There's lots of nuance in there.
Then again maybe that’s why I’m an atheist, not an agnostic?
Both statements would have to be interpreted as "false" under your criteria, as neither has any evidence to substantiate it. That leads us to a logical contradiction in which a proposition and its inverse are both regarded as false.
If the statement is being interpreted as "it has been proven that extraterrestrial life exists somewhere in the universe", then it's acceptable to say this statement is false, but making evaluations that depend on an implicit qualifier isn't usually a good approach.
A proposition and its logical inverse can both be unknown, and in fact, a proposition being unknown implies that its logical inverse must also be unknown.
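A tiny sketch of that point using three-valued (Kleene-style) negation, with Python's `None` standing in for "unknown". The `forced_binary` helper is a hypothetical stand-in for a rubric with no "unknown" bucket.

```python
from typing import Optional


def kleene_not(p: Optional[bool]) -> Optional[bool]:
    """Three-valued negation: unknown stays unknown."""
    return None if p is None else not p


def forced_binary(p: Optional[bool]) -> bool:
    """Hypothetical forced choice: anything not known to be true is called false."""
    return bool(p)


claim = None  # truth value unknown, e.g. "extraterrestrial life exists"
print(kleene_not(claim))                 # None: the inverse is also unknown
print(forced_binary(claim))              # False
print(forced_binary(kleene_not(claim)))  # False: claim and inverse both "false"
```

With an explicit unknown value the negation stays consistent; it is only the forced two-way choice that marks a proposition and its inverse both false.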
do you not see how that creates extremely misleading and valueless results? you are coercing the results into what you want to see.
Although they inherit the messiness of the real world, the majority of these claims are objective enough to be classifiable by human experts with access to research. We plan to human-label the 1,000 claims and publish follow-up research. We will also consider adding an "I don't know" bucket, as well as clear instructions about the meaning of each of the 4 buckets.
Real-world systems need to be able to say "I don't know." This is a test about misinformation after all, and overconfident responses contribute to that.
Teasing out the difference between "avoid" and "unknown" could be a different research question
Merriam-Webster defines "mislead" as follows:
1. (transitive verb) to lead in a wrong direction or into a mistaken action or belief often by deliberate deceit
2. (intransitive verb) to lead astray; give a wrong impression
Presenting a "true fact" is optional when misleading someone.
Newtonian physics is false, but it works well enough that we teach it in college. But our best models of physics currently disagree with each other, so can we even say they are true? Given the replication crisis, especially in the social sciences, how many peer-reviewed findings can be called true? Even experimental results can be false. Consider the studies that found FTL neutrinos: the result was rejected as an experimental error, and the error was eventually confirmed, but that took quite a lot of work. In a softer field than physics, with a claim less absurd than FTL, it would likely have long been accepted as a true finding.
Even in math, basic statements aren't really true or false, but more a question of "given these axioms, can we prove or disprove it" noting that we have different systems with different axioms. If we are talking basic sets, most people are using naive set theory which is inherently contradictory, which means that notions like true or false probably can't be considered well defined.
I think that's _you_ turning the statement into something much broader than intended. The claim is about engineers and you're jumping from "men are better than women in engineering" to "men are better overall."
To give a related example, "Most good NBA players are black." I don't think anyone would bother trying to couch this in a bunch of "well, for all we know that's just a function of more NBA players being black than white" arguments, nor would anyone be led to think "the average black man is better than the average white man" as a result of that statement. I _do_ agree however that there are some people who see rather narrowly-defined statements and turn them into something they're not...
My point is that it is possible for a reader to turn it that way, for a variety of reasons (lack of understanding of statistics, preexisting biases, or whatever). And that getting a reader to mistakenly generalize is the purpose of a misleading statement.
To mislead is to direct someone into a falsehood by implication even though the literally expressed facts are all true. I'd say the writer's bad intentions are necessary to qualify something as misleading, for the same reason that not all false statements are lies: to be a lie, the speaker must know the statement is false and still use it. There are probably much better examples than the one I came up with on the fly, though.
Classify this claim: "Most good engineers are male."
Misleading
Classify this claim: "Most bad engineers are male."
Misleading
And not particularly racially sensitive:
Classify this claim: "Most good NBA players are black."
True
Classify this claim: "Most good NHL players are white."
True
It explained it is more confident when assessing the small, highly quantifiable population of sports professionals vs a very large, diverse population of "engineers".
I don't think there is anything wrong with the results of this test.
It would be more interesting if we compared them to human results.
If you have trouble distinguishing between human and LLM results, that's interesting.
Also, sentient is irrelevant to this test.
Only if you listen to charlatans.
IOW, that comment was a sarcastic poke from someone who already supports AI workloads at work and has some knowledge about how all this works. ;)
[0]: https://notes.bayindirh.io/notes/Lists/Discussions+about+Art...
If you let it spew out an explanation along with the answer, I'm curious if the accuracy will improve (I suspect it will).
Have reason be optional and instruct it to only provide reason for the middle "Mostly True" or "Misleading".
You find one almond tree outside of California that grows almonds, where such almonds are grown intentionally, and the claim is false.
Other burning questions: What methodology was used to choose the question set? Why not allow explanations? How many passes were done for each LLM?
https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
A few examples:
> Ruskin Bond was born on May 19, 1934, in Kasauli, Himachal Pradesh, India.
> In the Libra clubs' contract with Grupo Globo for broadcast rights through 2029, the audience-revenue distribution equals 30% of the fixed amount the clubs receive.
After a couple of seconds, the result does appear.
Happened to be just within my threshold for considering it broken, because the URL bar was "finished", and the spinner doesn't spin, but the last point is probably caused by my a11y settings (prefer no animations and no autoplay).
It's also a bit weird to "disclose use of LLMs". It rubs me wrong, the same way parents breathlessly talking about "screen time" rubbed me wrong: it's too general, and with such a broad brush, it's going to sweep up a bunch of perfectly fine usage with a bunch of dubious usage. On the flip side, if folks do start disclosing all the time, it's going to turn into the Prop 65 warnings in CA, where everything says it has lead in it, so folks pretty much ignore it and move on.
If the report's conclusions and reasoning lean on LLMs, or if the data processing itself was done with LLMs, that would be interesting, and I wouldn't treat it as some sort of disclosure, but rather discuss it under methodology. Using LLMs to polish the language a bit after writing an initial draft with key findings? Much less interesting.
I realize this is now a religious issue, and some folks are allergic to anything that touched an LLM. I just don't think that perspective is going to end up having a good shelf life.
I'm sure you realize that this website/article will now be sent around to a lot of people, many who don't realize exactly how this was written, because they don't read HN comments, they only skim the page contents, and I think most would (incorrectly) assume a report about infallible LLMs to not be written by LLMs, especially when the authors are the same ones who made the report itself.
Hope this helps!
Even the referenced papers to show models can have bias don’t show anything about grok.
Overall you have given me zero evidence that grok model itself has some political bias.
FWIW I don’t mind bias but I haven’t seen evidence of it.
Why on earth would anyone think such a model is biased?
Granted, there certainly are other unflattering adjectives one could have chosen to describe this instead.
My implicit assumption is that if you fact-check the fact-check, any label other than "true" means the original fact-check is unacceptable
I agree you don't owe anyone a reproduction, but you also don't owe anyone an effort to discredit the study, and you did it.
>> I don't think I need to spend more time on this than I have.
How pious of you. I am still looking into the credibility of the study. It will take me more than 25 min... but I am really looking forward to seeing what this means for this 10 trillion industry.
I can however notice you had enough urgency to publicly critique the study within 25 minutes, and your comments carry weight, but when asked about checking whether the headline result actually holds, the answer is “why would I?”
The headline result definitely does not hold, given that the task involves many questions that cannot be answered but there's no option for "cannot be answered" - so models are forced to reply effectively at random.
I don't think this study is good enough that I should amplify it on my own blog, or bad enough that I should criticize it in a venue any more prominent than some Hacker News comments.
Misleading should be removed as a category and replaced with a better hedge like "not sure"
In section 2, 34% of cases are found to have "substantive" disagreements differing by 2 or more buckets - True + Misleading, Mostly True + False, or True + False.
This is probably a better measure than the headline one. It's still a concerning fraction, although some fraction is no doubt due to forcing "I don't know" cases to return an answer anyway.
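For anyone who wants to recompute that "2 or more buckets" measure from the published labels, here is a rough sketch. It assumes the buckets are ordered True < Mostly True < Misleading < False, which matches the pairs listed above; `bucket_spread`, `substantive_share`, and the sample panel are hypothetical and not taken from the study's code or data.

```python
# Assumed ordinal positions of the four buckets.
ORDER = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}


def bucket_spread(labels):
    """Distance between the most and least favorable label on one claim."""
    ranks = [ORDER[label] for label in labels]
    return max(ranks) - min(ranks)


def substantive_share(panel_labels, min_spread=2):
    """Fraction of claims whose labels differ by at least min_spread buckets."""
    if not panel_labels:
        return 0.0
    hits = sum(1 for labels in panel_labels if bucket_spread(labels) >= min_spread)
    return hits / len(panel_labels)


# Made-up labels for three claims, one list per claim, one label per model.
sample = [
    ["True", "True", "Mostly True", "True", "True"],      # spread 1
    ["True", "Misleading", "True", "Mostly True", "True"],  # spread 2
    ["False", "False", "False", "False", "False"],        # spread 0
]
print(substantive_share(sample))
```

Any disagreement at all corresponds to `min_spread=1`, so the same function covers both the headline number and the stricter one.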
lack of agreement when there is no singular correct answer (or any answer at all) isn't a useful metric
I ran into a lot of these kinds of issues when working on the Citation Needed WMF project (and related extensions). Truth is so often very nuanced.
No system can know everything. It doesn't matter how many tools you give it. It's always wrong to force binary True / False without shades of "I don't know"
As an aside though it's still funny that the two tools WITH search also disagreed.
Don't even need politics for it, there is no point in probing a mathematical black box for "how many soldiers died in the year X in war Y".
Any original source is preferable to a blurry "summary" of unknown sources, and this is why the article has a valuable point.
There's also no point in asking "Is Paris in France" either, if you substitute city and country with real data. An encyclopedia or manual check of different sources such as maps, while not infallible, is a better source.
If you already know the country Paris belongs to, there's no point in asking, anyway.
https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
One example:
Researchers estimate that the average person ingests about 5 grams of plastic per week, which is approximately the weight of a credit card.
Gemini retrieval: Misleading
Sonar pro: Mostly True
Was the research flagrantly incorrect? Yes. But that does not affect the truth of the statement.
If you watch their reasoning traces they often say things like "this is a well-known historical fact so I don't need to search for it", or more frequently they spit off a bunch of searches.
>You're absolutely right about the humidity — I was sloppy with that aside. If you ventilate enough to meaningfully cool the room, you're replacing indoor air with outdoor air wholesale, and you'd converge on outdoor conditions: 64°F and near-100% RH. That's miserable. The 55-60% figure I tossed out was hand-wavy nonsense — it would only hold if you barely cracked the window and mixed a tiny fraction of outdoor air in. At any ventilation rate that actually cools, you're just moving outside air inside.
Models give much better answers when they can "think out loud" before answering, and storing that rationale will make it easier to understand why they picked different answers for ambiguous questions.
Good pattern: {"explanation": <short explanation for your answer>, "answer": <your final answer: true|false|i don't know>}
Bad pattern: {"answer": <your answer here>, "explanation": <short explanation for your answer>}
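The ordering matters because the model generates tokens left to right, so an answer that comes after the explanation can be conditioned on it. One way to hold a model to the good pattern is to check key order when parsing the reply. This is a minimal sketch, not what the study did: `parse_reply`, `ALLOWED_ANSWERS`, and the sample replies are hypothetical, and it relies on Python's `json.loads` keeping keys in the order they appear in the text.

```python
import json

# Assumed label set, taken from the good pattern above.
ALLOWED_ANSWERS = {"true", "false", "i don't know"}


def parse_reply(raw: str) -> tuple[str, str]:
    """Return (answer, explanation) from an explanation-first JSON reply."""
    obj = json.loads(raw)
    # Reject replies where the answer was emitted before the explanation.
    if not isinstance(obj, dict) or list(obj) != ["explanation", "answer"]:
        raise ValueError("expected keys 'explanation' then 'answer', in that order")
    answer = str(obj["answer"]).strip().lower()
    if answer not in ALLOWED_ANSWERS:
        raise ValueError(f"unexpected answer label: {answer!r}")
    return answer, str(obj["explanation"])


good = '{"explanation": "The claim matches the sources I checked.", "answer": "True"}'
bad = '{"answer": "True", "explanation": "Decided first, justified afterwards."}'

print(parse_reply(good))
try:
    parse_reply(bad)
except ValueError as err:
    print("rejected:", err)
```

Storing the returned explanation next to the label also gives you the rationale to compare when two models disagree on an ambiguous claim.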