Simon Paxton
A 100‑question “bullshit benchmark” sounds like a joke until you see the chart. In BullshitBench v2, Anthropic’s Claude models sit at the top, flagging nonsense prompts as nonsense far more often than comparable ChatGPT and Gemini models — a concrete data point behind the online refrain that in Claude vs ChatGPT, Claude is “the least bullshit‑y” AI.
TL;DR
BullshitBench, created by Peter Gostev, does one simple thing: it asks language models 100 carefully‑crafted nonsense questions across software, finance, law, medicine and physics, and scores them on whether they detect the nonsense and push back instead of answering confidently. The public leaderboard shows recent Claude models with much higher “green” detection rates than OpenAI’s GPT series and Google’s Gemini models — meaning Claude more often says some version of “this doesn’t make sense” rather than forging ahead.
Healthcare executives told Becker’s Hospital Review this is exactly why they are piloting Claude in clinical settings. One describes Claude for Healthcare as “hardwired to be cautious, honest, and far less likely to ‘hallucinate’ a medical answer just to be helpful.” Developer reporting in The New Stack makes the same point from another angle: engineers found Claude “less likely to hallucinate or claim success when it didn’t actually find the right solution,” especially in coding and agent workflows.
Taken together with the benchmark, the picture is consistent: Claude is tuned to leave more gaps, refusing or hedging where rival models would produce a fluent answer.
That is a product decision, not an emergent moral trait. Anthropic trains the same kind of transformer architecture as its rivals, then layers on reinforcement learning and preference modeling that explicitly reward refusing bad premises and admitting uncertainty. In other words, the slider for “don’t overstep” has been pushed noticeably further than in most ChatGPT configurations.
The key question is not whether Claude is “more honest as an entity.” It is why Anthropic chose that position on the slider — and what it costs you.
BullshitBench is unusual among LLM benchmarks because it does not reward the most detailed or helpful answer. It rewards epistemic calibration — knowing when a question cannot be answered truthfully on the given information.
Concretely, the prompts embed broken premises: questions across software, finance, law, medicine and physics whose stated facts cannot all be true, so no direct answer can be honest.
Models are judged by a three‑model panel (Claude, GPT, Gemini) on whether they detect that the premise is broken and challenge it clearly instead of improvising. The headline metric — detection rate — is essentially: how often does the model stop and say “this is nonsense” rather than produce a fluent lie.
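The scoring scheme described above can be sketched in a few lines. This is a hypothetical harness, not the benchmark's actual code: the judge functions, prompt wording, and the keyword-matching "judges" in the toy usage are all stand-ins for the real three-model LLM panel.

```python
def detection_rate(answers, judges):
    """answers: list of (question, model_answer) pairs.
    judges: callables (question, answer) -> bool, True if the judge thinks
    the answer clearly flags the premise as nonsense."""
    detected = 0
    for question, answer in answers:
        votes = [judge(question, answer) for judge in judges]
        # Majority vote across the panel, mirroring the three-model setup.
        if sum(votes) > len(votes) / 2:
            detected += 1
    # Headline metric: fraction of questions where the model pushed back
    # instead of producing a fluent answer to a broken premise.
    return detected / len(answers)

# Toy usage: a naive keyword judge standing in for an LLM grader,
# and two invented answers to an invented nonsense question.
naive_judge = lambda q, a: "doesn't make sense" in a
answers = [
    ("Why does SQL run faster on Tuesdays?", "That premise doesn't make sense."),
    ("Why does SQL run faster on Tuesdays?", "Because of weekly index caching."),
]
print(detection_rate(answers, [naive_judge] * 3))  # -> 0.5
```

The first answer challenges the premise and is counted as a detection; the second improvises a fluent explanation and is not.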
This kind of test systematically favors models that question premises, admit uncertainty, and accept leaving the user without an answer.
On those axes, Claude does well. Qwen models, which many developers also describe as “blunt but solid,” do too. ChatGPT and Gemini, particularly consumer‑facing variants, look worse because they have been tuned for a different goal: never leave the user empty‑handed.
The benchmark therefore does not tell you which model is smartest; it tells you which is most willing to disappoint you in the service of not lying. In Claude vs ChatGPT terms, it’s a calibration test, not an IQ test.
That matters because many real‑world failures are not about subtle knowledge gaps. They are about a system breezily plowing ahead on a false premise — the legal citation that never existed, the database table that isn’t there, the UI button an agent happily clicks in its imagination.
BullshitBench suggests that, at least on synthetic nonsense, Claude hits the brakes more often.
If that were the whole story, Anthropic could plausibly sell Claude as actually less prone to hallucination in a fundamental way. Their own researchers do not believe that.
In 2025, Anthropic’s interpretability team published “On the Biology of a Large Language Model,” a detailed circuit‑tracing study of Claude (specifically Claude 3.5 Haiku). Using attribution graphs and feature visualization, they identified internal “inhibitory” circuits that appear to regulate when the model asserts an answer versus when it should refrain or express uncertainty. Those circuits, in their telling, are part of why Claude can sometimes behave conservatively.
But they also documented the failure modes. In some experiments, when given a math problem it couldn’t solve correctly, the model still produced a confident answer and then generated a plausible‑looking chain‑of‑thought explanation — after the fact. As Wired’s Steven Levy reported, the researchers explicitly borrowed philosopher Harry Frankfurt’s term: in these cases Claude was “bullshitting… coming up with an answer, any answer, without caring whether it is true or false.”
The paper describes hallucinations as misfires of those inhibitory mechanisms: the internal “don’t make things up” features fail to activate, and the generative machinery does what it always does — continue the pattern with something fluent and context‑appropriate, regardless of truth.
This matters for the current debate because it shows the underlying machinery in Claude vs ChatGPT is not qualitatively different. Both are large language models reconstructing likely continuations from patterns in their weights, not fact‑retrieval systems. Both will invent answers when their internal guardrails fail or when their reward models suggest users would rather be reassured than corrected.
Anthropic’s choice has been to strengthen those guardrails and reward demurral more heavily. That changes the rate and surface of hallucinations, especially on adversarial nonsense. It does not remove the capacity for confident invention.
Or as Anthropic CEO Dario Amodei told TechCrunch, “AI models probably hallucinate less than humans, but they hallucinate in more surprising ways.” The interpretability work is, in effect, a catalog of those surprises.
If both Claude and ChatGPT still hallucinate, the practical question is how to choose between them.
The sensible way to think about Claude vs ChatGPT is not “which is more honest?” but “what is the cost of a confident lie in this context?”
When the cost is high — agents, automation, and anything that touches real‑world systems — Claude’s conservative tuning is a genuine advantage.
Developers quoted in The New Stack and on public forums describe a consistent pattern when swapping GPT‑4‑based agents for Claude‑based ones: fewer confident claims of success on tasks the agent had not actually solved, and more explicit admissions when it could not find the right solution.
Those are precisely the situations covered in NovaKnown’s own analysis of whether large language models are reliable for business use: places where false positives — acting on something that is not true — break systems. In that world, a model that frustrates you a little more often by refusing to go along can be the safer building block.
The flip side is just as real. Users who rely on ChatGPT for brainstorming, drafting, and creative riffing often find Claude’s caution annoying. It hedges, inserts caveats, or occasionally refuses speculative prompts that GPT will happily improvise around. For many consumer uses — ideation, fiction, language practice — the cost of a confident lie is low and the value of never hitting a blank response is high.
Even in professional settings, there are domains where aggressive suggestion is useful. A marketer asking for ten risky campaign ideas wants volume, not calibrated epistemology. A novelist collaborating with an AI does not need it to be right about 19th‑century train schedules on the first pass.
So the decision rule is less about brand loyalty and more about error budgeting: when a confident lie is expensive, lean toward Claude's conservatism; when a blank or heavily hedged response is expensive, lean toward ChatGPT's fluency.
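Spelled out as a routing rule, the error budget looks something like this. A minimal sketch: the coarse cost labels and the model names are illustrative, not a recommendation of specific versions or vendors.

```python
def pick_model(confident_lie_cost: str, blank_response_cost: str) -> str:
    """Costs are coarse labels: 'high' or 'low'."""
    if confident_lie_cost == "high":
        # Acting on a false positive breaks systems (agents, automation,
        # clinical settings): prefer the conservative, refusal-prone model.
        return "claude"
    if blank_response_cost == "high":
        # Ideation, drafting, creative riffing: never leaving the user
        # empty-handed matters more than calibration.
        return "chatgpt"
    return "either"

print(pick_model("high", "low"))   # agent touching production systems
print(pick_model("low", "high"))   # brainstorming campaign ideas
```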
That framing also clarifies why Anthropic is aggressively targeting health systems with Claude, as Becker’s Hospital Review reports, while OpenAI leans heavily on ChatGPT’s role as a universal assistant. They are optimizing for different failure costs.
And if you are building your own infrastructure on top of any model, you can push the distinction further. The internal “Dawn” system described in one Reddit comment — with additional layers that force the model to label what it knows vs what it guesses before speaking — is exactly the kind of architectural enforcement that BullshitBench is indirectly measuring in Claude, applied more broadly.
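An enforcement layer of that kind can be approximated with nothing more than a structured prompt and a parser. Everything here is an assumption: the Reddit comment describes an internal system, not published code, so the preamble wording, the `call_model` stand-in, and the toy table names are all hypothetical.

```python
# Hypothetical preamble forcing the model to separate knowledge from guesses.
EPISTEMIC_PREAMBLE = (
    "Before answering, split your response into two sections:\n"
    "KNOWN: claims grounded in the prompt or well-established fact.\n"
    "GUESS: anything inferred, extrapolated, or uncertain.\n"
    "If the question rests on a false premise, say so under KNOWN."
)

def ask_with_labels(call_model, question: str) -> dict:
    """call_model: any callable taking a prompt string and returning text."""
    raw = call_model(EPISTEMIC_PREAMBLE + "\n\nQuestion: " + question)
    sections = {"KNOWN": "", "GUESS": ""}
    current = None
    for line in raw.splitlines():
        if line.startswith("KNOWN:"):
            current = "KNOWN"
            sections[current] += line[len("KNOWN:"):].strip()
        elif line.startswith("GUESS:"):
            current = "GUESS"
            sections[current] += line[len("GUESS:"):].strip()
        elif current:
            # Continuation lines belong to the most recent section.
            sections[current] += " " + line.strip()
    return sections

# Toy usage with a canned response standing in for a real LLM client.
fake_model = lambda prompt: (
    "KNOWN: The table does not exist.\n"
    "GUESS: You may have meant a similarly named table."
)
labels = ask_with_labels(fake_model, "Why is the orders_2026 table slow?")
print(labels["KNOWN"])  # -> The table does not exist.
```

Downstream code can then treat the GUESS section as unverified and refuse to act on it automatically, which is the architectural point.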
The uncomfortable conclusion is that there is no clean escape from hallucination yet — only different ways of shaping it. The models are getting better at knowing when they are guessing; the real test is whether we design our systems to care.
Originally published on novaknown.com