Simon Paxton
A 100‑question “bullshit benchmark” sounds like a joke until you see the chart. In BullshitBench v2, Anthropic’s Claude models sit at the top, flagging nonsense prompts as nonsense far more often than comparable ChatGPT and Gemini models — a concrete data point behind the online refrain that in Claude vs ChatGPT, Claude is “the least bullshit‑y” AI.
TL;DR
BullshitBench, created by Peter Gostev, does one simple thing: it asks language models 100 carefully‑crafted nonsense questions across software, finance, law, medicine and physics, and scores them on whether they detect the nonsense and push back instead of answering confidently. The public leaderboard shows recent Claude models with much higher “green” detection rates than OpenAI’s GPT series and Google’s Gemini models — meaning Claude more often says some version of “this doesn’t make sense” rather than forging ahead.
Healthcare executives told Becker’s Hospital Review this is exactly why they are piloting Claude in clinical settings. One describes Claude for Healthcare as “hardwired to be cautious, honest, and far less likely to ‘hallucinate’ a medical answer just to be helpful.” Developer reporting in The New Stack makes the same point from another angle: engineers found Claude “less likely to hallucinate or claim success when it didn’t actually find the right solution,” especially in coding and agent workflows.
Taken together with the benchmark, the picture is consistent: Claude is tuned to leave more gaps, refusing or hedging where rival models would produce a fluent answer.
That is a product decision, not an emergent moral trait. Anthropic trains the same kind of transformer architecture as its rivals, then layers on reinforcement learning and preference modeling that explicitly reward refusing bad premises and admitting uncertainty. In other words, the slider for “don’t overstep” has been pushed noticeably further than in most ChatGPT configurations.
The key question is not whether Claude is “more honest as an entity.” It is why Anthropic chose that position on the slider — and what it costs you.
BullshitBench is unusual among LLM benchmarks because it does not reward the most detailed or helpful answer. It rewards epistemic calibration — knowing when a question cannot be answered truthfully on the given information.
Concretely, the prompts embed broken premises: questions across software, finance, law, medicine and physics whose stated facts cannot all be true, so no direct answer can be honest.
Models are judged by a three‑model panel (Claude, GPT, Gemini) on whether they detect that the premise is broken and challenge it clearly instead of improvising. The headline metric — detection rate — is essentially: how often does the model stop and say “this is nonsense” rather than produce a fluent lie.
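The scoring scheme described above can be sketched in a few lines. This is a hypothetical harness, not the benchmark's actual code: the judge functions, prompt wording, and the keyword-matching "judges" in the toy usage are all stand-ins for the real three-model LLM panel.

```python
def detection_rate(answers, judges):
    """answers: list of (question, model_answer) pairs.
    judges: callables (question, answer) -> bool, True if the judge thinks
    the answer clearly flags the premise as nonsense."""
    detected = 0
    for question, answer in answers:
        votes = [judge(question, answer) for judge in judges]
        # Majority vote across the panel, mirroring the three-model setup.
        if sum(votes) > len(votes) / 2:
            detected += 1
    # Headline metric: fraction of questions where the model pushed back
    # instead of producing a fluent answer to a broken premise.
    return detected / len(answers)

# Toy usage: a naive keyword judge standing in for an LLM grader,
# and two invented answers to an invented nonsense question.
naive_judge = lambda q, a: "doesn't make sense" in a
answers = [
    ("Why does SQL run faster on Tuesdays?", "That premise doesn't make sense."),
    ("Why does SQL run faster on Tuesdays?", "Because of weekly index caching."),
]
print(detection_rate(answers, [naive_judge] * 3))  # -> 0.5
```

The first answer challenges the premise and is counted as a detection; the second improvises a fluent explanation and is not.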
This kind of test systematically favors models that question premises, admit uncertainty, and accept leaving the user without an answer.
On those axes, Claude does well. Qwen models, which many developers also describe as “blunt but solid,” do too. ChatGPT and Gemini, particularly consumer‑facing variants, look worse because they have been tuned for a different goal: never leave the user empty‑handed.
The benchmark therefore does not tell you which model is smartest; it tells you which is most willing to disappoint you in the service of not lying. In Claude vs ChatGPT terms, it’s a calibration test, not an IQ test.
That matters because many real‑world failures are not about subtle knowledge gaps. They are about a system breezily plowing ahead on a false premise — the legal citation that never existed, the database table that isn’t there, the UI button an agent happily clicks in its imagination.
BullshitBench suggests that, at least on synthetic nonsense, Claude hits the brakes more often.
If that were the whole story, Anthropic could plausibly sell Claude as actually less prone to hallucination in a fundamental way. Their own researchers do not believe that.
In 2025, Anthropic’s interpretability team published “On the Biology of a Large Language Model,” a detailed circuit‑tracing study of Claude (specifically Claude 3.5 Haiku). Using attribution graphs and feature visualization, they identified internal “inhibitory” circuits that appear to regulate when the model asserts an answer versus when it should refrain or express uncertainty. Those circuits, in their telling, are part of why Claude can sometimes behave conservatively.
But they also documented the failure modes. In some experiments, when given a math problem it couldn’t solve correctly, the model still produced a confident answer and then generated a plausible‑looking chain‑of‑thought explanation — after the fact. As Wired’s Steven Levy reported, the researchers explicitly borrowed philosopher Harry Frankfurt’s term: in these cases Claude was “bullshitting… coming up with an answer, any answer, without caring whether it is true or false.”
The paper describes hallucinations as misfires of those inhibitory mechanisms: the internal “don’t make things up” features fail to activate, and the generative machinery does what it always does — continue the pattern with something fluent and context‑appropriate, regardless of truth.
This matters for the current debate because it shows the underlying machinery in Claude vs ChatGPT is not qualitatively different. Both are large language models reconstructing likely continuations from patterns in their weights, not fact‑retrieval systems. Both will invent answers when their internal guardrails fail or when their reward models suggest users would rather be reassured than corrected.
Anthropic’s choice has been to strengthen those guardrails and reward demurral more heavily. That changes the rate and surface of hallucinations, especially on adversarial nonsense. It does not remove the capacity for confident invention.
Or as Anthropic CEO Dario Amodei told TechCrunch, “AI models probably hallucinate less than humans, but they hallucinate in more surprising ways.” The interpretability work is, in effect, a catalog of those surprises.
If both Claude and ChatGPT still hallucinate, the practical question is how to choose between them.
The sensible way to think about Claude vs ChatGPT is not “which is more honest?” but “what is the cost of a confident lie in this context?”
When the cost is high — agents, automation, and anything that touches real‑world systems — Claude’s conservative tuning is a genuine advantage.
Developers quoted in The New Stack and on public forums describe a consistent pattern when swapping GPT‑4‑based agents for Claude‑based ones: fewer confident claims of success on tasks the agent had not actually solved, and more explicit admissions when it could not find the right solution.
Those are precisely the situations covered in NovaKnown’s own analysis of whether large language models are reliable for business use: places where false positives — acting on something that is not true — break systems. In that world, a model that frustrates you a little more often by refusing to go along can be the safer building block.
The flip side is just as real. Users who rely on ChatGPT for brainstorming, drafting, and creative riffing often find Claude’s caution annoying. It hedges, inserts caveats, or occasionally refuses speculative prompts that GPT will happily improvise around. For many consumer uses — ideation, fiction, language practice — the cost of a confident lie is low and the value of never hitting a blank response is high.
Even in professional settings, there are domains where aggressive suggestion is useful. A marketer asking for ten risky campaign ideas wants volume, not calibrated epistemology. A novelist collaborating with an AI does not need it to be right about 19th‑century train schedules on the first pass.
So the decision rule is less about brand loyalty and more about error budgeting: when a confident lie is expensive, lean toward Claude's conservatism; when a blank or heavily hedged response is expensive, lean toward ChatGPT's fluency.
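Spelled out as a routing rule, the error budget looks something like this. A minimal sketch: the coarse cost labels and the model names are illustrative, not a recommendation of specific versions or vendors.

```python
def pick_model(confident_lie_cost: str, blank_response_cost: str) -> str:
    """Costs are coarse labels: 'high' or 'low'."""
    if confident_lie_cost == "high":
        # Acting on a false positive breaks systems (agents, automation,
        # clinical settings): prefer the conservative, refusal-prone model.
        return "claude"
    if blank_response_cost == "high":
        # Ideation, drafting, creative riffing: never leaving the user
        # empty-handed matters more than calibration.
        return "chatgpt"
    return "either"

print(pick_model("high", "low"))   # agent touching production systems
print(pick_model("low", "high"))   # brainstorming campaign ideas
```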
That framing also clarifies why Anthropic is aggressively targeting health systems with Claude, as Becker’s Hospital Review reports, while OpenAI leans heavily on ChatGPT’s role as a universal assistant. They are optimizing for different failure costs.
And if you are building your own infrastructure on top of any model, you can push the distinction further. The internal “Dawn” system described in one Reddit comment — with additional layers that force the model to label what it knows vs what it guesses before speaking — is exactly the kind of architectural enforcement that BullshitBench is indirectly measuring in Claude, applied more broadly.
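An enforcement layer of that kind can be approximated with nothing more than a structured prompt and a parser. Everything here is an assumption: the Reddit comment describes an internal system, not published code, so the preamble wording, the `call_model` stand-in, and the toy table names are all hypothetical.

```python
# Hypothetical preamble forcing the model to separate knowledge from guesses.
EPISTEMIC_PREAMBLE = (
    "Before answering, split your response into two sections:\n"
    "KNOWN: claims grounded in the prompt or well-established fact.\n"
    "GUESS: anything inferred, extrapolated, or uncertain.\n"
    "If the question rests on a false premise, say so under KNOWN."
)

def ask_with_labels(call_model, question: str) -> dict:
    """call_model: any callable taking a prompt string and returning text."""
    raw = call_model(EPISTEMIC_PREAMBLE + "\n\nQuestion: " + question)
    sections = {"KNOWN": "", "GUESS": ""}
    current = None
    for line in raw.splitlines():
        if line.startswith("KNOWN:"):
            current = "KNOWN"
            sections[current] += line[len("KNOWN:"):].strip()
        elif line.startswith("GUESS:"):
            current = "GUESS"
            sections[current] += line[len("GUESS:"):].strip()
        elif current:
            # Continuation lines belong to the most recent section.
            sections[current] += " " + line.strip()
    return sections

# Toy usage with a canned response standing in for a real LLM client.
fake_model = lambda prompt: (
    "KNOWN: The table does not exist.\n"
    "GUESS: You may have meant a similarly named table."
)
labels = ask_with_labels(fake_model, "Why is the orders_2026 table slow?")
print(labels["KNOWN"])  # -> The table does not exist.
```

Downstream code can then treat the GUESS section as unverified and refuse to act on it automatically, which is the architectural point.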
The uncomfortable conclusion is that there is no clean escape from hallucination yet — only different ways of shaping it. The models are getting better at knowing when they are guessing; the real test is whether we design our systems to care.
Originally published on novaknown.com