Matthew Chen
I’ve been testing AI detectors recently because they are increasingly used in schools, publishing workflows, hiring, and content moderation.
Most AI detector comparisons focus on one headline number: accuracy.
But after running a benchmark across 1,000 English texts, I think that may be the wrong metric to obsess over.
The benchmark included:
I tested four AI detectors: GPTHumanizer, GPTZero, ZeroGPT, and Sapling.
The most important question was not:
Which detector catches the most AI?
It was:
Which detector is least likely to falsely accuse a human writer?
That distinction matters a lot.
In a real school, workplace, or publishing environment, a false positive is not just a bad prediction. It can become an accusation.
| Detector | Overall Accuracy | Human False Positive Rate |
|---|---|---|
| GPTHumanizer | 98.0% | 0.0% |
| GPTZero | 98.7% | 2.2% |
| ZeroGPT | 88.2% | 18.4% |
| Sapling | 88.6% | 19.4% |
GPTZero achieved the highest overall accuracy in this benchmark.
But GPTHumanizer had the lowest human false positive rate.
ZeroGPT and Sapling were much more aggressive, and it showed in their false positives: they mislabeled 18.4% and 19.4% of human-written texts as AI, respectively.
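To make the difference between the two columns concrete, here is a minimal Python sketch of how both metrics fall out of the same confusion matrix. The counts are invented for illustration and are not the benchmark data, and the names `Confusion`, `accuracy`, and `human_false_positive_rate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts, treating "AI-written" as the positive class."""
    tp: int  # AI text correctly flagged as AI
    fn: int  # AI text that slipped through as human
    tn: int  # human text correctly passed as human
    fp: int  # human text wrongly flagged as AI


def accuracy(c: Confusion) -> float:
    """Share of all texts that were labeled correctly."""
    return (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


def human_false_positive_rate(c: Confusion) -> float:
    """Share of human-written texts that were wrongly flagged as AI."""
    return c.fp / (c.fp + c.tn)


# Made-up counts for 500 AI texts and 500 human texts (NOT the benchmark data).
cautious = Confusion(tp=490, fn=10, tn=500, fp=0)
aggressive = Confusion(tp=500, fn=0, tn=495, fp=5)

for name, c in [("cautious", cautious), ("aggressive", aggressive)]:
    print(f"{name}: accuracy={accuracy(c):.1%}, "
          f"human FPR={human_false_positive_rate(c):.1%}")
```

In this toy case the aggressive detector wins on accuracy while still flagging five human writers, which is the same trade-off the table shows between the top two rows.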
If you are using an AI detector for low-stakes filtering, raw accuracy might be useful.
But if you are using it to judge students, writers, job applicants, or employees, false positives should probably be treated as the most important metric.
A detector that catches slightly more AI but wrongly flags real human writing may be more harmful than a conservative detector that avoids false accusations.
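A rough way to see what these rates mean in practice is to convert them into expected false accusations. The sketch below takes the human false positive rates from the table above and applies them to an assumed batch of 500 human-written submissions. The batch size and the function name `expected_false_flags` are illustrative assumptions, not part of the benchmark.

```python
# Human false positive rates from the results table, written as fractions.
HUMAN_FPR = {
    "GPTHumanizer": 0.000,
    "GPTZero": 0.022,
    "ZeroGPT": 0.184,
    "Sapling": 0.194,
}

# Assumption for illustration: 500 submissions, all genuinely human-written.
HUMAN_SUBMISSIONS = 500


def expected_false_flags(fpr: float, human_submissions: int) -> int:
    """Expected number of human-written texts wrongly flagged as AI."""
    return round(fpr * human_submissions)


for detector, fpr in HUMAN_FPR.items():
    flagged = expected_false_flags(fpr, HUMAN_SUBMISSIONS)
    print(f"{detector}: about {flagged} of {HUMAN_SUBMISSIONS} "
          f"human writers wrongly flagged")
```

Under that assumption, the expected counts come out to roughly 0, 11, 92, and 97 wrongly flagged writers. Real numbers would depend on how closely the submissions resemble the benchmark texts.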
AI detectors should not be used as final proof.
At best, they should be used as weak signals alongside human review, writing history, source drafts, editing patterns, and context.
Full benchmark and methodology:
https://www.gpthumanizer.ai/blog/2026-ai-detector-benchmark
Curious how others evaluate AI detectors: would you prioritize raw accuracy, AI recall, or human false positive rate?