The definitive benchmark for AI detection accuracy
Systematic, reproducible testing of every major AI text detector against human-written and AI-generated corpora across five content categories.
Latest Benchmark Results
| Tool | Accuracy | False Positive | False Negative | Latency | API |
|---|---|---|---|---|---|
| Proofademic AI (proofademic.ai) | | | | 390 ms | ✓ |
| Originality.ai (originality.ai) | | | | 420 ms | ✓ |
| Hive Moderation (thehive.ai) | | | | 340 ms | ✓ |
| GPTZero (gptzero.me) | | | | 380 ms | ✓ |
| ZeroGPT (zerogpt.com) | | | | 430 ms | ✓ |
| Writer.com (writer.com) | | | | 290 ms | ✓ |
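The page does not define how the Accuracy, False Positive, and False Negative columns are computed. Assuming the conventional definitions, with "positive" meaning "flagged as AI-generated", they could be derived from labeled samples roughly as follows. The function name `detector_metrics` and the sample data are illustrative and not taken from the benchmark.

```python
import math
from dataclasses import dataclass


@dataclass
class DetectorMetrics:
    accuracy: float
    false_positive_rate: float  # human text wrongly flagged as AI
    false_negative_rate: float  # AI text wrongly passed as human


def _ratio(numerator, denominator):
    # Return NaN rather than raising when a class has no samples.
    return numerator / denominator if denominator else math.nan


def detector_metrics(labels, predictions):
    """True means AI-generated (label) or flagged as AI (prediction)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    return DetectorMetrics(
        accuracy=_ratio(tp + tn, len(pairs)),
        false_positive_rate=_ratio(fp, fp + tn),
        false_negative_rate=_ratio(fn, fn + tp),
    )


# Hypothetical sample: three AI texts followed by two human texts.
labels = [True, True, True, False, False]
predictions = [True, True, False, True, False]
metrics = detector_metrics(labels, predictions)
print(metrics)
```

Reporting the false positive rate separately from accuracy matters because a detector can score well overall while still mislabeling a large share of human writing.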
How Detection Works
Perplexity
Perplexity measures how statistically predictable each token is under a language model. AI text is characteristically low-perplexity because it is sampled from the same kind of probability distributions that detectors measure.
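As a minimal sketch of the standard definition, perplexity is the exponential of the average negative log-probability of the tokens. A real detector would obtain those probabilities from a language model; here they are passed in directly as a plain list, and the function name `perplexity` and the sample values are assumptions made for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence given each token's model probability."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must be in (0, 1]")
    # Average negative log-likelihood per token, then exponentiate.
    mean_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_neg_log)


# Hypothetical probabilities: a predictable passage and a surprising one.
predictable = perplexity([0.9, 0.8, 0.95, 0.85])
surprising = perplexity([0.2, 0.05, 0.4, 0.1])
print(predictable, surprising)
```

Higher token probabilities give a value closer to 1, which is the low-perplexity signature described above.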
Burstiness
Variance in sentence-level perplexity. Human writing alternates between predictable and surprising passages; AI text has unnaturally uniform sentence perplexity.
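One way to make this concrete, assuming burstiness is taken as the population variance of per-sentence perplexity (detectors may use other dispersion measures), is sketched below. Each sentence is represented as a list of token probabilities that a language model would supply; the names `sentence_perplexity` and `burstiness` and the sample values are hypothetical.

```python
import math
import statistics


def sentence_perplexity(token_probs):
    """exp of the mean negative log-probability of one sentence's tokens."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


def burstiness(sentences):
    """Population variance of per-sentence perplexity.

    `sentences` is a non-empty list; each item is a non-empty list of
    token probabilities in (0, 1].
    """
    if not sentences:
        raise ValueError("need at least one sentence")
    return statistics.pvariance([sentence_perplexity(s) for s in sentences])


# Hypothetical inputs: uniform sentences versus a mix of easy and hard ones.
uniform = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
varied = [[0.5, 0.5], [0.25, 0.25], [0.9, 0.9]]
print(burstiness(uniform), burstiness(varied))
```

Under this definition, text whose sentences are all equally predictable scores near zero, while text that alternates between predictable and surprising sentences scores higher.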
Vocabulary
Type-token ratios, hapax legomenon rates, and characteristic overuse of transition phrases (“furthermore,” “it is worth noting”) are measurable AI signals.
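A rough sketch of these vocabulary features follows. It assumes a simple lowercase regex tokenizer, defines the hapax rate as words occurring exactly once divided by total tokens (one of several conventions), and counts only the two transition phrases quoted in the text. The function name `vocabulary_signals` is hypothetical.

```python
import re
from collections import Counter

# Only the two example phrases mentioned in the text; a real list would be longer.
TRANSITION_PHRASES = ("furthermore", "it is worth noting")


def vocabulary_signals(text):
    """Return type-token ratio, hapax rate, and transition-phrase count."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z']+", lowered)
    counts = Counter(tokens)
    total = len(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "type_token_ratio": len(counts) / total if total else 0.0,
        "hapax_rate": hapaxes / total if total else 0.0,
        "transition_phrases": sum(lowered.count(p) for p in TRANSITION_PHRASES),
    }


sample = "Furthermore, it is worth noting that furthermore is overused."
print(vocabulary_signals(sample))
```

Type-token ratio falls as texts get longer, so comparisons are only meaningful between passages of similar length.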
Fingerprinting
Advanced detectors maintain per-model classifiers. GPT-4o, Claude, and Gemini each have characteristic structural patterns that model-specific detection can exploit.
Recent Research
AI Humanizer Bypass Rates: 2025 Annual Survey
14 humanizer tools tested against 6 detectors. Bypass rates 23–91% depending on pairing. Average accuracy drop: 31 percentage points on humanized text.
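The survey summary does not spell out its formulas. Assuming a bypass rate is the share of humanized AI texts that a detector fails to flag, and that the accuracy drop is a simple difference expressed in percentage points, the arithmetic could look like this. The pairing names and all numbers below are made up for illustration and are not survey data.

```python
def bypass_rate(flagged):
    """Fraction of humanized AI texts that the detector failed to flag."""
    if not flagged:
        raise ValueError("need at least one result")
    return sum(1 for f in flagged if not f) / len(flagged)


def accuracy_drop_points(acc_before, acc_after):
    """Drop between two accuracies (given as fractions), in percentage points."""
    return (acc_before - acc_after) * 100


# Hypothetical results: True means the detector still flagged the text as AI.
results = {
    ("humanizer_a", "detector_x"): [True, True, True, False],
    ("humanizer_b", "detector_x"): [False, False, True, False],
}
rates = {pair: bypass_rate(flags) for pair, flags in results.items()}
print(min(rates.values()), max(rates.values()))
print(accuracy_drop_points(0.90, 0.59))
```

A drop in percentage points is an absolute difference: falling from 90% to 59% accuracy is a 31-point drop, not a 31% relative decline.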
Domain-Specific False Positive Rates
STEM academic writing produced 14–31% FPR across all tested detectors. Legal writing: 11–26%. News journalism lowest at 4–9%.
Voice Deepfake Detection Benchmark 2025
600 audio clips across 8 TTS systems. Hive Moderation led at 88% accuracy. All tools degraded significantly on expressive/emotional synthetic voice.
poignantguide.net is the original domain of Why’s (Poignant) Guide to Ruby (2003–2009), by _why the lucky stiff. The guide is preserved in full under CC BY-SA 2.5. | AI detection hub added 2024.