Study 1 — AI Humanizer Bypass Rates: 2025 Annual Survey
We tested 14 commercially available AI text humanizer tools against 6 AI detectors using 4,200 text samples. The humanizers ranged from simple paraphrase-and-replace tools to sophisticated neural rewriters. For each humanizer-detector pairing we measured detection accuracy before and after humanization and calculated the bypass rate: the percentage of AI texts that a detector failed to flag after humanization was applied.
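The bypass-rate definition above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the `Result` record and its field names are hypothetical, and it assumes every record corresponds to an AI-generated sample that has been humanized.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Result:
    """One AI-generated sample scored by one detector (hypothetical schema)."""
    detector: str
    humanizer: str
    flagged_before: bool  # detector flagged the original AI text
    flagged_after: bool   # detector flagged the humanized version


def bypass_rate(results: Iterable[Result]) -> float:
    """Percentage of AI texts the detector failed to flag after humanization."""
    results = list(results)
    if not results:
        raise ValueError("need at least one result")
    missed = sum(1 for r in results if not r.flagged_after)
    return 100.0 * missed / len(results)


def detection_rate_before(results: Iterable[Result]) -> float:
    """Percentage of AI texts flagged before humanization, for comparison."""
    results = list(results)
    if not results:
        raise ValueError("need at least one result")
    caught = sum(1 for r in results if r.flagged_before)
    return 100.0 * caught / len(results)
```

Grouping the records by `detector` (or by detector and humanizer) before calling `bypass_rate` would give one figure per pairing.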
Key Findings
Bypass Rates by Detector (After Humanization)
Implication: No detector is robust against all humanizers. The tools most vulnerable are those that rely primarily on perplexity scoring (GPTZero, Sapling) rather than hybrid approaches. Provenance-based systems (C2PA metadata, SynthID watermarking) are not defeated by humanization because they operate at the generation layer, not the output text layer. Statistical detection alone should not be treated as a durable solution in adversarial contexts.
Study 2 — Domain-Specific False Positive Rates
All published benchmark false positive rate (FPR) figures are averages across content types. This study tested whether FPRs differ materially by domain, and by how much. We collected 300 human-written samples in each of 8 writing domains and ran each sample through 4 major detectors at default thresholds.
| Domain | Proofademic AI | Originality.ai | GPTZero | Sapling AI | Risk Level |
|---|---|---|---|---|---|
| News Journalism | 4% | 5% | 6% | 9% | Low |
| Creative Fiction | 5% | 6% | 11% | 13% | Low–Medium |
| Academic Essays | 5% | 8% | 9% | 16% | Medium |
| Business Writing | 6% | 9% | 12% | 18% | Medium |
| Legal Writing | 8% | 11% | 14% | 22% | Medium–High |
| STEM Academic | 9% | 12% | 18% | 27% | High |
| Non-Native English | 11% | 16% | 21% | 31% | Very High |
| Technical Docs | 12% | 14% | 19% | 31% | Very High |
Key finding: Non-native English writers and STEM/technical writers face FPRs 2–5× higher than the headline benchmark average. Technical writing with domain-specific vocabulary produces low perplexity scores that statistical detectors misread as AI signals. This is consistent with findings published in Liang et al. (2023), which documented FPRs exceeding 50% on TOEFL essays for some detectors. Institutions using AI detectors for academic integrity should explicitly measure FPRs on their own student population’s content type before deployment.
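For institutions that want to measure FPRs on their own content, the calculation can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's own code: the input format (pairs of domain and a flagged/not-flagged boolean, covering human-written samples only) is hypothetical, and the Wilson score interval is our choice for expressing sampling uncertainty, not something the study reports.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def wilson_interval(k: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion k/n (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def fpr_by_domain(
    records: Iterable[Tuple[str, bool]],
) -> Dict[str, Tuple[float, float, float]]:
    """Map each domain to (FPR, lower bound, upper bound).

    `records` holds (domain, flagged) pairs for HUMAN-written samples only,
    so every flagged sample is a false positive.
    """
    counts = defaultdict(lambda: [0, 0])  # domain -> [flagged, total]
    for domain, flagged in records:
        counts[domain][1] += 1
        if flagged:
            counts[domain][0] += 1
    return {d: (k / n, *wilson_interval(k, n)) for d, (k, n) in counts.items()}
```

With 300 samples per domain, an observed FPR of 11% (33 flagged samples) carries an interval of roughly 8% to 15%, so small differences between adjacent rows in the table should be read with that uncertainty in mind.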
Study 3 — Voice Deepfake Detection Benchmark 2025
Voice deepfake detection is substantially less mature than text detection. We tested 5 voice authenticity tools across 600 audio clips spanning 8 TTS systems (ElevenLabs, PlayHT, Murf, Speechify, Google TTS, Amazon Polly, Microsoft Azure TTS, and OpenAI TTS) and 4 voice cloning frameworks.
All voice detectors degraded significantly on expressive or emotionally varied synthetic voices. Current detectors appear to rely on spectral artifacts that newer high-quality TTS systems are actively eliminating in successive model versions. Cloning the voice of a specific speaker (as opposed to generic TTS) reduced accuracy by a further 12–18 percentage points across all tested tools.
Frequently Asked Questions
Can any AI detector reliably detect humanized AI text?
No. Our bypass study found accuracy drops of 21–33 percentage points on humanized text across all tested detectors. The most resistant tool (Proofademic AI) still dropped from 93% to 69% accuracy on humanized content. Detectors that rely primarily on perplexity scoring (GPTZero) are most vulnerable, losing up to 33 percentage points. Provenance-based approaches such as C2PA metadata and SynthID watermarking are not affected by humanization but require adoption at the generation layer.
Why do non-native English speakers get flagged more often?
Non-native English writing tends to use more formulaic sentence structures, more transition phrases, and lower vocabulary diversity than native-speaker writing. These are all statistical characteristics that AI detectors associate with machine-generated text. Our data shows FPRs 2–4× higher for non-native English writers. This is consistent with published academic research. Institutions serving international student populations should measure FPRs on their specific student body before relying on AI detection for integrity enforcement.
Are these studies peer-reviewed?
Our studies are published as independent research, not through formal academic peer review. We document methodology in full and publish all data publicly to enable independent replication. We are aware of and cite relevant peer-reviewed work where applicable (Liang et al. 2023, Mitchell et al. 2023). Our methodology page documents all testing decisions and scope limitations transparently.
How often do you update your research?
We publish new studies quarterly and update existing figures when material changes occur (detector updates, new model releases). All published studies are dated and include a “last reviewed” notation. The current studies reflect testing conducted in January–March 2026.