Original Research · Published 2025–2026

AI Detection Research

Independent studies on detection accuracy, humanizer bypass rates, false positive distributions, and statistical signal calibration. No vendor funding. Full data published.

Published Studies: 3
Samples Analyzed: 8,600
Humanizers Tested: 14
Vendor Funding: 0

Study 1 — AI Humanizer Bypass Rates: 2025 Annual Survey

March 2026 · 4,200 samples · 14 humanizers × 6 detectors

We tested 14 commercially available AI text humanizer tools against 6 AI detectors using 4,200 text samples. The humanizers ranged from simple paraphrase-and-replace tools to sophisticated neural rewriters. For each pairing we measured accuracy before and after humanization and calculated the bypass rate — the percentage of AI texts that a detector failed to flag after humanization was applied.
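The bypass-rate definition above can be sketched in a few lines. This is a minimal illustration, not the study's actual pipeline: the function names and the per-sample boolean verdicts are assumptions, and accuracy is simplified to the detection rate on AI-written samples only.

```python
def detection_rate(flags: list[bool]) -> float:
    """Percentage of AI-written samples that the detector flagged as AI."""
    if not flags:
        raise ValueError("need at least one sample")
    return 100.0 * sum(flags) / len(flags)


def bypass_rate(flags_after: list[bool]) -> float:
    """Percentage of humanized AI samples that the detector failed to flag."""
    return 100.0 - detection_rate(flags_after)


def accuracy_drop_pp(flags_before: list[bool], flags_after: list[bool]) -> float:
    """Drop in detection rate, in percentage points, caused by humanization."""
    return detection_rate(flags_before) - detection_rate(flags_after)


# Hypothetical verdicts for five AI texts, before and after humanization.
before = [True, True, True, True, False]
after = [True, False, False, True, False]

print(f"bypass rate: {bypass_rate(after):.1f}%")
print(f"accuracy drop: {accuracy_drop_pp(before, after):.1f}pp")
```

Note that the drop is expressed in percentage points (a difference of two rates), while the bypass rate is itself a percentage of samples.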

Key Findings

Average Accuracy Drop: −31pp. On average, detectors lost 31 percentage points of accuracy across all detector–humanizer pairings.
Worst Bypass Rate: 91%. The worst-case pairing (neural rewriter vs. GPTZero) bypassed detection 91% of the time.
Best Resistance: −24pp. Proofademic AI retained the highest accuracy on humanized content (69%) after a 24pp drop.
Lowest Bypass Rate: 23%. In the best case for detectors, simple word-substitution humanizers bypassed detection only 23% of the time.

Post-Humanization Accuracy by Detector

Detector | Post-humanization accuracy | Accuracy drop
GPTZero | 54% | −33pp
Sapling AI | 55% | −21pp
Copyleaks | 59% | −20pp
Writer.com | 61% | −23pp
Originality.ai | 67% | −24pp
Proofademic AI | 69% | −24pp
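Reading each percentage as post-humanization accuracy (which matches the 93% to 69% figure quoted for Proofademic AI in the FAQ), the pre-humanization baseline for each detector can be recovered by adding the drop back. The sketch below is simple arithmetic on the published figures; the dictionary layout and function name are illustrative assumptions.

```python
# (post-humanization accuracy in %, accuracy drop in percentage points),
# copied from the per-detector figures above.
RESULTS = {
    "GPTZero": (54, 33),
    "Sapling AI": (55, 21),
    "Copyleaks": (59, 20),
    "Writer.com": (61, 23),
    "Originality.ai": (67, 24),
    "Proofademic AI": (69, 24),
}


def baseline_accuracy(post_accuracy: int, drop_pp: int) -> int:
    """Pre-humanization accuracy implied by the post-humanization figure and the drop."""
    return post_accuracy + drop_pp


baselines = {name: baseline_accuracy(post, drop) for name, (post, drop) in RESULTS.items()}

for name, baseline in baselines.items():
    post, drop = RESULTS[name]
    print(f"{name}: {baseline}% before, {post}% after ({drop}pp drop)")
```

These baselines are derived values, not separately reported measurements.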

Implication: No detector is robust against all humanizers. The tools most vulnerable are those that rely primarily on perplexity scoring (GPTZero, Sapling) rather than hybrid approaches. Provenance-based systems (C2PA metadata, SynthID watermarking) are not defeated by humanization because they operate at the generation layer, not the output text layer. Statistical detection alone should not be treated as a durable solution in adversarial contexts.

Study 2 — Domain-Specific False Positive Rates

February 2026 · 2,400 human samples · 8 writing domains

Published benchmark false positive rates (FPRs) are typically averages across content types. This study tested whether FPRs differ materially by domain, and by how much. We collected 300 human-written samples in each of 8 writing domains and ran each sample through 4 major detectors at default thresholds.

Domain | Proofademic AI | Originality.ai | GPTZero | Sapling AI | Risk Level
News Journalism | 4% | 5% | 6% | 9% | Low
Creative Fiction | 5% | 6% | 11% | 13% | Low–Medium
Academic Essays | 5% | 8% | 9% | 16% | Medium
Business Writing | 6% | 9% | 12% | 18% | Medium
Legal Writing | 8% | 11% | 14% | 22% | Medium–High
STEM Academic | 9% | 12% | 18% | 27% | High
Non-Native English | 11% | 16% | 21% | 31% | Very High
Technical Docs | 12% | 14% | 19% | 31% | Very High

Key finding: Non-native English writers and STEM/technical writers face FPRs 2–5× higher than the headline benchmark average. Technical writing with domain-specific vocabulary produces low perplexity scores that statistical detectors misread as AI signals. This is consistent with findings published in Liang et al. (2023), which documented FPRs exceeding 50% on TOEFL essays for some detectors. Institutions using AI detectors for academic integrity should explicitly measure FPRs on their own student population’s content type before deployment.
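For institutions measuring FPR on their own content, sample size matters: with 300 human samples per domain, a point estimate still carries several points of uncertainty. The sketch below computes an FPR with a Wilson score interval using only the standard library. It is an illustration under assumptions, not the study's method; the function name and the count of 93 flagged samples (roughly 31% of 300) are hypothetical.

```python
import math


def wilson_interval(false_positives: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a false positive rate.

    false_positives: human-written samples wrongly flagged as AI
    n: total human-written samples tested
    z: normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    if n <= 0 or not 0 <= false_positives <= n:
        raise ValueError("require n > 0 and 0 <= false_positives <= n")
    p = false_positives / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical count: 93 of 300 human samples flagged (about 31%).
low, high = wilson_interval(93, 300)
print(f"FPR = {93 / 300:.1%}, 95% interval = [{low:.1%}, {high:.1%}]")
```

A smaller local sample widens the interval, so an institution testing only a few dozen documents should treat its measured FPR as a rough estimate.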

Study 3 — Voice Deepfake Detection Benchmark 2025

January 2026 · 600 audio clips · 8 TTS systems

Voice deepfake detection is substantially less mature than text detection. We tested 5 voice authenticity tools across 600 audio clips spanning 8 TTS systems (ElevenLabs, PlayHT, Murf, Speechify, Google TTS, Amazon Polly, Microsoft Azure TTS, and OpenAI TTS) and 4 voice cloning frameworks.

Hive Moderation: 88% accuracy, FPR 9% (best)
ElevenLabs Detect: 83% accuracy, FPR 11%
Resemble Detect: 79% accuracy, FPR 14%

All voice detectors degraded significantly on expressive or emotionally varied synthetic voice. Current detectors appear to rely on spectral artifacts that newer high-quality TTS systems are actively eliminating in successive model versions. Voice cloning of specific speakers (vs. generic TTS) reduced accuracy by a further 12–18 percentage points across all tested tools.

Frequently Asked Questions

Can any AI detector reliably detect humanized AI text?

No. Our bypass study found accuracy drops of 20–33 percentage points on humanized text across all tested detectors. The tool with the highest post-humanization accuracy (Proofademic AI) still dropped from 93% to 69% on humanized content. Detectors that rely primarily on perplexity scoring (GPTZero) are most vulnerable, losing up to 33 percentage points. Provenance-based approaches such as C2PA metadata and SynthID watermarking are not affected by humanization but require adoption at the generation layer.

Why do non-native English speakers get flagged more often?

Non-native English writing tends to use more formulaic sentence structures, more transition phrases, and lower vocabulary diversity than native speaker writing — all statistical characteristics that AI detectors associate with machine-generated text. Our data shows FPRs 2–4× higher for non-native English writers. This is consistent with published academic research. Institutions serving international student populations should measure FPRs on their specific student body before relying on AI detection for integrity enforcement.
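Vocabulary diversity, one of the characteristics mentioned above, is often approximated with a type-token ratio (unique words divided by total words). The sketch below is only an illustration of that idea; it is not the feature set of any detector tested here, and the tokenizer and function name are assumptions.

```python
import re


def type_token_ratio(text: str) -> float:
    """Unique words divided by total words; lower values mean less varied vocabulary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


formulaic = "In addition, the result is good. In addition, the method is good."
varied = "Beyond that, outcomes looked strong while our approach stayed simple."

print(f"formulaic: {type_token_ratio(formulaic):.2f}")
print(f"varied: {type_token_ratio(varied):.2f}")
```

The raw ratio falls as texts get longer, so comparisons are only meaningful between samples of similar length.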

Are these studies peer-reviewed?

Our studies are published as independent research, not through formal academic peer review. We document methodology in full and publish all data publicly to enable independent replication. We are aware of and cite relevant peer-reviewed work where applicable (Liang et al. 2023, Mitchell et al. 2023). Our methodology page documents all testing decisions and scope limitations transparently.

How often do you update your research?

We publish new studies quarterly and update existing figures when material changes occur (detector updates, new model releases). All published studies are dated and include a “last reviewed” notation. The current studies reflect testing conducted in January–March 2026.