Original Research · Published 2025–2026

AI Detection Research

Independent studies on detection accuracy, humanizer bypass rates, false positive distributions, and statistical signal calibration. No vendor funding. Full data published.

Published Studies

8,600 Samples Analyzed

Humanizers Tested

Vendor Funding

Study 1 — AI Humanizer Bypass Rates: 2025 Annual Survey

March 2026 · 4,200 samples · 14 humanizers × 6 detectors

We tested 14 commercially available AI text humanizer tools against 6 AI detectors using 4,200 text samples. The humanizers ranged from simple paraphrase-and-replace tools to sophisticated neural rewriters. For each pairing we measured accuracy before and after humanization and calculated the bypass rate — the percentage of AI texts that a detector failed to flag after humanization was applied.
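The two metrics defined above can be sketched in a few lines. This is a minimal illustration, not the study's actual tooling: the function names and the sample verdicts are hypothetical, and the 93% and 69% inputs are simply the Proofademic AI figures quoted elsewhere on this page.

```python
from typing import Sequence


def bypass_rate(flagged_after: Sequence[bool]) -> float:
    """Share of AI-generated texts the detector did NOT flag after humanization."""
    if not flagged_after:
        raise ValueError("need at least one sample")
    missed = sum(1 for flagged in flagged_after if not flagged)
    return missed / len(flagged_after)


def accuracy_drop_pp(acc_before: float, acc_after: float) -> float:
    """Accuracy change in percentage points (negative means accuracy fell)."""
    return round((acc_after - acc_before) * 100, 1)


# Hypothetical verdicts for 10 humanized AI texts (True = flagged as AI).
verdicts = [True, False, False, True, False, True, False, False, True, False]

print(f"bypass rate: {bypass_rate(verdicts):.0%}")
print(f"accuracy drop: {accuracy_drop_pp(0.93, 0.69)} pp")
```

Note that the bypass rate is computed only over AI-generated texts, while accuracy may also cover human-written samples, so the two numbers need not sum to 100%.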

Key Findings

Average Accuracy Drop

−31pp

31 percentage points lost across all detector–humanizer pairings on average

Worst Bypass Rate

91%

The worst-case pairing (neural rewriter vs GPTZero) bypassed detection 91% of the time

Best Resistance

−24pp

Proofademic AI showed the best resistance: it retained the highest post-humanization accuracy (69%) after a 24pp drop

Best Bypass Rate

23%

The best case for detectors: simple word-substitution humanizers bypassed detection only 23% of the time

Bypass Rates by Detector (After Humanization)

Detector	Post-humanization accuracy	Accuracy drop
GPTZero	54%	−33pp
Sapling AI	55%	−21pp
Copyleaks	59%	−20pp
Writer.com	61%	−23pp
Originality.ai	67%	−24pp
Proofademic AI	69%	−24pp
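The per-detector figures above can be combined to recover each detector's implied accuracy before humanization. The sketch below assumes each drop is measured from that detector's pre-humanization accuracy on the same samples; the dictionary and function name are illustrative only.

```python
# (post-humanization accuracy in %, accuracy drop in pp), copied from the figures above.
results = {
    "GPTZero": (54, 33),
    "Sapling AI": (55, 21),
    "Copyleaks": (59, 20),
    "Writer.com": (61, 23),
    "Originality.ai": (67, 24),
    "Proofademic AI": (69, 24),
}


def implied_baseline(post_accuracy: int, drop_pp: int) -> int:
    """Pre-humanization accuracy implied by the post-humanization figure and the drop."""
    return post_accuracy + drop_pp


baselines = {name: implied_baseline(*pair) for name, pair in results.items()}

# Print detectors from highest to lowest implied baseline.
for name, baseline in sorted(baselines.items(), key=lambda kv: kv[1], reverse=True):
    post, drop = results[name]
    print(f"{name:<15} {baseline}% -> {post}% (-{drop}pp)")
```

Under that assumption, the Proofademic AI baseline works out to 93%, which matches the figure quoted in the FAQ.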

Implication: No detector is robust against all humanizers. The tools most vulnerable are those that rely primarily on perplexity scoring (GPTZero, Sapling) rather than hybrid approaches. Provenance-based systems (C2PA metadata, SynthID watermarking) are not defeated by humanization because they operate at the generation layer, not the output text layer. Statistical detection alone should not be treated as a durable solution in adversarial contexts.
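To make "perplexity scoring" concrete, here is a minimal sketch of the underlying calculation. It assumes per-token probabilities are already available from some language model; the probabilities and the threshold are invented for illustration and are not values used by any detector named here.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-probability of the tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Arbitrary illustrative cut-off; real detectors calibrate their own thresholds.
THRESHOLD = 10.0


def looks_ai(token_probs: Sequence[float], threshold: float = THRESHOLD) -> bool:
    """Flag text as likely AI when the model finds it highly predictable."""
    return perplexity(token_probs) < threshold


# Hypothetical per-token probabilities assigned by a language model.
predictable = [0.6, 0.7, 0.5, 0.8, 0.6]      # model expected most tokens
surprising = [0.05, 0.2, 0.01, 0.1, 0.03]    # model expected few tokens

print(looks_ai(predictable), looks_ai(surprising))
```

A humanizer that swaps in less predictable wording pushes perplexity upward, which is one plausible reason a detector built mainly on this single signal would lose accuracy on rewritten text.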

Study 2 — Domain-Specific False Positive Rates

February 2026 · 2,400 human samples · 8 writing domains

Published benchmark false positive rates (FPRs) are typically averages across content types. This study tested whether FPRs differ materially by domain, and by how much. We collected 300 human-written samples in each of 8 writing domains and ran each sample through 4 major detectors at default thresholds.

Domain	Proofademic AI	Originality.ai	GPTZero	Sapling AI	Risk Level
News Journalism	4%	5%	6%	9%	Low
Creative Fiction	5%	6%	11%	13%	Low–Medium
Academic Essays	5%	8%	9%	16%	Medium
Business Writing	6%	9%	12%	18%	Medium
Legal Writing	8%	11%	14%	22%	Medium–High
STEM Academic	9%	12%	18%	27%	High
Non-Native English	11%	16%	21%	31%	Very High
Technical Docs	12%	14%	19%	31%	Very High
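The table can be used to show how a single averaged FPR hides the spread between domains. The sketch below takes the unweighted mean across the 8 domains, which approximates the pooled rate because each domain contributed the same number of samples; the variable names are illustrative.

```python
detectors = ["Proofademic AI", "Originality.ai", "GPTZero", "Sapling AI"]

# FPR in % per domain, in the same column order as the table above.
fpr_by_domain = {
    "News Journalism": [4, 5, 6, 9],
    "Creative Fiction": [5, 6, 11, 13],
    "Academic Essays": [5, 8, 9, 16],
    "Business Writing": [6, 9, 12, 18],
    "Legal Writing": [8, 11, 14, 22],
    "STEM Academic": [9, 12, 18, 27],
    "Non-Native English": [11, 16, 21, 31],
    "Technical Docs": [12, 14, 19, 31],
}

summary = {}
for col, name in enumerate(detectors):
    column = [row[col] for row in fpr_by_domain.values()]
    # (mean across domains, lowest domain FPR, highest domain FPR)
    summary[name] = (sum(column) / len(column), min(column), max(column))

for name, (mean, low, high) in summary.items():
    print(f"{name:<15} mean {mean:.1f}%  range {low}%-{high}%")
```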

Key finding: Non-native English writers and STEM/technical writers face FPRs 2–5× higher than the headline benchmark average. Technical writing with domain-specific vocabulary produces low perplexity scores that statistical detectors misread as AI signals. This is consistent with findings published in Liang et al. (2023), which documented FPRs exceeding 50% on TOEFL essays for some detectors. Institutions using AI detectors for academic integrity should explicitly measure FPRs on their own student population’s content type before deployment.
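An institution following this recommendation could estimate its own FPR along the lines below. This is a sketch under stated assumptions: the counts are hypothetical, every sample is known to be human-written, and the Wilson score interval is our choice for expressing sampling uncertainty rather than a method prescribed by the study.

```python
import math
from typing import Tuple


def false_positive_rate(flagged: int, total: int) -> float:
    """Fraction of known human-written samples the detector flagged as AI."""
    if total <= 0 or not 0 <= flagged <= total:
        raise ValueError("require 0 <= flagged <= total and total > 0")
    return flagged / total


def wilson_interval(flagged: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the FPR (z = 1.96 gives roughly 95% coverage)."""
    p = false_positive_rate(flagged, total)
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical local test: 93 of 300 human-written submissions were flagged.
fpr = false_positive_rate(93, 300)
low, high = wilson_interval(93, 300)
print(f"FPR {fpr:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

With only a few hundred samples the interval spans several percentage points, so small differences between detectors on a local test set should be read with caution.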

Study 3 — Voice Deepfake Detection Benchmark 2025

January 2026 · 600 audio clips · 8 TTS systems

Voice deepfake detection is substantially less mature than text detection. We tested 5 voice authenticity tools across 600 audio clips spanning 8 TTS systems (ElevenLabs, PlayHT, Murf, Speechify, Google TTS, Amazon Polly, Microsoft Azure TTS, and OpenAI TTS) and 4 voice cloning frameworks.

Hive Moderation: 88%, FPR 9% (best)

ElevenLabs Detect: 83%, FPR 11%

Resemble Detect: 79%, FPR 14%

All voice detectors degraded significantly on expressive or emotionally varied synthetic voice. Current detectors appear to rely on spectral artifacts that newer high-quality TTS systems are actively eliminating in successive model versions. Voice cloning of specific speakers (vs. generic TTS) reduced accuracy by a further 12–18 percentage points across all tested tools.

Frequently Asked Questions

Can any AI detector reliably detect humanized AI text?

No. Our bypass study found accuracy drops of 20–33 percentage points on humanized text across all tested detectors. The most resistant tool (Proofademic AI) still dropped from 93% to 69% accuracy on humanized content. Detectors that rely primarily on perplexity scoring (GPTZero) are most vulnerable, losing up to 33 percentage points. Provenance-based approaches such as C2PA metadata and SynthID watermarking are not affected by humanization but require adoption at the generation layer.

Why do non-native English speakers get flagged more often?

Non-native English writing tends to use more formulaic sentence structures, more transition phrases, and lower vocabulary diversity than native speaker writing — all statistical characteristics that AI detectors associate with machine-generated text. Our data shows FPRs 2–4× higher for non-native English writers. This is consistent with published academic research. Institutions serving international student populations should measure FPRs on their specific student body before relying on AI detection for integrity enforcement.

Are these studies peer-reviewed?

Our studies are published as independent research, not through formal academic peer review. We document methodology in full and publish all data publicly to enable independent replication. We are aware of and cite relevant peer-reviewed work where applicable (Liang et al. 2023, Mitchell et al. 2023). Our methodology page documents all testing decisions and scope limitations transparently.

How often do you update your research?

We publish new studies quarterly and update existing figures when material changes occur (detector updates, new model releases). All published studies are dated and include a “last reviewed” notation. The current studies reflect testing conducted in January–March 2026.