Open Methodology · March 2026

Benchmark Methodology

How we build the test corpus, measure accuracy, compute signals, and ensure our results are reproducible and independent. Benchmarks without methodology are marketing.

2,400

Total Samples

1,200

Human-Written

1,200

AI-Generated

5

Content Categories

How the Benchmark Corpus Was Built

The benchmark corpus is the foundation of every figure on this site. Getting it right matters more than anything else. We made the following design decisions:

Human-Written Samples (1,200)

240 samples per category across five content types: academic writing, journalism, marketing copy, technical documentation, and creative writing. All human samples meet two criteria: (1) authorship confirmed as a known human writer, and (2) written before widespread LLM adoption (before the November 2022 launch of ChatGPT). Sources include open-access academic archives, pre-2022 news corpora, marketing copy from company blogs with dated publication records, technical documentation from pre-2022 software projects, and published short fiction.

We deliberately excluded content that might have been AI-assisted even before ChatGPT — for example, content from companies known to use early language model tools. When in doubt, we excluded.

AI-Generated Samples (1,200)

300 samples each from four model families: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.1 70B. For each model, we generated 60 samples per content category using category-matched prompts. No post-processing was applied. No instructions to avoid detection were included. We used default sampling parameters for all models.

We regenerated all AI samples in Q1 2026 using the production version of each model that was current at the time. Text from earlier model versions may have different detection fingerprints, so the figures apply to the versions tested.
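The corpus arithmetic above can be checked with a short sketch. The category and model names come from the text; the identifiers (`CATEGORIES`, `MODELS`, `corpus_plan`) are illustrative names, not part of any published tooling.

```python
from itertools import product

# Category and model names as listed in the methodology text.
CATEGORIES = ["academic", "journalism", "marketing", "technical_docs", "creative"]
MODELS = ["GPT-4o", "Claude 3.5 Sonnet", "Gemini 1.5 Pro", "Llama 3.1 70B"]

HUMAN_PER_CATEGORY = 240       # human samples per content category
AI_PER_MODEL_CATEGORY = 60     # AI samples per (model, category) pair


def corpus_plan():
    """Return a dict mapping (source, category) -> planned sample count."""
    plan = {("human", c): HUMAN_PER_CATEGORY for c in CATEGORIES}
    for model, category in product(MODELS, CATEGORIES):
        plan[(model, category)] = AI_PER_MODEL_CATEGORY
    return plan


plan = corpus_plan()
human_total = sum(n for (src, _), n in plan.items() if src == "human")
ai_total = sum(n for (src, _), n in plan.items() if src != "human")
per_model = {m: sum(n for (src, _), n in plan.items() if src == m) for m in MODELS}

print(human_total, ai_total, human_total + ai_total)  # 1200 1200 2400
```

Under this design each content category holds 240 human and 240 AI samples, so per-category figures are computed on balanced classes.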

Sample Length Distribution

Length Bucket	Word Count Range	% of Corpus	Rationale
Short	150–250 words	20%	Social posts, brief summaries, short-form content
Medium	250–400 words	50%	Most common real-world submission length
Long	400–600 words	30%	Longer essays, reports, detailed content

Detection Pipeline

📝

Sample Prep

All samples stripped of author/source metadata. Plain text only. Consistent encoding.

🤖

API Query

Each sample submitted to each detector via API at default threshold settings.

📊

Score Record

Raw probability score, binary classification, and latency logged for every sample.

✅

Metric Compute

Accuracy, FPR, FNR computed per tool, per content category, and overall.
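The four stages can be sketched roughly as follows. This is a minimal illustration and not the production harness: the `Detector` interface, the sample dictionary keys, and the toy detector are all hypothetical, and a real run would call each vendor's API in place of the stand-in function.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical detector interface: takes plain text and returns
# (probability_ai, flagged_as_ai) at the tool's default threshold.
Detector = Callable[[str], Tuple[float, bool]]


@dataclass
class ScoreRecord:
    tool: str
    sample_id: str
    category: str
    is_ai: bool       # ground-truth label
    score: float      # raw probability reported by the detector
    flagged: bool     # detector's binary classification
    latency_s: float  # wall-clock time for the call


def prepare(sample: dict) -> str:
    """Sample prep: keep only the text, dropping author/source metadata."""
    return sample["text"].strip()


def run_pipeline(samples: List[dict], detectors: Dict[str, Detector]) -> List[ScoreRecord]:
    """Submit every sample to every detector and log one record per call."""
    records = []
    for tool, detect in detectors.items():
        for sample in samples:
            text = prepare(sample)
            start = time.perf_counter()
            score, flagged = detect(text)
            latency = time.perf_counter() - start
            records.append(ScoreRecord(tool, sample["id"], sample["category"],
                                       sample["is_ai"], score, flagged, latency))
    return records


def confusion_counts(records: List[ScoreRecord], tool: str) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for one tool, treating 'AI' as the positive class."""
    tp = fp = tn = fn = 0
    for r in records:
        if r.tool != tool:
            continue
        if r.is_ai:
            tp, fn = (tp + 1, fn) if r.flagged else (tp, fn + 1)
        else:
            fp, tn = (fp + 1, tn) if r.flagged else (fp, tn + 1)
    return tp, fp, tn, fn


# Toy stand-in for a vendor API, used only to exercise the loop.
def toy_detector(text: str) -> Tuple[float, bool]:
    return (0.9, True) if "delve" in text.lower() else (0.1, False)


samples = [
    {"id": "h1", "category": "journalism", "is_ai": False, "text": "  The council met on Tuesday. "},
    {"id": "a1", "category": "academic", "is_ai": True, "text": "Let us delve into the findings."},
]
records = run_pipeline(samples, {"toy": toy_detector})
counts = confusion_counts(records, "toy")
```

Filtering `records` by `category` before counting gives the per-category breakdown.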

Metric Definitions

Metric	Formula	What It Measures
Overall Accuracy	(TP + TN) / 2,400	Correct classifications across all 2,400 samples
False Positive Rate	FP / 1,200	Human texts incorrectly flagged as AI — the wrongful accusation rate
False Negative Rate	FN / 1,200	AI texts incorrectly passed as human — the missed detection rate
API Latency	Median of 100 calls	Response time on a 200-word sample from a fixed datacenter region (US East)
Precision	TP / (TP + FP)	Of texts flagged as AI, what fraction actually are
Recall	TP / (TP + FN)	Of actual AI texts, what fraction were caught
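As a worked illustration, the formulas in the table can be computed from a confusion matrix as below. The function name `compute_metrics` and the example counts are made up for illustration; they are not benchmark results.

```python
def compute_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the table's metrics, treating 'AI' as the positive class.

    tp: AI texts flagged as AI        fn: AI texts passed as human
    tn: human texts passed as human   fp: human texts flagged as AI
    """
    human_total = tn + fp   # 1,200 in the benchmark corpus
    ai_total = tp + fn      # 1,200 in the benchmark corpus
    total = human_total + ai_total

    def ratio(num: int, den: int) -> float:
        # Guard against empty denominators (e.g. a tool that flags nothing).
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, total),
        "fpr": ratio(fp, human_total),
        "fnr": ratio(fn, ai_total),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


# Illustrative counts only: 1,200 human and 1,200 AI samples.
metrics = compute_metrics(tp=1100, fp=50, tn=1150, fn=100)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Recall and False Negative Rate share the same denominator, so they always sum to 1.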

Signal Analysis (Independent of Detectors)

In addition to measuring detector performance, we independently compute statistical signals on every sample in the corpus. This lets us understand detection from first principles rather than treating detectors as black boxes.

Perplexity

Sentence-level perplexity (PP), computed from token probabilities under a GPT-2 reference model. In our corpus, AI text averages PP=38 and human text averages PP=87. The two distributions overlap heavily in the 60–70 PP range.
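For reference, the standard definition of perplexity is the exponential of the average negative log-probability per token. The sketch below assumes the per-token natural-log probabilities have already been obtained from a reference model; the function name is illustrative, and the exact tokenization and averaging used for the benchmark are not specified here.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    PP = exp(-(1/N) * sum(log p_i)), the standard definition.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)


# If every token had probability 1/2, the perplexity is 2.
pp_half = perplexity([math.log(0.5)] * 4)

# A sentence whose tokens each had probability 1/38 has PP = 38.
pp_38 = perplexity([math.log(1 / 38)] * 10)
```

Lower perplexity means the reference model found the text more predictable.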

Burstiness

Coefficient of variation (standard deviation divided by mean) of sentence perplexities within a sample. Human writing averages a CV of 0.71; AI text averages 0.29. This is the most discriminating single signal in our analysis.
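A minimal sketch of the coefficient-of-variation calculation, given a list of per-sentence perplexities. Whether the benchmark uses the population or the sample standard deviation is not stated; the population form is assumed here, and the function name is illustrative.

```python
import statistics
from typing import Sequence


def burstiness(sentence_perplexities: Sequence[float]) -> float:
    """Coefficient of variation: standard deviation / mean.

    Assumption: population standard deviation (pstdev).
    """
    if len(sentence_perplexities) < 2:
        raise ValueError("need at least two sentences")
    mean = statistics.fmean(sentence_perplexities)
    if mean == 0:
        raise ValueError("mean perplexity must be non-zero")
    return statistics.pstdev(sentence_perplexities) / mean


uniform = burstiness([50.0, 50.0, 50.0])  # no variation between sentences
varied = burstiness([20.0, 80.0])         # mean 50, deviation 30
```

Evenly paced text scores near 0; text that mixes predictable and surprising sentences scores higher.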

Type-Token Ratio

Unique words / total words. Human average: 0.71. AI average: 0.57. Diminishing discriminative power as text length increases.
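A minimal sketch of the ratio. The tokenization rule (lowercased runs of letters, digits, and apostrophes) is an assumption made for illustration; the benchmark's actual tokenizer is not specified.

```python
import re


def type_token_ratio(text: str) -> float:
    """Unique words divided by total words.

    Assumption: words are lowercased runs of letters, digits, and apostrophes.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        raise ValueError("text contains no words")
    return len(set(tokens)) / len(tokens)


# Six words, five distinct ("the" repeats).
ttr = type_token_ratio("The cat sat on the mat.")
```

Because common words repeat more as a text grows, the ratio tends to fall with length, so comparisons are most meaningful between samples of similar length.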

Transition Phrase Density

Frequency of 45 catalogued AI-associated phrases per 1,000 words. Human average: 1.2. AI average: 6.7. Strong signal; easily defeated by humanizers.
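A minimal sketch of a per-1,000-word phrase density. The four phrases below are hypothetical placeholders, not the 45-phrase catalogue, and the word-counting rule is an assumption.

```python
import re
from typing import Iterable

# Hypothetical placeholder phrases; the real catalogue has 45 entries.
EXAMPLE_PHRASES = ("furthermore", "moreover", "in conclusion", "it is important to note")


def phrase_density(text: str, phrases: Iterable[str] = EXAMPLE_PHRASES) -> float:
    """Occurrences of catalogued phrases per 1,000 words."""
    lowered = text.lower()
    words = re.findall(r"[a-z0-9']+", lowered)
    if not words:
        raise ValueError("text contains no words")
    hits = sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in phrases
    )
    return 1000 * hits / len(words)


# Two hits in nine words.
density = phrase_density("Moreover, the results hold. In conclusion, the method works.")
```

On samples of only a few hundred words a single phrase moves the figure by several points, so very short texts give noisy values.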

Independence Statement

We have no affiliate relationships, revenue-sharing arrangements, or commercial partnerships with any tool reviewed or benchmarked on this site. API access was purchased at standard commercial rates from our own funds. No vendor was notified before or during testing. No vendor had any ability to influence testing conditions, thresholds, or how we report results.

If a vendor believes a figure is materially incorrect, they may contact us with specific, documented evidence. We will verify and, if warranted, update figures with a published correction notice. We have issued two correction notices since launch; both are documented with dates and the nature of the correction.

Corrections policy: We correct errors. We do not remove unflattering-but-accurate results.

Frequently Asked Questions

Why 2,400 samples? Is that enough?

A corpus of 2,400 samples provides enough statistical power to detect meaningful accuracy differences between tools at the 5% significance level. For a binary classification task, this gives 95% confidence intervals of roughly ±2 percentage points on reported overall accuracy figures. We consider this sufficient for ranking purposes. Larger corpora would narrow the confidence intervals further, but at significant cost and with diminishing returns on ranking reliability.
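The ±2% figure can be reproduced with the normal-approximation interval for a proportion. This is a sketch under that assumption; the exact interval method behind the published figures is not stated, and the function name is illustrative.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion.

    z = 1.96 corresponds to a 95% interval.
    """
    if not 0.0 <= p <= 1.0 or n <= 0:
        raise ValueError("p must be in [0, 1] and n must be positive")
    return z * math.sqrt(p * (1.0 - p) / n)


# Worst case (p = 0.5) on the full corpus: about 0.020, i.e. +/- 2 points.
moe_overall = margin_of_error(0.5, 2400)

# Rates computed on 1,200 samples (FPR, FNR) have wider worst-case
# intervals: about 0.028.
moe_per_class = margin_of_error(0.5, 1200)

# Intervals narrow as accuracy moves away from 50%.
moe_at_90 = margin_of_error(0.9, 2400)

print(f"{moe_overall:.4f} {moe_per_class:.4f} {moe_at_90:.4f}")
```

Per-category figures rest on fewer samples still, so their intervals are wider than the overall one.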

Why don’t you publish the test corpus?

Publishing the corpus would allow vendors to train specifically against it, defeating the purpose of independent testing. We rotate a portion of the corpus with each quarterly update and can confirm that our figures hold on new samples drawn from the same distribution. We provide enough methodology detail for other researchers to construct comparable corpora independently.

Are your latency measurements representative?

Latency figures are median values from 100 API calls on a 200-word sample from a fixed datacenter region (US East). They are representative of median API performance but will vary by text length, geographic location, time of day, and current server load. We report medians rather than means to reduce the influence of outlier slow responses.
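A small illustration of why the median is the more robust summary. The latency values are invented for the example and are not measurements.

```python
import statistics

# Invented latencies in milliseconds; one call hit a slow outlier.
latencies_ms = [210, 220, 215, 225, 1900]

median_ms = statistics.median(latencies_ms)  # middle value after sorting
mean_ms = statistics.fmean(latencies_ms)     # pulled upward by the outlier

print(median_ms, mean_ms)  # 220 554.0
```

A single slow response more than doubles the mean here while leaving the median unchanged.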

Do you test new tools on request?

We test tools we identify as significant to the market. We do not accept payment for testing inclusion and do not guarantee testing any particular tool. If a tool you’re looking for is not in our benchmark, contact us and we will consider it for the next quarterly update.