How the Benchmark Corpus Was Built
The benchmark corpus is the foundation of every figure on this site. Getting it right matters more than anything else. We made the following design decisions:
Human-Written Samples (1,200)
240 samples per category across five content types: academic writing, journalism, marketing copy, technical documentation, and creative writing. All human samples meet two criteria: (1) confirmed authorship by a known human writer, and (2) written before widespread LLM adoption (prior to the November 2022 launch of ChatGPT). Sources include open-access academic archives, pre-2022 news corpora, marketing copy from company blogs with dated publication records, technical documentation from pre-2022 software projects, and published short fiction.
We deliberately excluded content that might have been AI-assisted even before ChatGPT — for example, content from companies known to use early language model tools. When in doubt, we excluded.
AI-Generated Samples (1,200)
300 samples each from four model families: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.1 70B. For each model, we generated 60 samples per content category using category-matched prompts. No post-processing was applied. No instructions to avoid detection were included. We used default sampling parameters for all models.
We re-generated all AI samples in Q1 2026 using the current production versions of each model. Earlier model versions may have different detection fingerprints.
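The corpus composition described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the generation pipeline itself; the constant and function names are hypothetical and chosen only for this illustration.

```python
from itertools import product

# Categories and model families as listed in the text.
CATEGORIES = [
    "academic writing",
    "journalism",
    "marketing copy",
    "technical documentation",
    "creative writing",
]
MODELS = ["GPT-4o", "Claude 3.5 Sonnet", "Gemini 1.5 Pro", "Llama 3.1 70B"]

AI_SAMPLES_PER_CELL = 60      # samples per (model, category) pair
HUMAN_PER_CATEGORY = 240      # human-written samples per category


def ai_generation_plan():
    """Return the number of samples to generate for each (model, category) pair."""
    return {(model, cat): AI_SAMPLES_PER_CELL for model, cat in product(MODELS, CATEGORIES)}


plan = ai_generation_plan()
ai_total = sum(plan.values())                          # 4 models x 5 categories x 60
human_total = HUMAN_PER_CATEGORY * len(CATEGORIES)     # 5 categories x 240
per_model = AI_SAMPLES_PER_CELL * len(CATEGORIES)      # 300 per model family

print(ai_total, human_total, per_model)
```

Both halves come to 1,200 samples, so each content category contributes 240 human and 240 AI samples and the corpus is balanced by class and by category.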
Sample Length Distribution
| Length Bucket | Word Count Range | % of Corpus | Rationale |
|---|---|---|---|
| Short | 150–250 words | 20% | Social posts, brief summaries, short-form content |
| Medium | 250–400 words | 50% | Most common real-world submission length |
| Long | 400–600 words | 30% | Longer essays, reports, detailed content |
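
As a rough sketch, the length distribution translates into the target counts below for a 2,400-sample corpus. The table's ranges share their boundary values (250 and 400 words), and the text does not say which bucket those belong to; this sketch assumes a boundary value falls into the longer bucket. That rule and all names here are assumptions for illustration.

```python
# (name, low, high, percent of corpus) as given in the table.
BUCKETS = [
    ("Short", 150, 250, 20),
    ("Medium", 250, 400, 50),
    ("Long", 400, 600, 30),
]
CORPUS_SIZE = 2400


def bucket_for(word_count):
    """Return the bucket name for a word count, or None if out of range.

    Assumption: a shared boundary (250 or 400) is assigned to the longer
    bucket; 600 is included in "Long".
    """
    for index, (name, low, high, _percent) in enumerate(BUCKETS):
        is_last = index == len(BUCKETS) - 1
        if low <= word_count < high or (is_last and word_count == high):
            return name
    return None


def target_counts(corpus_size=CORPUS_SIZE):
    """Number of samples each bucket should contain at the stated percentages."""
    return {name: corpus_size * percent // 100 for name, _low, _high, percent in BUCKETS}


counts = target_counts()
print(counts)
print(bucket_for(200), bucket_for(250), bucket_for(600), bucket_for(100))
```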
Detection Pipeline
Metric Definitions
| Metric | Formula | What It Measures |
|---|---|---|
| Overall Accuracy | (TP + TN) / 2,400 | Correct classifications across all 2,400 samples |
| False Positive Rate | FP / 1,200 | Human texts incorrectly flagged as AI — the wrongful accusation rate |
| False Negative Rate | FN / 1,200 | AI texts incorrectly passed as human — the missed detection rate |
| API Latency | Median of 100 calls | Response time on a 200-word sample from a fixed datacenter region |
| Precision | TP / (TP + FP) | Of texts flagged as AI, what fraction actually are |
| Recall | TP / (TP + FN) | Of actual AI texts, what fraction were caught |
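
The formulas in the table can be expressed as a short function over confusion-matrix counts, with "AI-generated" treated as the positive class. This is a minimal sketch; the class and function names are hypothetical, and the example counts are invented to show the arithmetic and are not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # AI text flagged as AI
    fp: int  # human text flagged as AI
    tn: int  # human text passed as human
    fn: int  # AI text passed as human


def metrics(c):
    """Compute the table's classification metrics from confusion counts."""
    total = c.tp + c.fp + c.tn + c.fn
    flagged = c.tp + c.fp
    return {
        "accuracy": (c.tp + c.tn) / total,
        "false_positive_rate": c.fp / (c.fp + c.tn),   # denominator: all human texts
        "false_negative_rate": c.fn / (c.fn + c.tp),   # denominator: all AI texts
        "precision": c.tp / flagged if flagged else None,
        "recall": c.tp / (c.tp + c.fn),
    }


# Invented counts for illustration only: 1,200 AI and 1,200 human samples.
example = Confusion(tp=1080, fp=60, tn=1140, fn=120)
result = metrics(example)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With a balanced corpus, the denominators `fp + tn` and `fn + tp` are both 1,200, which matches the fixed denominators in the table. Recall is always 1 minus the false negative rate.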
Signal Analysis (Independent of Detectors)
In addition to measuring detector performance, we independently compute statistical signals on every sample in the corpus. This lets us understand detection from first principles rather than treating detectors as black boxes.
Perplexity (PP). Sentence-level perplexity, derived from token probabilities under a GPT-2 reference model. AI text averages PP = 38 in our corpus; human text averages PP = 87. The two groups overlap heavily at 60–70 PP.
Perplexity variation. Coefficient of variation (CV) of sentence perplexities. Human writing averages CV = 0.71; AI text averages CV = 0.29. This is the most discriminating single signal in our analysis.
Type-token ratio. Unique words / total words. Human average: 0.71. AI average: 0.57. Its discriminative power diminishes as text length increases.
AI-associated phrase frequency. Occurrences of 45 catalogued AI-associated phrases per 1,000 words. Human average: 1.2. AI average: 6.7. A strong signal, but easily defeated by humanizers.
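Three of these signals are simple enough to sketch with the standard library. The sketch below assumes per-sentence perplexities have already been computed by a reference model and are passed in as numbers. The tokenizer, the choice of population standard deviation, and the two example phrases are all assumptions for illustration; the catalogue of 45 phrases is not reproduced here.

```python
import re
import statistics


def words(text):
    """Lowercase word tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def coefficient_of_variation(sentence_perplexities):
    """Standard deviation divided by mean of per-sentence perplexities.

    Assumption: population standard deviation; the text does not specify.
    """
    mean = statistics.fmean(sentence_perplexities)
    return statistics.pstdev(sentence_perplexities) / mean


def type_token_ratio(text):
    """Unique words divided by total words."""
    tokens = words(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def phrase_rate_per_1000(text, phrases):
    """Occurrences of the catalogued phrases per 1,000 words."""
    tokens = words(text)
    if not tokens:
        return 0.0
    normalized = " ".join(tokens)
    hits = sum(
        len(re.findall(r"\b" + re.escape(phrase.lower()) + r"\b", normalized))
        for phrase in phrases
    )
    return 1000 * hits / len(tokens)


# Hypothetical phrases, standing in for the real catalogue.
EXAMPLE_PHRASES = ["delve into", "it is important to note"]
sample = "We delve into the data. It is important to note that we delve into details."

cv = coefficient_of_variation([10, 20, 30])
ttr = type_token_ratio("the cat and the dog")
rate = phrase_rate_per_1000(sample, EXAMPLE_PHRASES)
print(round(cv, 3), ttr, rate)
```

On very short texts the per-1,000-word rate is inflated by the small denominator, and the type-token ratio is inflated because few words have had a chance to repeat. Both are reasons to compare such signals only across samples of similar length.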
Independence Statement
We have no affiliate relationships, revenue-sharing arrangements, or commercial partnerships with any tool reviewed or benchmarked on this site. API access was purchased at standard commercial rates from our own funds. No vendor was notified before or during testing. No vendor had any ability to influence testing conditions, thresholds, or how we report results.
If a vendor believes a figure is materially incorrect, they may contact us with specific, documented evidence. We will verify and, if warranted, update figures with a published correction notice. We have issued two correction notices since launch; both are documented with dates and the nature of the correction.
Corrections policy: We correct errors. We do not remove unflattering-but-accurate results.
Frequently Asked Questions
Why 2,400 samples? Is that enough?
A corpus of 2,400 samples provides enough statistical power to detect meaningful accuracy differences between tools at the 5% significance level. For a binary classification task, this gives 95% confidence intervals of roughly ±2 percentage points on reported accuracy figures in the worst case (accuracy near 50%), and narrower intervals for more accurate tools. We consider this sufficient for ranking purposes. Larger corpora would narrow the confidence intervals further, but at significant cost and with diminishing returns on ranking reliability.
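The interval width can be reproduced with the normal approximation for a binomial proportion. This is a minimal sketch under that assumption; the text does not state which interval method was used, and the function name is hypothetical.

```python
import math


def accuracy_margin(n, p=0.5, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion.

    n: number of samples, p: observed accuracy, z: 1.96 for 95% confidence.
    p = 0.5 is the worst case (widest interval).
    """
    return z * math.sqrt(p * (1 - p) / n)


worst_case = accuracy_margin(2400)            # about 0.020, i.e. +/- 2 points
at_90_percent = accuracy_margin(2400, p=0.9)  # about 0.012
half_corpus = accuracy_margin(1200)           # about 0.028

print(f"{worst_case:.4f} {at_90_percent:.4f} {half_corpus:.4f}")
```

The margin shrinks with the square root of the sample count, so halving it would take about four times as many samples. Figures computed on only one half of the corpus, such as a false positive rate over 1,200 human texts, carry the wider interval for n = 1,200 at the same underlying rate.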
Why don’t you publish the test corpus?
Publishing the corpus would allow vendors to train specifically against it, defeating the purpose of independent testing. We rotate a portion of the corpus with each quarterly update and can confirm that our figures hold on new samples drawn from the same distribution. We provide enough methodology detail for other researchers to construct comparable corpora independently.
Are your latency measurements representative?
Latency figures are median values from 100 API calls on a 200-word sample from a fixed datacenter region (US East). They are representative of median API performance but will vary by text length, geographic location, time of day, and current server load. We report medians rather than means to reduce the influence of outlier slow responses.
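A small example shows why the median is the more stable summary. The latency values below are invented for illustration, and the function name is hypothetical.

```python
import statistics


def summarize_latency(samples_ms):
    """Return median and mean latency for a list of response times in ms."""
    return {
        "median_ms": statistics.median(samples_ms),
        "mean_ms": statistics.fmean(samples_ms),
    }


# Invented data: 99 typical calls at 400 ms plus one slow outlier at 8,000 ms.
calls = [400] * 99 + [8000]
summary = summarize_latency(calls)
print(summary)
```

A single slow response out of 100 moves the mean from 400 ms to 476 ms while the median stays at 400 ms.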
Do you test new tools on request?
We test tools we identify as significant to the market. We do not accept payment for testing inclusion and do not guarantee testing any particular tool. If a tool you’re looking for is not in our benchmark, contact us and we will consider it for the next quarterly update.