Transparency

AI Detection Accuracy — How We Measure It

When we say "99.2% accuracy" we mean a specific thing on a specific test set. This page explains exactly what we measure, on what samples, and where the tool's limits are.

Last updated: May 2026 · We review this page quarterly

What 99.2% means (and doesn't)

The 99.2% figure is the headline accuracy from our internal benchmark — the percentage of samples in our internal test set where the detector's verdict matched the ground-truth label.

What it does mean: on the test set we run before each model release, the detector correctly identifies AI vs human content the vast majority of the time, with the strongest performance on raw, unedited AI output from ChatGPT, Claude, and Gemini.

What it doesn't mean: it is not an independent third-party benchmark, it is not validated against the public datasets some academic papers use, and it is not equally accurate across every content type. We discuss the known weak spots below — we'd rather you know them up front than be surprised in production.

How we run the benchmark

  1. Sample collection — we maintain an internal test set of labelled text samples across three categories: pure AI (no human editing), pure human (no AI involvement), and mixed (human drafts with AI assistance, or AI drafts with human rewrites). Each sample carries a known ground-truth label.
  2. Blind scoring — every sample is submitted to the same scoring pipeline a public user would hit. No special prompt engineering, no per-sample model tuning. The detector returns an AI percentage and Humanization Score for each.
  3. Comparison to ground truth — verdicts are bucketed into correctly classified (matches the label within a margin), borderline (verdict in the 40–60% band on a pure sample), and misclassified (verdict on the opposite side of 50% from the label). The bucketing logic is sketched in code after this list.
  4. Pre-release gate — accuracy on this test set has to clear our internal threshold before a model version ships to production. The 99.2% figure comes from the most recent run.
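
To make step 3 concrete, here is a minimal sketch of the bucketing and accuracy calculation in Python. The Sample type, field names, and thresholds are illustrative assumptions rather than our internal pipeline code, and mixed samples (which are judged against a margin rather than a 50% cut-off) are left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    label: str         # ground truth: "ai" or "human" (mixed omitted in this sketch)
    ai_percent: float  # detector verdict, 0-100

def bucket(sample: Sample) -> str:
    """Bucket a pure sample's verdict against its ground-truth label."""
    if 40 <= sample.ai_percent <= 60:
        return "borderline"
    predicted = "ai" if sample.ai_percent > 50 else "human"
    return "correct" if predicted == sample.label else "misclassified"

def benchmark_accuracy(samples: list[Sample]) -> float:
    """Share of samples whose verdict matches the label."""
    correct = sum(1 for s in samples if bucket(s) == "correct")
    return correct / len(samples)

if __name__ == "__main__":
    test_set = [
        Sample("ai", 97.0),     # unedited chatbot output, correctly flagged
        Sample("human", 3.5),   # cover letter, correctly passed
        Sample("human", 55.0),  # pure human scoring in the 40-60% band -> borderline
        Sample("ai", 22.0),     # lightly humanized AI that slipped through -> misclassified
    ]
    for s in test_set:
        print(f"{s.label:>6} @ {s.ai_percent:5.1f}% -> {bucket(s)}")
    print(f"toy accuracy: {benchmark_accuracy(test_set):.1%}")
```

The pre-release gate in step 4 is then just a comparison of that accuracy figure against the internal threshold before a model version ships.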

What's in the test set

Pure AI

Unedited outputs from ChatGPT (GPT-4o), Claude 3.5 Sonnet, and Google Gemini Pro — academic essays, blog posts, emails, LinkedIn posts, product descriptions.

Pure Human

Human-authored writing across registers — classic literature excerpts, journalistic articles, academic abstracts, Reddit-style opinions, casual personal writing, cover letters.

Mixed

Hybrid samples — human drafts polished with AI suggestions, AI drafts with human rewrites, AI summaries pasted onto human prose, and lightly humanized AI text.

Sample sizes for each category are deliberately moderate (low hundreds, not thousands) so the test set stays curated and reviewable rather than scraped. We rotate samples each release to avoid the detector overfitting to fixed examples.

Known limitations

No AI detector in 2026 is right 100% of the time, and ours is no exception. Reviews by independent testers (most recently in May 2026) have surfaced patterns where our detector either over-flags or under-flags certain kinds of writing — we publish them here rather than hide them.

Classic literature can score as partially AI

Hemingway or Dickens passages occasionally score 30–40% AI — the prose's controlled rhythm and balanced sentence structure trip signals the model associates with AI output. If you're checking your own work and it scores in this range, that doesn't necessarily mean it reads as AI to a human.

Formal academic abstracts can over-flag

Researcher-voice writing with precise terminology, hedged conclusions, and even sentence lengths can score around 50/50. The same author writing more conversationally will usually score as human. We do not recommend our detector as the sole signal for academic-integrity decisions.

Mixed AI+human content skews toward AI verdict

If you wrote a draft yourself and used AI for one paragraph, the overall verdict tends to lean AI — even when the AI-influenced portion is a minority of the text. The per-sentence breakdown is more useful than the overall percentage for editor/freelancer workflows.
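
For that workflow, a short sketch of how a per-sentence breakdown can be used follows. The response shape and field names are assumptions for illustration, not the actual TextSight API schema.

```python
# Hypothetical per-sentence breakdown; field names are illustrative only.
sentences = [
    {"text": "I drafted this opening paragraph myself.",       "ai_probability": 0.08},
    {"text": "This transition was rewritten by an assistant.", "ai_probability": 0.81},
    {"text": "The closing argument is all mine.",              "ai_probability": 0.12},
]

flagged = [s for s in sentences if s["ai_probability"] >= 0.5]

# The overall verdict may lean AI, but the useful output for an editor is
# which sentences carry the signal, not the single headline percentage.
print(f"{len(flagged)}/{len(sentences)} sentences flagged as likely AI")
for s in flagged:
    print("  review:", s["text"])
```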

Short samples (under 100 words) are harder to score

The fewer words a detector has to work with, the less statistical signal it has. Our detector requires a 100-word minimum for a reason — below that, results are unreliable across every detector on the market, not just ours.
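
If you are batching text into the detector, a trivial guard like the sketch below (assuming a simple whitespace word count) avoids submitting samples that are too short to score reliably.

```python
MIN_WORDS = 100  # the detector's documented minimum sample length

def long_enough_to_score(text: str) -> bool:
    """Rough whitespace word count as a pre-check before submitting a sample."""
    return len(text.split()) >= MIN_WORDS

draft = "Short snippets carry too little statistical signal to score reliably."
if not long_enough_to_score(draft):
    print(f"only {len(draft.split())} words; add more text before scanning")
```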

Caveats & responsible use

A few things worth saying out loud — for anyone using a TextSight score in a high-stakes decision.

Last benchmark run: May 2026

Re-run before every model release. We publish the date here so you can tell whether the score on this page reflects the production model you're scanning against.

False-positive rate: coming Q3 2026

Split FP/FN rates per content category (academic, casual, journalistic, mixed) will be published alongside the independent third-party benchmark — both wins and losses.

Confidence band: flagged in-product

Every scan returns a confidence indicator. Low-confidence scores (short samples, mixed content, unusual genres) are visually marked so you know when not to rely on the number alone.

⚠ AI detection is probabilistic

No AI detector — ours, GPTZero, Originality, Copyleaks, or any other — produces a binary truth. The score is a probability based on language signals. A high score is strong evidence that text reads as AI-generated; it is not proof that a specific person used AI. Never use a TextSight score as the sole basis for an academic-misconduct decision, an employment decision, or any consequence that affects a person. Pair the score with human review, written work history, and a conversation. Sentence-level highlights are designed to be a starting point for a discussion, not a verdict.

Where to use the score, and where not to

  • Use it for: editorial review, ghost-writer audits, content-quality screening, agency client deliverables, self-checking your own writing before submission, and as one signal in a broader integrity review.
  • Don't use it for: automatic discipline, automated grading consequences, hiring/firing decisions, or any workflow where a single threshold drives a consequential outcome without a human in the loop (see the triage sketch after this list).
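
As an illustration of what "a human in the loop" can look like in practice, here is a minimal triage sketch. The field names, thresholds, and messages are assumptions; the point is that the output is always a routing decision for a person, never an automatic consequence.

```python
def triage(scan: dict) -> str:
    """Route a scan result to a human workflow; never trigger a consequence directly."""
    score = scan["ai_percent"]       # overall AI percentage (hypothetical field name)
    confidence = scan["confidence"]  # in-product confidence indicator (hypothetical)

    if confidence == "low":
        return "do not rely on the number alone; rescan with a longer or cleaner sample"
    if score >= 80:
        return "queue for human review: strong signal, still not proof"
    if score >= 40:
        return "borderline: read the sentence-level highlights with the author"
    return "no action"

print(triage({"ai_percent": 86.0, "confidence": "high"}))
```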

What's next

  • Independent benchmark (Q3 2026): we're submitting the detector for an independent third-party evaluation against a publicly disclosed dataset. We'll publish the full results — wins and losses — when it's complete.
  • Per-category accuracy breakdown: we'll add a table on this page showing true-positive and false-positive rates split by content category (academic, casual, journalistic, mixed) so you can judge fit for your use case.
  • False-positive reporting: if our detector misclassifies a piece of writing you authored yourself, please tell us. Reported samples (with permission) become the next test-set additions and directly improve the model.

Try it on your own writing

The best way to judge accuracy is to run text you wrote (or didn't) through the detector and see what it says.

Open the AI Detector →