qed-bench: benchmarking U22A8 scoring models against task-appropriate baselines

We trained scoring models on four content-judgment tasks — holistic essay quality (ASAP 2.0), SMS spam classification (UCI), AI-vs-human authorship detection (HC3 and RAID), and LLM authorship attribution across eleven LLMs — and paired each with a task-appropriate baseline: trained human raters, gold class labels, or an eight-model LLM-as-judge panel served via Amazon Bedrock.

Across the four tasks, the scoring models match or exceed the strongest available baseline at one to two orders of magnitude lower cost and latency, with deterministic output. Where the result breaks, most visibly on cross-distribution AI-vs-human detection, the failure mode is concentrated and easy to state. Notebooks, model definitions, and per-judge artifacts are at github.com/u22a8/qed-bench.

  • ASAP 2.0 · essay quality: ρ = 0.815 vs human raters; matches Claude Opus 4.7 (ρ = 0.813); other panel models 0.637 to 0.724
  • UCI SMS Spam: accuracy 0.982; F1 = 0.934, AUC-ROC = 0.994; 20 errors on 1,115 held-out messages
  • whoami · AI vs human: in-distribution AUC ~0.98; cross-distribution AUC drops to ~0.74
  • whatami · LLM attribution: top-1 0.493 on 11-way classification (chance = 0.091); top-2 = 0.676; per-family top-1 = 0.543

Each scoring model is the U22A8 SDK's primitive for content judgment, fit from a small set of labelled examples and served deterministically: the same input always returns the same score, σ = 0. The same scoring API serves all four models; only the trait definitions and training data differ. Test sets are held out at training time, and baselines are matched per task. ASAP uses an LLM-as-judge panel because the only human ratings available are the ground truth being predicted, not a separate baseline; sms-spam uses gold class labels directly; whoami and whatami report AUC and class-accuracy metrics on held-out test sets. All inputs are scored via the public, unauthenticated POST /m/{handle} endpoint that any reader can call.
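The determinism claim (σ = 0) can be checked by scoring one input several times and measuring the spread. The following is a minimal sketch, not the SDK's client: the host in `BASE_URL`, the JSON body shape `{"text": ...}`, and the helper names are assumptions for illustration; only the `POST /m/{handle}` path comes from the text above.

```python
import json
import statistics
import urllib.request
from typing import Callable

# Hypothetical placeholder: the page does not state the API host.
BASE_URL = "https://api.example.invalid"


def build_request(handle: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST /m/{handle} request.

    The JSON body shape {"text": ...} is an assumption, not a documented schema.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/m/{handle}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score_spread(score_fn: Callable[[str], float], text: str, runs: int = 5) -> float:
    """Population standard deviation of repeated scores for a single input."""
    scores = [score_fn(text) for _ in range(runs)]
    return statistics.pstdev(scores)
```

Passing the scorer in as a callable keeps the check independent of how the request is actually sent; a deterministic scorer should give a spread of exactly 0.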

Holistic essay quality on ASAP 2.0

The ASAP 2.0 corpus contains source-based persuasive essays from U.S. school students, holistically graded 1 through 6 by trained human raters. Spearman ρ against the human grades is the conventional metric in published essay-scoring work. We trained one scoring trait on a training partition of the corpus and evaluated it on 1,047 held-out test essays, alongside eight LLM-as-judge baselines run via Bedrock with thinking-mode prompts.
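For readers unfamiliar with the metric, Spearman ρ is the Pearson correlation of the two rank vectors, with tied values sharing an average rank. Ties matter here because grades take only six values. The sketch below is a plain-Python illustration with hypothetical function names, not the code in the repository's notebooks.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(human: Sequence[float], judge: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(human), average_ranks(judge))
```

Because only ranks enter the calculation, a judge that scores on a 0–100 scale and one that scores 1–6 are compared on equal footing.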

Figure 1. Spearman ρ between each judge's score and the human raters' holistic grades, on 1,047 ASAP 2.0 test essays. Higher is better. The scoring model is listed first; LLM judges are sorted by ρ.

  judge                 ρ vs human raters
  scoring model         0.815
  claude opus 4.7       0.813
  claude sonnet 4.6     0.724
  deepseek v3.2         0.708
  llama 4 maverick      0.687
  gemma 3 27b           0.659
  mistral large 3       0.658
  qwen3 32b             0.653
  claude haiku 4.5      0.637

We measured ρ = 0.815 for the scoring model against the human grades. Claude Opus 4.7 reached ρ = 0.813 on the same essays, a difference of 0.002 and effectively a tie at the top. The other seven models in the panel scored between ρ = 0.637 (Claude Haiku 4.5) and ρ = 0.724 (Claude Sonnet 4.6). Per-essay cost was $3.0 × 10⁻⁵ for the scoring model versus $5.8 × 10⁻³ for Opus 4.7 (Bedrock on-demand pricing, input + output tokens), and per-essay latency was 204 ms versus 3,828 ms wall-clock. Scoring is deterministic: re-running the same input returns the same score, σ = 0.
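The "one to two orders of magnitude" summary follows directly from the per-essay figures above. A short arithmetic check, using only the numbers quoted in this section (variable names are ours):

```python
import math

# Per-essay figures quoted in the text.
scoring_cost_usd, opus_cost_usd = 3.0e-5, 5.8e-3
scoring_latency_ms, opus_latency_ms = 204, 3828

cost_ratio = opus_cost_usd / scoring_cost_usd          # roughly 193x
latency_ratio = opus_latency_ms / scoring_latency_ms   # roughly 18.8x

# Orders of magnitude are the base-10 logarithms of the ratios.
cost_orders = math.log10(cost_ratio)        # a little over 2
latency_orders = math.log10(latency_ratio)  # a little over 1

print(f"cost: {cost_ratio:.0f}x ({cost_orders:.2f} orders of magnitude)")
print(f"latency: {latency_ratio:.1f}x ({latency_orders:.2f} orders of magnitude)")
```

Cost accounts for the "two orders" end of the range and latency for the "one order" end.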

Binary spam classification on the UCI SMS Spam Collection

The UCI SMS Spam Collection contains 5,574 SMS messages with gold spam/ham labels. We trained on a training partition and evaluated on 1,115 held-out test messages (149 spam, 966 ham). The scoring model emits a 0–100 spam score per message; we report metrics at a decision threshold of 50 and the score distribution across both classes.

Figure 2. Score distribution on the held-out test set (ham n = 966, spam n = 149), plotted as within-class proportion per 5-point bin of the 0–100 spam score so the two classes sit on a comparable axis (spam is ~6× rarer than ham). The decision threshold at 50 is marked with a dashed orange line. Annotations: 71% of ham scores fall below 10; 64% of spam scores are 95 or higher.
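The within-class normalisation used in Figure 2 can be sketched as follows: each class's histogram is divided by that class's own count, so a rare class is not flattened by a common one. This is an illustrative sketch with a hypothetical function name; how the notebooks handle the top edge (a score of exactly 100) is an assumption here.

```python
from typing import Sequence


def within_class_proportions(
    scores: Sequence[float], bin_width: int = 5, max_score: int = 100
) -> list[float]:
    """Share of one class's scores falling in each bin of width `bin_width`.

    Assumption: a score equal to `max_score` is counted in the last bin.
    """
    n_bins = max_score // bin_width
    counts = [0] * n_bins
    for s in scores:
        idx = min(int(s // bin_width), n_bins - 1)
        counts[idx] += 1
    total = len(scores)
    return [c / total for c in counts]


# Usage: compute one histogram per class and plot them on a shared axis.
# ham_hist = within_class_proportions(ham_scores)
# spam_hist = within_class_proportions(spam_scores)
```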

At a threshold of 50, we observed 1,095 of 1,115 messages classified correctly: 13 false positives (ham mislabelled as spam) and 7 false negatives (spam mislabelled as ham). Aggregate metrics: accuracy 0.982, precision 0.916, recall 0.953, F1 0.934, AUC-ROC 0.994. The score distribution is bimodal: 71% of ham fell below 10 and 64% of spam fell at or above 95, with very low density in the middle of the range. The 20 errors concentrate in the overlap band between scores 35 and 75.
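The aggregate metrics can be re-derived from the error counts alone. The sketch below reconstructs the confusion matrix from the class sizes and the 13 false positives and 7 false negatives reported above; the function name is ours.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts from the text: 149 spam, 966 ham, 13 false positives, 7 false negatives.
n_spam, n_ham, fp, fn = 149, 966, 13, 7
tp = n_spam - fn  # spam correctly flagged
tn = n_ham - fp   # ham correctly passed

metrics = binary_metrics(tp, fp, fn, tn)
print({name: round(value, 3) for name, value in metrics.items()})
```

Rounded to three places, these agree with the reported accuracy, precision, recall, and F1. AUC-ROC cannot be recovered this way because it depends on the full score ranking, not on a single threshold.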

AI-vs-human authorship across two distributions

We trained two scoring traits on AI-vs-human authorship: whoami.hc3 on HC3 — the Human–ChatGPT Comparison Corpus, ~2,000 paired samples of human and ChatGPT answers — and whoami.raid on RAID, the Robust AI Detection benchmark, an adversarial collection spanning multiple generation methods, decoding strategies, and writing domains. To test whether each trait generalises beyond its training distribution, we ran both traits against both held-out test sets and computed AUC-ROC for the four combinations.

Figure 3. Cross-transfer AUC-ROC. Rows are the training distribution; columns are the test distribution. Diagonal cells are in-distribution; off-diagonal cells are cross-distribution and show the transfer drop. 1.00 = perfect separation, 0.50 = chance.

  trained on      HC3 test                      RAID test
  whoami.hc3      0.983 (in-distribution)       0.747 (cross-distribution)
  whoami.raid     0.738 (cross-distribution)    0.978 (in-distribution)

We measured in-distribution AUC of 0.983 for the HC3-trained trait on HC3 test, and 0.978 for the RAID-trained trait on RAID test. Cross-distribution AUC dropped to 0.747 (HC3-trained applied to RAID test) and 0.738 (RAID-trained applied to HC3 test). A composite that scores against both traits and combines them recovers some of the signal — 0.952 on RAID test and 0.897 on HC3 test — but does not recover the in-distribution numbers in either direction.
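The cross-transfer grid amounts to scoring every test set with every trait and computing AUC-ROC for each pair. A minimal sketch under stated assumptions: traits are modelled as plain callables from text to score, labels use 1 for AI and 0 for human, and all names are hypothetical. The quadratic pairwise AUC is chosen for clarity, not speed, and the composite trait is left out because the text does not specify how the two scores are combined.

```python
from typing import Callable, Sequence

Scorer = Callable[[str], float]
TestSet = Sequence[tuple[str, int]]  # (text, label) with 1 = AI, 0 = human


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the usual AUC-ROC definition.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def cross_transfer(
    traits: dict[str, Scorer], test_sets: dict[str, TestSet]
) -> dict[tuple[str, str], float]:
    """AUC-ROC for every (training trait, test distribution) pair."""
    grid = {}
    for trait_name, scorer in traits.items():
        for test_name, rows in test_sets.items():
            labels = [label for _, label in rows]
            scores = [scorer(text) for text, _ in rows]
            grid[(trait_name, test_name)] = auc_roc(labels, scores)
    return grid
```

With two traits and two test sets this produces the four cells of Figure 3; the in-distribution cells are those where the trait and the test set come from the same corpus.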

Scope: AI-text detection from these traits is reliable inside the distribution each was calibrated on, and noticeably less reliable outside it. We surface this directly because most published detectors do not. Scores from whoami.* traits are not authoritative determinations of human-vs-AI authorship and must not be used as the sole basis for academic, employment, or enforcement decisions about any individual.

Attributing text to one of eleven LLMs

We generated samples from 11 LLMs (Anthropic Claude Sonnet 4.6 and Haiku 4.5; Meta Llama 4 Maverick; Mistral Large 3; Amazon Nova 2 Lite; DeepSeek v3.2; OpenAI gpt-oss 120B; Moonshot Kimi K2.5; Alibaba Qwen3 32B; Google Gemma 3 27B; NVIDIA Nemotron Super 3) on a shared prompt set, then trained one scoring trait per LLM on a training partition. At inference we score each held-out sample against all eleven traits and take the argmax as the predicted author. We report top-1 and top-2 accuracy on a held-out set of 219 samples.
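The argmax step and the top-1 / top-2 metrics can be sketched as follows. The input shape (a true label plus a dict of per-trait scores for each sample) and the function names are assumptions for illustration; ties are broken by dict insertion order, which the source does not specify.

```python
from typing import Sequence


def top_k_labels(trait_scores: dict[str, float], k: int) -> list[str]:
    """The k trait names with the highest scores, best first."""
    return sorted(trait_scores, key=trait_scores.get, reverse=True)[:k]


def top_k_accuracy(
    samples: Sequence[tuple[str, dict[str, float]]], k: int
) -> float:
    """Fraction of samples whose true author is among the k best-scoring traits."""
    hits = sum(
        1 for true_label, trait_scores in samples
        if true_label in top_k_labels(trait_scores, k)
    )
    return hits / len(samples)


# Toy example with three traits instead of eleven (made-up scores).
toy_samples = [
    ("haiku", {"haiku": 91.0, "sonnet": 60.0, "qwen": 12.0}),   # top-1 hit
    ("sonnet", {"haiku": 80.0, "sonnet": 75.0, "qwen": 20.0}),  # top-2 hit only
    ("qwen", {"haiku": 55.0, "sonnet": 50.0, "qwen": 10.0}),    # miss
]
print(top_k_accuracy(toy_samples, k=1), top_k_accuracy(toy_samples, k=2))
```

With eleven traits, guessing uniformly at random gives a top-1 accuracy of 1/11, which is the chance baseline of about 0.091.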

Figure 4. Per-model top-1 accuracy: the rate at which the correct source LLM is the first-guess argmax across the eleven trained traits. The chance baseline is 9.09%. Overall: top-1 = 0.493, top-2 = 0.676.

  source LLM            top-1 accuracy
  claude haiku 4.5      75%
  llama 4 maverick      65%
  nova 2 lite           65%
  gemma 3 27b           55%
  gpt oss 120b          53%
  kimi k2.5             45%
  deepseek v3.2         40%
  nemotron super 3      40%
  claude sonnet 4.6     35%
  mistral large 3       35%
  qwen3 32b             35%

We measured top-1 accuracy of 0.493 across the held-out set, with top-2 accuracy of 0.676. Per-model top-1 ranged from 0.75 (Claude Haiku 4.5) to 0.35 (Claude Sonnet 4.6, Mistral Large 3, Qwen3 32B). When we re-trained one scoring trait per vendor family rather than per model, top-1 accuracy rose to 0.543. The Anthropic family signal in particular was durable enough that 95% of Claude Haiku samples and 70% of Claude Sonnet samples were correctly attributed to Anthropic by a single shared trait.

Scope: These scores are exploratory signals about which studied LLM most resembles a piece of text. They are not evidence of authorship, model provenance, or training-data membership. whatami.* outputs must not be used to allege policy violations, support legal claims, or make decisions about individuals. Outputs from each studied LLM remain subject to that provider's usage policy.

Limitations

All four benchmarks share three limitations worth stating directly. First, the scoring models are calibrated on each task's training partition; we have not measured how robustly they transfer across collection methodologies, time periods, or content domains beyond the explicit cross-distribution split tested in whoami. Second, the LLM-as-judge baseline on ASAP uses a single thinking-mode prompt and per-essay JSON parsing; alternative judging protocols (chain-of-thought ensembles, pairwise preference, calibration over multiple temperatures) would shift the comparison and we did not evaluate them. Third, the cost and latency comparisons reflect Bedrock on-demand pricing as of May 2026; these economics will change as serving costs do, and the architectural claim — deterministic, no LLM in the scoring path — is the durable part of the result.

Data sources

Each benchmark uses a publicly available upstream dataset under its original license.

  • ASAP 2.0 Corpus — source-based persuasive essays graded holistically 1–6 by trained human raters. Maintained by The Learning Agency Lab. License: CC-BY-4.0
  • UCI SMS Spam Collection — Almeida, T.A., Gómez Hidalgo, J.M., & Yamakami, A. (2011). Contributions to the study of SMS spam filtering: new collection and results. ACM Symposium on Document Engineering. License: CC-BY-4.0
  • HC3 (Human–ChatGPT Comparison Corpus) — Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., Yue, J., & Wu, Y. (2023). How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv:2301.07597. License: CC-BY-SA-4.0
  • RAID (Robust AI Detection benchmark) — Dugan, L., Hwang, A., Trhlík, F., Ludan, J.M., Zhu, A., Xu, H., Ippolito, D., & Callison-Burch, C. (2024). RAID: A shared benchmark for robust evaluation of machine-generated text detectors. ACL 2024 (arXiv:2405.07940). License: MIT

Why we're sharing this

We are publishing this benchmark suite because the kinds of claims a content-scoring system makes — about quality, about authorship, about trust — are claims we would want to see numbers for before believing them ourselves. The artifacts let any reader rerun the comparisons, swap in different judges, retrain on different data, and arrive at independent verdicts. Our intent is to keep extending the suite as new tasks become tractable and to publish each benchmark on its own once the methodology survives external scrutiny.

Code and data

Code, model definitions, and raw artifacts: github.com/u22a8/qed-bench. The notebooks under each benchmarks/<name>/ directory are numbered; running them in order reproduces every number on this page.
