qed-bench: benchmarking U22A8 scoring models against task-appropriate baselines

Benchmarks · May 5, 2026

We trained scoring models on four content-judgment tasks — holistic essay quality (ASAP 2.0), SMS spam classification (UCI), AI-vs-human authorship detection (HC3 and RAID), and LLM authorship attribution across eleven LLMs — and paired each with a task-appropriate baseline: trained human raters, gold class labels, or an eight-model LLM-as-judge panel served via Amazon Bedrock.

Across the four tasks, the scoring models match or exceed the strongest available baseline at one to two orders of magnitude lower cost and latency, deterministically. Where the result breaks — most visibly on cross-distribution AI-vs-human detection — the failure mode is concentrated and stateable. Notebooks, model definitions, and per-judge artifacts are at github.com/u22a8/qed-bench.

ASAP 2.0 · essay quality: ρ = 0.815 vs human raters; matches Claude Opus 4.7 (ρ = 0.813); other panel models 0.637 to 0.724

UCI SMS Spam: accuracy 0.982; F1 = 0.934, AUC-ROC = 0.994; 20 errors on 1,115 held-out messages

whoami · AI vs human: AUC 0.98 → 0.74; in-distribution AUC ~0.98; cross-distribution AUC drops to ~0.74

whatami · LLM attribution: top-1 0.493; 11-way classification; chance = 0.091; top-2 = 0.676; per-family top-1 = 0.543

Methods

Each scoring model is the U22A8 SDK's primitive for content judgment, fit from a small set of labelled examples and served deterministically: the same input always returns the same score, σ = 0. The same scoring API serves all four models; only the trait definitions and training data differ. Test sets are held out at training time, and baselines are matched per task. ASAP uses an LLM-as-judge panel because the only human ratings available are the ground truth being predicted, not a separate baseline; sms-spam uses gold class labels directly; whoami and whatami report AUC and class-accuracy metrics on held-out test sets. All inputs are scored via the public, unauthenticated POST /m/{handle} endpoint that any reader can call.
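One way a reader could check the σ = 0 claim is to score the same input repeatedly and take the standard deviation of the results. The sketch below is a minimal illustration under stated assumptions: `score_fn` is a hypothetical callable standing in for a call to the scoring endpoint, and `stub_score` is a local placeholder, not the U22A8 API or its request format.

```python
import hashlib
import statistics
from typing import Callable


def score_spread(score_fn: Callable[[str], float], text: str, runs: int = 10) -> float:
    """Population standard deviation of repeated scores for one input."""
    scores = [score_fn(text) for _ in range(runs)]
    return statistics.pstdev(scores)


def stub_score(text: str) -> float:
    """Hypothetical stand-in scorer: a pure function of the input text."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return digest[0] / 255 * 100  # maps the first hash byte onto a 0-100 range


# A deterministic scorer returns identical values, so the spread is zero.
sigma = score_spread(stub_score, "The quick brown fox.")
print(sigma)
```

Swapping `stub_score` for a function that actually calls the endpoint would turn this into a real check; any non-zero spread would contradict the determinism claim.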

asap

Holistic essay quality on ASAP 2.0

The ASAP 2.0 corpus contains 1,047 source-based persuasive essays from U.S. school students, holistically graded 1 through 6 by trained human raters. Spearman ρ against the human grades is the conventional metric in published essay-scoring work. We trained one scoring trait on a separate training portion of the corpus and evaluated it on 1,047 held-out essays alongside eight LLM-as-judge baselines run via Bedrock with thinking-mode prompts.
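For readers unfamiliar with the metric, Spearman ρ is the Pearson correlation of the two rank vectors, with tied values sharing an average rank. The following is a minimal standard-library sketch of that definition, not the benchmark's own evaluation code; the `human` and `model` lists are made-up toy values rather than ASAP essays.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the rank vectors of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data for illustration only: grades on the 1-6 scale and arbitrary scores.
human = [1, 2, 2, 3, 4, 4, 5, 6]
model = [12, 30, 25, 41, 60, 55, 80, 95]
rho = spearman_rho(human, model)
print(round(rho, 3))
```

Because only ranks matter, the scoring model's output scale does not need to match the 1 through 6 grade scale for ρ to be meaningful.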

Figure 1. Spearman ρ between each judge's score and the human raters' holistic grades, on 1,047 ASAP 2.0 test essays. Higher is better. The scoring model is highlighted; LLM judges are sorted by ρ.

We measured ρ = 0.815 for the scoring model against the human grades. Claude Opus 4.7 reached ρ = 0.813 on the same essays — a statistical tie at the top. The other seven models in the panel scored between ρ = 0.637 (Claude Haiku 4.5) and ρ = 0.724 (Claude Sonnet 4.6). Per-essay cost was $3.0×10⁻⁵ for the scoring model versus $5.8×10⁻³ for Opus 4.7 (Bedrock on-demand pricing, input + output tokens), and per-essay latency was 204 ms versus 3,828 ms wall-clock. Scoring is deterministic: re-running the same input returns the same score, σ = 0.
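As a quick arithmetic check on the cost and latency comparison, the per-essay figures quoted above can be turned into ratios and orders of magnitude. This is plain arithmetic on the numbers in the text; the variable names are ours.

```python
import math

# Per-essay figures as quoted in the text.
scoring_cost, opus_cost = 3.0e-5, 5.8e-3  # USD per essay
scoring_ms, opus_ms = 204, 3828           # wall-clock milliseconds per essay

cost_ratio = opus_cost / scoring_cost
latency_ratio = opus_ms / scoring_ms

print(f"cost: {cost_ratio:.0f}x ({math.log10(cost_ratio):.1f} orders of magnitude)")
print(f"latency: {latency_ratio:.1f}x ({math.log10(latency_ratio):.1f} orders of magnitude)")
```

The cost gap works out to roughly two orders of magnitude and the latency gap to roughly one.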

sms-spam

Binary spam classification on the UCI SMS Spam Collection

The UCI SMS Spam Collection contains 5,574 SMS messages with gold spam/ham labels. We trained on a training partition and evaluated on 1,115 held-out test messages (149 spam, 966 ham). The scoring model emits a 0–100 spam score per message; we report calibrated metrics at a decision threshold of 50 and the score distribution across both classes.

Figure 2. Score distribution on the held-out test set, plotted as within-class proportion per 5-point bin so the two classes sit on a comparable axis (spam is ~6× rarer than ham). Decision threshold at 50 marked in dashed orange.

At a threshold of 50, we observed 1,095 of 1,115 messages classified correctly: 13 false positives (ham mislabelled as spam) and 7 false negatives (spam mislabelled as ham). Aggregate metrics: accuracy 0.982, precision 0.916, recall 0.953, F1 0.934, AUC-ROC 0.994. The score distribution is bimodal: 71% of ham fell below 10 and 64% of spam fell at or above 95, with very low density in the middle of the range. The 20 errors concentrate in the overlap band between scores 35 and 75.
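The aggregate metrics follow directly from the error counts and class sizes reported above. The sketch below rederives them with the standard definitions; only the counts come from the text. AUC-ROC is not included because it depends on the full score distribution, not on counts at a single threshold.

```python
# Counts from the text: 149 spam, 966 ham, 13 false positives, 7 false negatives.
n_spam, n_ham = 149, 966
fp, fn = 13, 7

tp = n_spam - fn  # spam correctly flagged
tn = n_ham - fp   # ham correctly passed

accuracy = (tp + tn) / (n_spam + n_ham)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```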

whoami

AI-vs-human authorship across two distributions

We trained two scoring traits on AI-vs-human authorship: whoami.hc3 on HC3 — the Human–ChatGPT Comparison Corpus, ~2,000 paired samples of human and ChatGPT answers — and whoami.raid on RAID, the Robust AI Detection benchmark, an adversarial collection spanning multiple generation methods, decoding strategies, and writing domains. To test whether each trait generalises beyond its training distribution, we ran both traits against both held-out test sets and computed AUC-ROC for the four combinations.

Figure 3. Cross-transfer AUC-ROC. Rows are the training distribution; columns are the test distribution. Diagonal cells (in-distribution) are emphasised; off-diagonal cells (cross-distribution) show the transfer drop.

We measured in-distribution AUC of 0.983 for the HC3-trained trait on HC3 test, and 0.978 for the RAID-trained trait on RAID test. Cross-distribution AUC dropped to 0.747 (HC3-trained applied to RAID test) and 0.738 (RAID-trained applied to HC3 test). A composite that scores against both traits and combines them recovers some of the signal — 0.952 on RAID test and 0.897 on HC3 test — but does not recover the in-distribution numbers in either direction.
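For reference, AUC-ROC can be read as the probability that a randomly chosen AI-written sample scores higher than a randomly chosen human-written one, with ties counted as half. The sketch below implements that pairwise definition in plain Python; the scores and labels are invented toy values, and this is not the benchmark's evaluation code or its composite method.

```python
def auc_roc(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only: 1 = AI-written, 0 = human-written.
labels = [1, 1, 1, 0, 0, 0]
scores = [90, 40, 80, 30, 60, 10]

auc = auc_roc(scores, labels)
print(round(auc, 3))
```

Applying one such function to each of the four train/test pairings gives the cells of a cross-transfer table. The pairwise form is quadratic in sample count, so a rank-based implementation would be preferable on large test sets.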

Scope: AI-text detection from these traits is reliable inside the distribution each was calibrated on, and noticeably less reliable outside it. We surface this directly because most published detectors do not. Scores from whoami.* traits are not authoritative determinations of human-vs-AI authorship and must not be used as the sole basis for academic, employment, or enforcement decisions about any individual.

whatami

Attributing text to one of eleven LLMs

We generated samples from 11 LLMs (Anthropic Claude Sonnet 4.6 and Haiku 4.5; Meta Llama 4 Maverick; Mistral Large 3; Amazon Nova 2 Lite; DeepSeek v3.2; OpenAI gpt-oss 120B; Moonshot Kimi K2.5; Alibaba Qwen3 32B; Google Gemma 3 27B; NVIDIA Nemotron Super 3) on a shared prompt set, then trained one scoring trait per LLM on a training partition. At inference we score each held-out sample against all eleven traits and take the argmax as the predicted author. We report top-1 and top-2 accuracy on a held-out set of 219 samples.
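The argmax attribution and the top-1 and top-2 metrics described above can be sketched as follows. This is an illustration under assumptions: the per-trait scores are represented as a dict per sample, and the three labels (`model_a`, `model_b`, `model_c`) and their scores are hypothetical toy values, not the eleven trained traits.

```python
def top_k_accuracy(score_rows, true_labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring traits."""
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        # Sort trait names by score, highest first; ranked[0] is the argmax.
        ranked = sorted(scores, key=scores.get, reverse=True)
        hits += truth in ranked[:k]
    return hits / len(true_labels)


# Toy data for illustration only: one dict of trait scores per held-out sample.
score_rows = [
    {"model_a": 80, "model_b": 60, "model_c": 10},
    {"model_a": 70, "model_b": 65, "model_c": 20},
    {"model_a": 30, "model_b": 20, "model_c": 90},
    {"model_a": 55, "model_b": 40, "model_c": 50},
]
true_labels = ["model_a", "model_b", "model_c", "model_b"]

top1 = top_k_accuracy(score_rows, true_labels, k=1)
top2 = top_k_accuracy(score_rows, true_labels, k=2)
chance = 1 / 11  # uniform guessing across eleven candidate LLMs
print(top1, top2, round(chance, 4))
```

Ties between traits are broken here by dict insertion order, since Python's sort is stable; a real evaluation would need to state its own tie-breaking rule.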

Figure 4. Per-model top-1 accuracy: the rate at which the correct source LLM is the model's first-guess argmax across the eleven trained traits. Chance baseline (9.09%) marked dashed. Overall: top-1 = 0.493, top-2 = 0.676.

We measured top-1 accuracy of 0.493 across the held-out set, with top-2 accuracy of 0.676. Per-model top-1 ranged from 0.75 (Claude Haiku 4.5) to 0.35 (Claude Sonnet 4.6, Mistral Large 3, Qwen3 32B). When we re-trained one scoring trait per vendor family rather than per model, top-1 accuracy rose to 0.543. The Anthropic family signal in particular was durable enough that 95% of Claude Haiku samples and 70% of Claude Sonnet samples were correctly attributed to Anthropic by a single shared trait.

Scope: These scores are exploratory signals about which studied LLM most resembles a piece of text. They are not evidence of authorship, model provenance, or training-data membership. whatami.* outputs must not be used to allege policy violations, support legal claims, or make decisions about individuals. Outputs from each studied LLM remain subject to that provider's usage policy.
