qed-bench: benchmarking U22A8 scoring models against task-appropriate baselines
We trained scoring models on four content-judgment tasks — holistic essay quality (ASAP 2.0), SMS spam classification (UCI), AI-vs-human authorship detection (HC3 and RAID), and LLM authorship attribution across eleven LLMs — and paired each with a task-appropriate baseline: trained human raters, gold class labels, or an eight-model LLM-as-judge panel served via Amazon Bedrock.
Across the four tasks, the scoring models match or exceed the strongest available baseline at one to two orders of magnitude lower cost and latency, and they do so deterministically. Where the result breaks, most visibly on cross-distribution AI-vs-human detection, the failure mode is concentrated and easy to state. Notebooks, model definitions, and per-judge artifacts are at github.com/u22a8/qed-bench.
Methods
Each scoring model is the U22A8 SDK's primitive for content judgment, fit from a small set of labelled examples and served deterministically: the same input always returns the same score, σ = 0. The same scoring API serves all four models; only the trait definitions and training data differ. Test sets are held out at training time, and baselines are matched per task. ASAP uses an LLM-as-judge panel because the only human ratings available are the ground truth being predicted, not a separate baseline; sms-spam uses gold class labels directly; whoami and whatami report AUC and class-accuracy metrics on held-out test sets. All inputs are scored via the public, unauthenticated POST /m/{handle} endpoint that any reader can call.
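As a rough illustration of the request shape, the sketch below builds (but does not send) a POST to /m/{handle} using only the Python standard library. Only the path comes from the text above. The host in BASE_URL, the JSON body with a single "text" field, and the use of "sms-spam" as a handle are assumptions made for illustration; the real host, schema, and handles should be taken from the repository.

```python
import json
import urllib.request

# ASSUMPTION: placeholder host. The text gives only the path POST /m/{handle}.
BASE_URL = "https://api.example.invalid"


def build_score_request(handle: str, text: str) -> urllib.request.Request:
    """Build (without sending) a POST request to /m/{handle}."""
    # ASSUMPTION: a JSON body with one "text" field; the real schema may differ.
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/m/{handle}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# ASSUMPTION: "sms-spam" is used here as a hypothetical handle.
req = build_score_request("sms-spam", "WIN a free prize, reply now")
print(req.get_method(), req.full_url)
```

Because scoring is described as deterministic, sending the same request twice should return the same score, which makes a simple equality check a reasonable smoke test against the live endpoint.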
asap
Holistic essay quality on ASAP 2.0
The ASAP 2.0 corpus contains 1,047 source-based persuasive essays from U.S. school students, holistically graded 1 through 6 by trained human raters. Following the convention in published essay-scoring work, we report Spearman ρ against the human grades. We trained one scoring trait on a separate training portion of the corpus and evaluated it on 1,047 essays alongside eight LLM-as-judge baselines run via Bedrock with thinking-mode prompts.
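For readers who want to reproduce the metric, here is a minimal sketch of Spearman ρ computed as the Pearson correlation of average ranks, which handles the ties that a 1 to 6 grading scale produces. The grades and scores below are made-up toy values, not data from the benchmark, and the function names are ours.

```python
from math import sqrt


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman rho: Pearson correlation applied to average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data only: human grades on the 1-6 scale and hypothetical model scores.
human_grades = [1, 2, 2, 3, 4, 4, 5, 6]
model_scores = [12.0, 30.5, 25.0, 41.0, 66.0, 58.0, 80.0, 93.0]
rho = spearman_rho(human_grades, model_scores)
print(f"Spearman rho = {rho:.3f}")
```

Because ρ depends only on rank order, the scoring model's output scale does not need to match the 1 to 6 grading scale.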
We measured ρ = 0.815 for the scoring model against the human grades. Claude Opus 4.7 reached ρ = 0.813 on the same essays, a statistical tie at the top. The other seven models in the panel scored between ρ = 0.637 (Claude Haiku 4.5) and ρ = 0.724 (Claude Sonnet 4.6). Per-essay cost was $3.0 × 10⁻⁵ for the scoring model versus $5.8 × 10⁻³ for Opus 4.7 (Bedrock on-demand pricing, input + output tokens), and per-essay latency was 204 ms versus 3,828 ms wall-clock. Scoring is deterministic: re-running the same input returns the same score, σ = 0.
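The cost and latency figures above translate into ratios with a few lines of arithmetic. The sketch below uses only the four numbers reported in this section; the variable names are ours.

```python
from math import log10

# Per-essay figures as reported above.
scoring_cost_usd = 3.0e-5
opus_cost_usd = 5.8e-3
scoring_latency_ms = 204
opus_latency_ms = 3828

cost_ratio = opus_cost_usd / scoring_cost_usd          # roughly 193x
latency_ratio = opus_latency_ms / scoring_latency_ms   # roughly 18.8x

print(f"cost:    {cost_ratio:.0f}x cheaper ({log10(cost_ratio):.1f} orders of magnitude)")
print(f"latency: {latency_ratio:.1f}x faster ({log10(latency_ratio):.1f} orders of magnitude)")
```

On this task, cost differs by about two orders of magnitude and latency by about one.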
sms-spam
Binary spam classification on the UCI SMS Spam Collection
The UCI SMS Spam Collection contains 5,574 SMS messages with gold spam/ham labels. We trained on a training partition and evaluated on 1,115 held-out test messages (149 spam, 966 ham). The scoring model emits a 0–100 spam score per message; we report calibrated metrics at a decision threshold of 50 and the score distribution across both classes.
At a threshold of 50, we observed 1,095 of 1,115 messages classified correctly: 13 false positives (ham mislabelled as spam) and 7 false negatives (spam mislabelled as ham). Aggregate metrics: accuracy 0.982, precision 0.916, recall 0.953, F1 0.934, AUC-ROC 0.994. The score distribution is bimodal: 71% of ham fell below 10 and 64% of spam fell at or above 95, with very low density in the middle of the range. The 20 errors concentrate in the overlap band between scores 35 and 75.
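The aggregate metrics can be rechecked from the counts given above. The sketch below derives the confusion matrix from the class sizes (149 spam, 966 ham) and the reported errors (13 false positives, 7 false negatives), then recomputes accuracy, precision, recall, and F1. AUC-ROC is not included because it requires the per-message scores, not only the counts.

```python
# Counts taken from the text: class sizes and errors at threshold 50.
n_spam, n_ham = 149, 966
false_positives = 13   # ham scored as spam
false_negatives = 7    # spam scored as ham

true_positives = n_spam - false_negatives   # 142
true_negatives = n_ham - false_positives    # 953
total = n_spam + n_ham                      # 1115

accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The recomputed values round to the reported 0.982, 0.916, 0.953, and 0.934.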
whoami
AI-vs-human authorship across two distributions
We trained two scoring traits on AI-vs-human authorship: whoami.hc3 on HC3 — the Human–ChatGPT Comparison Corpus, ~2,000 paired samples of human and ChatGPT answers — and whoami.raid on RAID, the Robust AI Detection benchmark, an adversarial collection spanning multiple generation methods, decoding strategies, and writing domains. To test whether each trait generalises beyond its training distribution, we ran both traits against both held-out test sets and computed AUC-ROC for the four combinations.
We measured in-distribution AUC of 0.983 for the HC3-trained trait on HC3 test, and 0.978 for the RAID-trained trait on RAID test. Cross-distribution AUC dropped to 0.747 (HC3-trained applied to RAID test) and 0.738 (RAID-trained applied to HC3 test). A composite that scores against both traits and combines them recovers some of the signal — 0.952 on RAID test and 0.897 on HC3 test — but does not recover the in-distribution numbers in either direction.
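As a reminder of what the AUC figures measure, the sketch below computes AUC-ROC as the probability that a randomly chosen AI-written sample scores higher than a randomly chosen human-written one, counting ties as half. The scores are toy values. The dictionary only restates the four AUCs reported above to show the size of the cross-distribution drop. This is a plain pairwise implementation for illustration, not necessarily how the benchmark notebooks compute it.

```python
def auc_roc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))


# Toy scores only (higher = more likely AI-written).
ai_scores = [92, 85, 60, 40]
human_scores = [55, 30, 20, 10]
toy_auc = auc_roc(ai_scores, human_scores)
print(f"toy AUC = {toy_auc}")

# Reported AUCs, keyed by (training corpus, test corpus).
reported = {
    ("hc3", "hc3"): 0.983,
    ("raid", "raid"): 0.978,
    ("hc3", "raid"): 0.747,
    ("raid", "hc3"): 0.738,
}
for train in ("hc3", "raid"):
    other = "raid" if train == "hc3" else "hc3"
    drop = reported[(train, train)] - reported[(train, other)]
    print(f"{train}-trained: drop of {drop:.3f} AUC on {other} test")
```

Each trait loses roughly 0.24 AUC when moved to the other corpus.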
whoami.* traits are not authoritative determinations of human-vs-AI authorship and must not be used as the sole basis for academic, employment, or enforcement decisions about any individual.
whatami
Attributing text to one of eleven LLMs
We generated samples from 11 LLMs (Anthropic Claude Sonnet 4.6 and Haiku 4.5; Meta Llama 4 Maverick; Mistral Large 3; Amazon Nova 2 Lite; DeepSeek v3.2; OpenAI gpt-oss 120B; Moonshot Kimi K2.5; Alibaba Qwen3 32B; Google Gemma 3 27B; NVIDIA Nemotron Super 3) on a shared prompt set, then trained one scoring trait per LLM on a training partition. At inference we score each held-out sample against all eleven traits and take the argmax as the predicted author. We report top-1 and top-2 accuracy on a held-out set of 219 samples.
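The argmax attribution and the top-1 and top-2 metrics described above can be sketched as follows. The three labels and all scores are hypothetical toy values standing in for the eleven per-LLM traits, and top_k_accuracy is our own helper name.

```python
def top_k_accuracy(samples, k):
    """Fraction of samples whose true label is among the k highest-scoring traits."""
    hits = 0
    for true_label, trait_scores in samples:
        # Sort trait names by score, highest first; the argmax is ranked[0].
        ranked = sorted(trait_scores, key=trait_scores.get, reverse=True)
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(samples)


# Toy data: (true author, {trait name: score}). Names are hypothetical.
samples = [
    ("model_a", {"model_a": 80, "model_b": 40, "model_c": 10}),  # top-1 hit
    ("model_b", {"model_a": 70, "model_b": 65, "model_c": 20}),  # top-2 hit only
    ("model_c", {"model_a": 50, "model_b": 45, "model_c": 30}),  # miss
    ("model_a", {"model_a": 90, "model_b": 15, "model_c": 5}),   # top-1 hit
]

top1 = top_k_accuracy(samples, k=1)
top2 = top_k_accuracy(samples, k=2)
print(f"top-1 = {top1}, top-2 = {top2}")
```

With eleven candidate authors, uniform random guessing would give a top-1 accuracy of 1/11, about 0.09, which is a useful floor when reading the reported numbers.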
We measured top-1 accuracy of 0.493 across the held-out set, with top-2 accuracy of 0.676. Per-model top-1 ranged from 0.75 (Claude Haiku 4.5) to 0.35 (Claude Sonnet 4.6, Mistral Large 3, Qwen3 32B). When we re-trained one scoring trait per vendor family rather than per model, top-1 accuracy rose to 0.543. The Anthropic family signal in particular was durable enough that 95% of Claude Haiku samples and 70% of Claude Sonnet samples were correctly attributed to Anthropic by a single shared trait.
whatami.* outputs must not be used to allege policy violations, support legal claims, or make decisions about individuals. Outputs from each studied LLM remain subject to that provider's usage policy.