Research

Benchmarks, methodology, and the raw artifacts behind them. We publish work here when the comparisons are reproducible end to end and the failure modes can be stated clearly.

Benchmarks May 5, 2026

qed-bench: benchmarking small scoring models against task-appropriate baselines

We trained scoring models on four content-judgment tasks (holistic essay quality, SMS spam, AI-vs-human authorship, and LLM authorship attribution) and compared each one to its task-appropriate baseline: trained human raters, gold labels, or an eight-model LLM-as-judge panel. Notebooks, models, and per-judge artifacts are available at github.com/u22a8/qed-bench.

