U⊨22A8 · built by @onebit0fme · Terms · Privacy

Research

Benchmarks, methodology, and the raw artifacts behind them. We publish work here when the comparisons are reproducible end-to-end and the failure modes can be stated clearly.

  • Benchmarks · May 5, 2026

    qed-bench: benchmarking small scoring models against task-appropriate baselines

    We trained scoring models on four content-judgment tasks (holistic essay quality, SMS spam, AI-vs-human authorship, and LLM authorship attribution) and compared each one to its task-appropriate baseline: trained human raters, gold labels, or an eight-model LLM-as-judge panel. Notebooks, models, and per-judge artifacts are available at github.com/u22a8/qed-bench.
