u22a8.ai — evals without the LLM


U22A8 · built by @onebit0fme · Terms · Privacy

The fabric of information space.

Evals for your AI stack, without generating output tokens. 23× cheaper than the typical LLM-as-a-judge — more accurate, lower latency, deterministic.†

† vs Claude Opus 4.7, the only judge that matches our quality — 195× cheaper, 19× faster. qed-bench →

Talk to the MLE

No trade-off.

23× cheaper than the typical LLM-as-a-judge (195× vs Opus 4.7)
+18% more accurate than the average LLM-as-a-judge
19× faster (~200 ms a call)
σ = 0: deterministic (same input, same score)
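
One way to check a "σ = 0" claim for any scorer is to score the same input repeatedly and take the standard deviation of the results. The sketch below assumes nothing about the U⊨22A8 API: `toy_deterministic_score` and `toy_sampled_score` are hypothetical stand-ins, one a pure function of its input and one that adds sampling noise.

```python
import hashlib
import random
from statistics import pstdev


def toy_deterministic_score(text: str) -> float:
    """Hypothetical stand-in scorer: a pure function of its input.

    This is NOT the U⊨22A8 model, only something that behaves deterministically.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 0xFFFFFFFF


_rng = random.Random(0)  # seeded so the example is reproducible


def toy_sampled_score(text: str) -> float:
    """Hypothetical stand-in for a judge whose output varies between calls."""
    return min(1.0, max(0.0, 0.7 + _rng.gauss(0, 0.05)))


def score_spread(scorer, text: str, runs: int = 20) -> float:
    """Population standard deviation of repeated scores for one input."""
    scores = [scorer(text) for _ in range(runs)]
    return pstdev(scores)


essay = "The same essay, scored repeatedly."
deterministic_spread = score_spread(toy_deterministic_score, essay)
sampled_spread = score_spread(toy_sampled_score, essay)
```

A deterministic scorer gives a spread of exactly zero. A scorer that samples does not, so a single call to it is only one draw from a distribution.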

The fabric architecture

[Diagram: Eval criteria (prompt · rubric · examples) → polymorphic input → U⊨22A8 (Encoders, Semantic Store, DLM; online · low-latency · highly scalable) → score → Evals (Braintrust · LangSmith · Langfuse · DeepEval). LLM-as-a-judge: retired.]

A library of trained DLMs.

Each one a single eval metric, learned — the RAG, quality, and safety checks you’d otherwise hand to an LLM judge. Open any lens to see it score, or bring your own content alongside.

Browse the full catalog →

qed-bench · four tasks, one pattern

Measured against the strongest baseline for each task.

Every scoring model is compared to its task-appropriate baseline: trained human raters, gold labels, or an eight-model LLM-as-judge panel. Each one matches the best of those baselines at a fraction of the cost.

Holistic essay quality · ASAP 2.0 · cost vs agreement with 1,047 human-graded essays

Same agreement with human graders as the frontier model — ρ 0.815 vs 0.813 — at ≈195× lower cost, ≈19× faster, deterministically.

About the benchmark

ASAP 2.0 — 1,047 source-based student essays, holistically graded 1–6 by trained human raters. We score each essay and rank methods by Spearman ρ against those grades, alongside an eight-model LLM-as-judge panel run on Amazon Bedrock.
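
For readers who want to see what "rank methods by Spearman ρ" means in practice, here is a minimal sketch using only the standard library. It is not the qed-bench harness. The function names are illustrative and the sample grades and scores are made up, not taken from ASAP 2.0.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Made-up illustration data: human grades on the 1-6 scale, model scores in [0, 1].
human_grades = [1, 2, 2, 3, 4, 5, 6, 6]
model_scores = [0.10, 0.25, 0.30, 0.28, 0.60, 0.70, 0.95, 0.90]
rho = spearman_rho(human_grades, model_scores)
```

Because ρ depends only on rank order, the scorer's output does not need to be on the same 1–6 scale as the human grades. It only needs to order the essays the way the raters did.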

Methodology

Each point is one scoring method: agreement with humans (ρ, vertical) against cost per essay (horizontal, log, Bedrock on-demand). Up and to the left is better — the scoring model sits there alone, and unlike the panel it returns the same score every time.

Seriously — talk to the MLE.

Name’s Taras. He built this, and he’s interested in scaling your evals. No sign-up, no sales call — just the person who wrote the code, one message away.


FAQ

How do you evaluate without an LLM?

A DLM (Discriminative Language Model) runs on an encoder-only architecture and drops the generative layer. Put simply, it reads meaning straight from text and scores it in one pass.

Isn’t this just embedding similarity?

No. Cosine similarity tells you how close two texts are, not whether one is good. It has no notion of the criterion you’re scoring, such as faithfulness, relevance, or tone. A DLM learns that criterion.
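
To make the contrast concrete, here is a minimal sketch of cosine similarity itself. The three-dimensional vectors are made-up stand-ins for text embeddings, not output from any real encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for the embeddings of a source text and an answer.
source_vec = [0.9, 0.1, 0.3]
answer_vec = [0.8, 0.2, 0.4]

sim_ab = cosine_similarity(source_vec, answer_vec)
sim_ba = cosine_similarity(answer_vec, source_vec)  # same value: the measure is symmetric
```

The function takes two vectors and nothing else. It has no argument for a criterion, so it cannot return a different number for "faithful" than for "relevant", and it cannot tell which of the two texts is the one being judged.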

When should I still use an LLM?

When you need the judge to argue: a written rationale, a paragraph defending the verdict. U⊨22A8 gives you a score and its confidence, not prose; for that explanation, an LLM is still the right tool.

Huh?

Fair — it’s a strange idea the first time you meet it. The MLE is notoriously happy to explain it in more detail than you probably want. Talk to the MLE and ask anything.