The fabric of information space.

Evals for your AI stack, without generating output tokens. 23× cheaper than the typical LLM-as-a-judge — more accurate, lower latency, deterministic.

vs Claude Opus 4.7, the only judge that matches our quality — 195× cheaper, 19× faster. qed-bench →

more accurate ↑ cheaper →

No trade-off.

  • 23× cheaper than the typical LLM-as-a-judge (195× vs Opus 4.7)
  • +18% more accurate than the average LLM-as-a-judge
  • 19× faster (~200 ms a call)
  • σ = 0, deterministic (same input, same score)
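
One way to check a "same input, same score" claim for any scorer is to call it repeatedly on one input and measure the spread. The sketch below is illustrative only: `score` is a hypothetical stand-in (a hash mapped to [0, 1]), not the product's API, and `score_spread` is a helper name invented for this example.

```python
import hashlib
from statistics import pstdev

def score(text: str) -> float:
    """Hypothetical stand-in for a deterministic scorer: maps text to [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def score_spread(scorer, text: str, runs: int = 20) -> float:
    """Population standard deviation of repeated scores for one input."""
    scores = [scorer(text) for _ in range(runs)]
    return pstdev(scores)

# A deterministic scorer returns the same value every time, so the spread is zero.
print(score_spread(score, "The answer cites the retrieved passage."))
```

Running the same check against a sampling-based judge would typically show a non-zero spread, which is the contrast the σ = 0 figure points at.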

The fabric architecture

A library of trained DLMs.

Each one a single eval metric, learned — the RAG, quality, and safety checks you’d otherwise hand to an LLM judge. Open any lens to see it score, or bring your own content alongside.

qed-bench · four tasks, one pattern

Measured against the strongest baseline for each task.

Each scoring model is compared to its task-appropriate baseline: trained human raters, gold labels, or an eight-model LLM-as-judge panel. Each one matches the best of them, at a fraction of the cost.

Holistic essay quality · ASAP 2.0 · cost vs agreement with 1,047 human-graded essays
[Figure: agreement with human grades (Spearman ρ, axis 0.60 to 0.80) plotted against cost per essay (log scale, $10⁻⁵ to $10⁻²). Labeled points: Claude Opus 4.7 (ρ 0.813, $0.0058) and the scoring model (ρ 0.815, $0.00003, σ = 0). Other judges plotted: Claude Sonnet 4.6, DeepSeek v3.2, Llama 4 Maverick, Gemma 3 27B, Mistral Large 3, Qwen3 32B, Claude Haiku 4.5.]
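
Agreement in this chart is Spearman's ρ, the correlation between the rank order of the model's scores and the rank order of the human grades. A minimal sketch of the computation in plain Python follows. The `human` and `model` lists are made-up values for illustration, not qed-bench data.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

human = [1, 2, 3, 4, 5, 6]               # illustrative human grades
model = [1.2, 1.9, 3.5, 3.1, 5.0, 5.8]   # illustrative model scores
print(round(spearman_rho(human, model), 3))  # 0.943: one adjacent pair is swapped
```

Because ρ depends only on ordering, a scorer can use a different numeric scale from the human raters and still reach high agreement.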

Seriously — talk to the MLE.

Name’s Taras. He built this, and he’s interested in scaling your evals. No sign-up, no sales call — just the person who wrote the code, one message away.

FAQ

How do you evaluate without an LLM?

A DLM (Discriminative Language Model) runs on an encoder-only architecture and drops the generative layer. Put simply, it reads meaning straight from the text and scores it in one pass.
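
The "one pass" shape can be sketched as encode, pool, apply a scoring head, with no decoding loop. The toy below is only a stand-in under loud assumptions: `embed_token` fakes an encoder with hash-derived vectors, and `HEAD_WEIGHTS` is untrained and purely illustrative. A real DLM would use a trained transformer encoder and a learned head, and none of these names come from the product.

```python
import hashlib
import math

DIM = 8  # toy embedding width (assumption for this sketch)

def embed_token(token: str) -> list[float]:
    """Toy stand-in for a trained encoder: a fixed pseudo-embedding per token."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 255.0 - 0.5 for b in digest[:DIM]]

def encode(text: str) -> list[float]:
    """Mean-pool the token vectors into one fixed-size representation."""
    tokens = text.lower().split() or [""]
    vectors = [embed_token(t) for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def score_text(text: str, weights: list[float], bias: float = 0.0) -> float:
    """Single pass: encode, apply a linear head, squash to (0, 1)."""
    pooled = encode(text)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Illustrative, untrained head weights; training would fit these to a criterion.
HEAD_WEIGHTS = [0.5, -0.3, 0.8, 0.1, -0.6, 0.4, 0.2, -0.1]

print(score_text("The answer cites the retrieved passage.", HEAD_WEIGHTS))
```

The work per call is fixed by the input length, since nothing is generated token by token. That is the structural reason a scorer like this can be both fast and repeatable.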

Isn’t this just embedding similarity?

No. Cosine similarity tells you how close two texts are, not whether one is good. It has no notion of the criterion you’re scoring, such as faithfulness, relevance, or tone. A DLM learns that criterion.
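
A small illustration of the gap, using bag-of-words vectors as a deliberately simple stand-in for embeddings (an assumption of this sketch; learned embeddings are richer, but the similarity number still encodes closeness rather than a criterion). The example sentences are invented.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

source = "the trial found the drug is safe for adults"
faithful = "the drug is safe for adults"
unfaithful = "the trial found the drug is not safe for adults"

# The contradicting summary is *closer* to the source than the faithful one.
print(round(cosine(source, faithful), 3))    # 0.862
print(round(cosine(source, unfaithful), 3))  # 0.957
```

Similarity ranks the unfaithful text higher because it shares more words with the source. A scorer trained on faithfulness labels is meant to learn the distinction that this number cannot express.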

When should I still use an LLM?

When you need the judge to argue: a written rationale, a paragraph defending the verdict. ⊨ gives you a score and its confidence, not prose; for that explanation, an LLM is still the right tool.

Huh?

Fair — it’s a strange idea the first time you meet it. The MLE is notoriously happy to explain it in more detail than you probably want. Talk to the MLE and ask anything.