Each one a single eval metric, learned — the RAG, quality, and safety checks you’d otherwise hand to an LLM judge. Open any lens to see it score, or bring your own content alongside.
qed-bench · four tasks, one pattern
Every scoring model is compared to its task-appropriate baseline — trained human raters, gold labels, or an eight-model LLM-as-judge panel. It matches the best of them, at a fraction of the cost.
Seriously — talk to the MLE.
Name’s Taras. He built this, and he’s interested in scaling your evals. No sign-up, no sales call — just the person who wrote the code, one message away.
FAQ
A DLM (Discriminative Language Model) runs on an encoder-only architecture and drops the generative layer. Put simply, it reads meaning straight from text and scores it in one pass.
No. Cosine similarity tells you how close two texts are, not whether one is good. It has no notion of the criterion you’re scoring, such as faithfulness, relevance, or tone. A DLM learns that criterion.
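To make the gap concrete, here is a minimal sketch. It assumes a toy bag-of-words vector rather than a real embedding model, and the sentences are invented for illustration. A summary that flips the source's meaning with a single "not" still scores nearly as close as the faithful one.

```python
import math
from collections import Counter


def bow_vector(text):
    """Toy bag-of-words vector: lowercase word counts (an illustrative stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example texts, not taken from any benchmark.
source = "the trial found the drug is safe for adults"
faithful = "the drug is safe for adults"
unfaithful = "the drug is not safe for adults"  # contradicts the source

sim_faithful = cosine_similarity(bow_vector(source), bow_vector(faithful))
sim_unfaithful = cosine_similarity(bow_vector(source), bow_vector(unfaithful))

# Both scores come out high and close together, even though only one summary is faithful.
print(f"faithful:   {sim_faithful:.3f}")
print(f"unfaithful: {sim_unfaithful:.3f}")
```

Real embeddings are far richer than word counts, but the point carries over: closeness is measured, the criterion is not.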
When you need the judge to argue: a written rationale, a paragraph defending the verdict. ⊨ gives you a score and its confidence, not prose; for that explanation, an LLM is still the right tool.
Fair — it’s a strange idea the first time you meet it. The MLE is notoriously happy to explain it in more detail than you probably want. Talk to the MLE and ask anything.