{"models":[{"handle":"u22a8.answer-relevancy","description":"Measures whether a response actually addresses the question that was asked. The RAGAS answer_relevancy metric reimagined as a learned model — captures topic alignment, completeness of address, and absence of tangential content. An answer can be faithful to context yet irrelevant if it talks about the wrong thing. This model catches that failure mode.","scoring_method":"ridge","traits":["Completeness of Address","Directness of Answer","Focus","Intent Match","Topic Alignment"]},{"handle":"u22a8.changelog-entry","description":"Measures: Actionability, Motivation, Scoping, Specificity, User Impact Focus","scoring_method":"ridge","traits":["Actionability","Motivation","Scoping","Specificity","User Impact Focus"]},{"handle":"u22a8.cold-outreach-opener","description":"Scores the quality of a cold outreach opener — whether it demonstrates genuine research on the recipient, offers a specific reason for reaching out, and earns the right to a reply without resorting to templates or manipulation.","scoring_method":"ridge","traits":["Ask Clarity","Authenticity","Brevity","Relevance Bridge","Research Signal"]},{"handle":"u22a8.commit-message","description":"Measures: Actionable Summary, Context Sufficiency, Intent Clarity, Scope Precision, Signal Density","scoring_method":"ridge","traits":["Actionable Summary","Context Sufficiency","Intent Clarity","Scope Precision","Signal Density"]},{"handle":"u22a8.compelling-readme","description":"Measures: Concrete Usage Demonstration, Copy-Pasteable Setup, Hook Speed, Hype-Free Credibility, Problem Framing, Progressive Disclosure, Structural Scannability, Value Proposition Clarity","scoring_method":"ridge","traits":["Concrete Usage Demonstration","Copy-Pasteable Setup","Hook Speed","Hype-Free Credibility","Problem Framing","Progressive Disclosure","Structural Scannability","Value Proposition Clarity"]},{"handle":"u22a8.conciseness","description":"Measures whether text communicates efficiently without unnecessary padding. Targets the specific verbosity patterns that plague LLM output: preambles, question-restating, hedging, meta-commentary, filler transitions, and redundant restatement. Based on Phoenix/Arize conciseness evaluator and ConCISE (2025) framework — information density over raw word count.","scoring_method":"ridge","traits":["Hedge Absence","Information Density","Preamble Absence","Repetition Absence","Structural Efficiency"]},{"handle":"u22a8.crisis-comms","description":"Scores the quality of crisis communication — whether it takes ownership, scopes the impact honestly, and provides clear next steps, rather than deflecting or hiding behind corporate platitudes.","scoring_method":"ridge","traits":["Next Steps Clarity","Ownership","Platitude Absence","Scope Honesty","Update Cadence Commitment"]},{"handle":"u22a8.customer-support-response","description":"Measures: Empathy & Acknowledgment, Expectation Setting, Personalization, Resolution Specificity, Tone Calibration","scoring_method":"ridge","traits":["Empathy & Acknowledgment","Expectation Setting","Personalization","Resolution Specificity","Tone Calibration"]},{"handle":"u22a8.developer-landing-page","description":"Measures: Developer Voice, Honest Scope, Path Clarity, Show Don't Tell, Technical Credibility, Time to Understanding, Zero Friction Try","scoring_method":"balanced_cosine","traits":["Developer Voice","Honest Scope","Path Clarity","Show Don't Tell","Technical Credibility","Time to Understanding","Zero Friction Try"]},{"handle":"u22a8.faithfulness","description":"Measures whether every claim in a response is supported by the provided context. The RAGAS faithfulness metric as a learned model — detects when LLMs hallucinate beyond their source material, add unsupported details, or confabulate facts not present in the retrieved documents. Operates at the claim level: a response with 5 claims where 1 is unsupported should score lower than one where all are grounded.","scoring_method":"ridge","traits":["Claim Support","Hedging Calibration","Inference Validity","Scope Respect"]},{"handle":"u22a8.humor","description":"Measures whether text is genuinely funny — not just attempting humor, but landing it. Evaluates the mechanics that make comedy work: surprise, economy, specificity, and structural craft. Replaces LLM-as-judge humor scoring (Braintrust autoevals, Phoenix) with a learned model that captures what separates a laugh from a groan.","scoring_method":"ridge","traits":["Comic Economy","Incongruity & Surprise","Comedic Originality","Specificity of Reference","Tonal Control"]},{"handle":"u22a8.peer-congratulation","description":"Scores the quality of a peer congratulation message — whether it's specific and warm enough to land as genuine, rather than reading as a perfunctory LinkedIn reflex.","scoring_method":"ridge","traits":["Forward-Looking","Personal Connection","Specificity","Warmth Without Excess"]},{"handle":"u22a8.postmortem-ref","description":"Measures: Blamelessness, Impact Transparency, Remediation Commitment, Root Cause Depth, Timeline Specificity","scoring_method":"ridge","traits":["Blamelessness","Impact Transparency","Remediation Commitment","Root Cause Depth","Timeline Specificity"]},{"handle":"u22a8.prospect-research-note","description":"Scores the quality of a prospect research note — whether it surfaces signal-bearing observations that inform outreach, rather than restating generic profile information.","scoring_method":"ridge","traits":["Observation Depth","Open Questions","Outreach Utility","Signal Over Noise"]},{"handle":"u22a8.puns","description":"Measures pun quality along the dimensions that distinguish a Jimmy Carr line from a strained, telegraphed, or over-explained attempt. Built on the bisociation theory of humor (Koestler), the Script-Based Semantic Theory of Humor (Raskin / Attardo), and the empirical finding from Kao, Levy & Goodman that distinctiveness — each meaning anchored to different parts of the carrier sentence — separates great puns from merely-ambiguous ones. Operates at the carrier-sentence level: a pun where every word earns its place, the sound-bridge is recognizable but not identical, and the resolution feels both surprising and retrospectively inevitable should score higher than one whose setup announces the joke or whose punchline restates rather than transforms.","scoring_method":"ridge","traits":["Bisociation Strength","Carrier Naturalness","Distinctiveness","Economy","Phonological Proximity","Resolution Inevitability"]},{"handle":"u22a8.rag-anchored","description":"Measures whether a response is grounded in retrieved context versus floating on model knowledge. Distinct from faithfulness (claim-level accuracy) — this is about style and posture. A grounded response reads like it was written by someone who just read the sources; an unanchored response reads like a model generating from training data with context as decoration. Targets the \"context was retrieved but ignored\" failure mode in RAG systems.","scoring_method":"ridge","traits":["Context Coverage","Context-Specific Detail","Context Vocabulary Uptake","Source Attribution Posture","Source Engagement"]},{"handle":"u22a8.retention-message","description":"Measures: Commitment Specificity, Dignity Preservation, Offer Relevance, Ownership of Failure, Value Reaffirmation","scoring_method":"ridge","traits":["Commitment Specificity","Dignity Preservation","Offer Relevance","Ownership of Failure","Value Reaffirmation"]},{"handle":"u22a8.specificity","description":"Measures how concrete and specific text is versus generic LLM-style prose. The core signal that separates human-distinctive writing from AI-generated filler: proper nouns, numbers, dates, named examples, particular behavioral details. Replaces vibe-check \"does this sound like AI?\" with a learned model that captures the linguistic markers of specificity across domains.","scoring_method":"ridge","traits":["Concrete Reference Density","Example Concreteness","Particular Detail","Quantification","Voice Distinctiveness"]},{"handle":"u22a8.sycophancy","description":"Detects sycophantic behavior in AI-generated text — gratuitous validation, opinion-matching, and performative helpfulness that prioritizes pleasing the user over being truthful or direct. Based on Sharma et al. 2023 (Anthropic/ICLR 2024) taxonomy of sycophancy types. Replaces prompted judges with a learned detector that catches the subtle patterns RLHF trains into language models.","scoring_method":"ridge","traits":["Willingness to Correct","Directness of Communication","Gratuitous Validation","Opinion Independence","Proportional Enthusiasm"]},{"handle":"u22a8.technical-writing","description":"Measures: Actionable Takeaways, Grounded Motivation, Honest Specificity, Incremental Complexity, Narrative Throughline, Progressive Concreteness","scoring_method":"ridge","traits":["Actionable Takeaways","Grounded Motivation","Honest Specificity","Incremental Complexity","Narrative Throughline","Progressive Concreteness"]}]}