u22a8.rag-anchored
Measures whether a response is grounded in retrieved context versus floating on model knowledge. Distinct from faithfulness (claim-level accuracy) — this is about style and posture. A grounded response reads like it was written by someone who just read the sources; an unanchored response reads like a model generating from training data with context as decoration. Targets the "context was retrieved but ignored" failure mode in RAG systems.
Version: v1 · Status: ready
Actively references, builds on, or synthesizes from source material ↔ Could have been written without seeing the context — generic on-topic prose
Measures whether the response actively engages with the provided context — referencing specific passages, building on particular details, or synthesizing across source sections. High-scoring text reads like it was composed while looking at the sources. Low-scoring text could have been written without ever seeing the context — generic knowledge-based prose that happens to be on-topic.
Uses terminology, names, and framings from the source material ↔ Substitutes generic synonyms, ignores source-specific vocabulary
Measures whether the response uses terminology, phrasing, and entities from the provided context rather than substituting generic synonyms. High-scoring text adopts the source's vocabulary — using the same names, terms, and framings. Low-scoring text paraphrases everything into generic language, replacing specific context terms with vaguer alternatives.
Surfaces non-obvious details found only in the provided context ↔ States only general domain knowledge available without the context
Measures whether the response includes details that could only come from the provided context rather than general knowledge. High-scoring text surfaces non-obvious information from the sources — specific configurations, edge cases, or nuances found only there. Low-scoring text sticks to general truths about the topic that anyone familiar with the domain would know without the context.
Positions itself as reporting from sources, signals provenance ↔ Presents context-derived information as authoritative model knowledge
Measures whether the response's framing signals awareness of its sources — "according to the documentation", "the config shows", "based on the retrieved data" — versus presenting everything as authoritative model knowledge. High-scoring text positions itself as reporting from sources. Low-scoring text positions itself as an authority generating knowledge, even when that knowledge came from context.
Synthesizes across multiple relevant context passages ↔ Latches onto one fragment, ignores other relevant retrieved material
Measures whether the response uses the breadth of provided context rather than cherry-picking one fragment and ignoring the rest. High-scoring text synthesizes across multiple context passages or sections when relevant. Low-scoring text latches onto one sentence or chunk and ignores other retrieved material that would enrich or qualify the answer.
Measures whether a response is grounded in retrieved context versus floating on model knowledge.
Distinct from faithfulness (which checks claim-level accuracy), this model captures the posture and style of a grounded response. Does it read like someone who just read the sources and is reporting from them? Or does it read like a model generating from training data, with retrieved context as window dressing? This targets the "context was retrieved but ignored" failure mode common in RAG systems.
Five traits: source engagement (actively references and builds on source material), context vocabulary uptake (uses the source's terminology rather than generic synonyms), context-specific detail (surfaces non-obvious information from context), source attribution posture (signals provenance rather than presenting as authoritative), and context coverage (synthesizes across retrieved material rather than cherry-picking one fragment).
Some response styles legitimately synthesize context into a natural voice without explicit attribution — this isn't necessarily a quality failure. The model may penalize well-integrated responses that fully absorb context but present it conversationally. Works best when the evaluation use case values traceability and transparency over seamless prose.
u22a8.faithfulness — claim-level accuracy + style-level groundednessu22a8.answer-relevancy — together they form the full RAG evaluation triadu22a8.specificity — context-grounded responses tend to be specific$ curl -s -d "your content here" \
https://u22a8.ai/m/u22a8.rag-anchored