u22a8.faithfulness
Measures whether every claim in a response is supported by the provided context. This is the RAGAS faithfulness metric implemented as a learned model: it detects when LLMs hallucinate beyond their source material, add unsupported details, or confabulate facts not present in the retrieved documents. It operates at the claim level: a response with 5 claims where 1 is unsupported should score lower than one where all are grounded.
Version: v1 · Status: ready
Every factual claim traceable to provided context ↔ Introduces facts, numbers, or assertions absent from context
Measures whether factual claims in the response can be traced back to the provided context. High-scoring text only asserts things that appear in or can be directly inferred from the source material. Low-scoring text introduces facts, numbers, dates, or assertions that have no basis in the provided context — the hallmark of hallucination.
Conclusions follow logically from what the context states ↔ Logical leaps — unwarranted generalizations or causal claims
Measures whether conclusions or syntheses drawn from context are logically valid. High-scoring text draws only conclusions that follow from the evidence presented. Low-scoring text makes logical leaps — combining facts in ways the source doesn't support, inferring causation from correlation mentioned in context, or generalizing from specific cases beyond what the source warrants.
Stays within context boundaries, doesn't fill gaps with outside knowledge ↔ Seamlessly blends context claims with ungrounded model knowledge
Measures whether the response stays within the boundaries of what the context covers rather than filling gaps with world knowledge. High-scoring text either acknowledges when the context doesn't cover something or leaves the uncovered topic unaddressed. Low-scoring text seamlessly blends context-supported claims with model knowledge in a way that makes it impossible to tell which is which.
Confidence language matches strength of contextual evidence ↔ Overconfident on weak evidence or definitive about ambiguous context
Measures whether confidence language matches the strength of contextual support. High-scoring text uses definitive language only for well-supported claims and hedges appropriately when context is ambiguous. Low-scoring text states weakly-supported claims with full confidence, or applies confident framing to inferences that the context only vaguely suggests.
RAGAS decomposes faithfulness into claim extraction then verification. This model learns the same signal end-to-end: claim support (are facts traceable to context?), inference validity (do conclusions follow?), attribution precision (are details accurately reproduced?), scope respect (does the response stay within context boundaries?), and hedging calibration (does confidence match evidence strength?).
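The extract-then-verify decomposition can be sketched as a ratio of supported claims to total claims. This is only an illustration of the claim-level idea, not how the learned model computes its score: the sentence splitter and the word-overlap check below are toy stand-ins (RAGAS itself uses LLM calls for both steps), and the function names and the 0.8 overlap threshold are assumptions made for this example.

```python
import re


def extract_claims(response: str) -> list[str]:
    """Toy claim extraction: treat each sentence as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def is_supported(claim: str, context: str, threshold: float = 0.8) -> bool:
    """Toy verification: share of the claim's words that also occur in the context."""
    claim_words = set(re.findall(r"[a-z0-9]+", claim.lower()))
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    if not claim_words:
        return True
    overlap = len(claim_words & context_words) / len(claim_words)
    return overlap >= threshold  # threshold is a hypothetical cut-off


def faithfulness_ratio(response: str, context: str) -> float:
    """Supported claims divided by total claims."""
    claims = extract_claims(response)
    if not claims:
        return 1.0  # sketch choice: no claims means nothing unsupported
    supported = sum(is_supported(c, context) for c in claims)
    return supported / len(claims)


context = (
    "The bridge opened in 1932. It spans the harbour. "
    "It carries rail and road traffic. The arch is made of steel."
)
response = (
    "The bridge opened in 1932. It spans the harbour. "
    "It carries rail and road traffic. The arch is made of steel. "
    "It cost ten million pounds."
)

# Four of the five claims are grounded; the cost claim is not in the context.
print(faithfulness_ratio(response, context))  # 0.8
```

The example mirrors the "5 claims, 1 unsupported" case: the fabricated cost drags the ratio below 1.0 while the four grounded sentences still count in the response's favour.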
The key failure mode this catches: LLMs that seamlessly blend context-supported facts with hallucinated details, making it impossible for users to tell which is which. High faithfulness means the response treats context as a boundary rather than a starting point for elaboration.
Requires context to be provided alongside the response for meaningful scoring. Cannot assess whether the context itself is correct; it only measures whether the response is faithful to what was provided. Responses that appropriately note context gaps ("the provided documents don't cover X") should score well, but a response that trivially refuses to answer also scores high, because it asserts nothing unsupported. Best paired with u22a8.answer-relevancy.
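One way to act on the pairing advice is to require both scores to clear a bar, so that a refusal with high faithfulness but low relevancy does not pass. The sketch below assumes both scores fall in the range 0 to 1 with higher being better; that range, the function name, and the 0.7 thresholds are assumptions for illustration, and the example scores are made up rather than real model output.

```python
def passes_rag_check(
    faithfulness: float,
    relevancy: float,
    min_faithfulness: float = 0.7,  # hypothetical threshold
    min_relevancy: float = 0.7,     # hypothetical threshold
) -> bool:
    """Accept a response only if it is both grounded and on topic."""
    return faithfulness >= min_faithfulness and relevancy >= min_relevancy


# Made-up scores for three kinds of response.
grounded_answer = passes_rag_check(faithfulness=0.9, relevancy=0.85)
trivial_refusal = passes_rag_check(faithfulness=0.95, relevancy=0.1)
fluent_hallucination = passes_rag_check(faithfulness=0.3, relevancy=0.9)

print(grounded_answer, trivial_refusal, fluent_hallucination)  # True False False
```

A conjunction is used rather than an average so that a very high score on one metric cannot compensate for a failure on the other.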
u22a8.answer-relevancy: the "right topic, right facts" pair
u22a8.rag-anchored: style-level groundedness complements claim-level faithfulness
u22a8.specificity: hallucinated content often lacks the specificity of real sourced facts
$ curl -s -d "your content here" \
  https://u22a8.ai/m/u22a8.faithfulness
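For callers working in Python, the curl command can be roughly mirrored with the standard library. This is a sketch under assumptions: it posts the content as the raw request body to the same URL, as the curl line does, and returns the response body as text without parsing it, since the response format is not described here. How the context should be packaged alongside the response is also not specified here, so that is left to the caller. The function names are hypothetical.

```python
import urllib.request

ENDPOINT = "https://u22a8.ai/m/u22a8.faithfulness"


def build_request(content: str) -> urllib.request.Request:
    """Build a POST request carrying the content as the raw body."""
    return urllib.request.Request(
        ENDPOINT,
        data=content.encode("utf-8"),
        method="POST",
    )


def score(content: str, timeout: float = 10.0) -> str:
    """Send the request and return the raw response body (format not assumed)."""
    with urllib.request.urlopen(build_request(content), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Performs a real network call; replace the placeholder with actual content.
    print(score("your content here"))
```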