Same content, different quality standards. See what each model measures — and where the scores diverge.