Pondral

Inter-rater reliability for AI answers.

Published Feb 12, 2026 | By Pondral Team | Read time: 14 min

Editor's note (2026-05-18): This post describes inter-rater reliability concepts and the methodology Pondral plans to adopt at scale. The current implementation grades one response per (query, engine) pair and reports the mean across the audit. Repeated sampling (Mention Rate, Quality When Mentioned, and a combined Visibility Index) is available on request but is not enabled on standard audits. Multi-run averaging by default, Cohen's kappa flagging, and 95% confidence intervals are roadmap items intended to add statistical rigor. See /methodology for the shipped specification.

A question we get often: "How do you handle the fact that AI models give different answers each run?"

The short answer is: with statistics. The long answer involves Cohen's kappa, bootstrapped confidence intervals, and a methodology paper we'll publish later this quarter.
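To make the two statistics named above concrete, here is a minimal sketch in plain Python. It is not Pondral's implementation: the function names, the choice to treat two runs of the same query set as two "raters", and the use of a percentile bootstrap on the mean are all assumptions made for illustration.

```python
import random
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def bootstrap_mean_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical data: 1 = brand mentioned, 0 = not, for two runs of 4 queries.
run_1 = [1, 1, 0, 0]
run_2 = [1, 0, 0, 0]
print(cohens_kappa(run_1, run_2))  # 0.5

# Hypothetical rubric scores from repeated runs of one (query, engine) pair.
print(bootstrap_mean_ci([3.0, 4.0, 4.0, 5.0, 2.0, 4.0]))
```

Kappa corrects raw agreement for the agreement two runs would reach by chance, which is why it is more informative than a simple match rate when one label dominates. The bootstrap interval makes no distributional assumption about the scores, at the cost of being unreliable for very small samples.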

The framework described below is the end state we are building toward. Today, Pondral runs each (query, engine) pair once and reports the mean across all results in the audit. Cohen's kappa flagging and multi-run prompt variations are slated for the methodology v3 release. For a plain-English explanation of each factor we measure, see the five-factor rubric, explained slowly.

To restate the shipped behavior: one sample per (query, engine) pair, aggregated into a mean score. We publish the methodology change log so that every score is traceable to a specific rubric version and run configuration. See /methodology.
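As a rough illustration of that single-sample aggregation, the sketch below takes one graded score per (query, engine) pair and returns the audit-wide mean. The function name, the tuple layout, and the per-engine breakdown are hypothetical additions for the example; the shipped specification lives at /methodology.

```python
from collections import defaultdict
from statistics import mean


def aggregate_audit(samples):
    """Aggregate (query, engine, score) tuples, one score per pair."""
    seen = set()
    by_engine = defaultdict(list)
    for query, engine, score in samples:
        # Enforce the single-sample rule: each pair may appear only once.
        if (query, engine) in seen:
            raise ValueError(f"more than one sample for {(query, engine)}")
        seen.add((query, engine))
        by_engine[engine].append(score)
    all_scores = [s for scores in by_engine.values() for s in scores]
    return {
        "overall_mean": mean(all_scores),
        # Per-engine means are an illustrative extra, not a documented output.
        "per_engine_mean": {e: mean(s) for e, s in by_engine.items()},
    }


# Hypothetical audit: two queries graded on two engines.
audit = [
    ("q1", "engine_a", 4.0),
    ("q1", "engine_b", 2.0),
    ("q2", "engine_a", 3.0),
    ("q2", "engine_b", 3.0),
]
print(aggregate_audit(audit))
```

With only one sample per pair, the mean says nothing about run-to-run variance for any individual query, which is the gap that repeated sampling and confidence intervals are meant to close.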

Last updated June 2026