A question we get often: "How do you handle the fact that AI models give different answers each run?"
The short answer is: with statistics. The long answer involves Cohen's kappa, bootstrapped confidence intervals, and a methodology paper we'll publish later this quarter.
For each query in a Pondral audit, we run the prompt three times per engine, each time with a mild variation (a paraphrase or an ordering tweak). We compute mean scores and 95% confidence intervals, and we publish both. The full design rationale behind this approach is explained in our post on how Pondral's scoring rubric was built.
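For the curious, here's a minimal sketch of what "mean plus bootstrapped 95% CI" looks like in practice. The scores, engine names, and the `bootstrap_mean_ci` helper are illustrative placeholders, not our production pipeline:

```python
import numpy as np

def bootstrap_mean_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean of the scores plus a bootstrapped (1 - alpha) confidence interval."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample with replacement and record the mean of each resample.
    resampled = rng.choice(scores, size=(n_resamples, len(scores)), replace=True)
    resampled_means = resampled.mean(axis=1)
    lo, hi = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Per-run scores for two engines (made-up numbers, three varied runs each).
runs = {
    "engine_a": [72, 68, 75],
    "engine_b": [61, 66, 64],
}
for engine, scores in runs.items():
    mean, (lo, hi) = bootstrap_mean_ci(scores)
    print(f"{engine}: mean={mean:.1f}, 95% CI=({lo:.1f}, {hi:.1f})")
```

With only three runs per query the interval is wide by design; that width is exactly the error bar we publish alongside the score.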
Across engines, we calculate a kappa statistic for citation agreement. When kappa drops below 0.6 (meaning the engines genuinely disagree about whether you should be cited), we flag the query as low-confidence in your report. For a plain-English explanation of each factor we measure, see the five-factor rubric, explained slowly.
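A hedged sketch of that agreement check, assuming pairwise Cohen's kappa averaged over engine pairs on binary cited/not-cited labels (the engine names, labels, and `citation_agreement_kappa` helper are made up for illustration):

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def citation_agreement_kappa(labels_by_engine):
    """Mean pairwise Cohen's kappa over binary cited/not-cited labels."""
    kappas = [
        cohen_kappa_score(labels_by_engine[a], labels_by_engine[b])
        for a, b in combinations(labels_by_engine, 2)
    ]
    return sum(kappas) / len(kappas)

# 1 = the engine cited the site for that query variant, 0 = it did not.
labels = {
    "engine_a": [1, 1, 0, 1, 1, 0],
    "engine_b": [1, 0, 0, 1, 1, 1],
    "engine_c": [1, 1, 0, 0, 1, 0],
}
kappa = citation_agreement_kappa(labels)
flag = "low-confidence" if kappa < 0.6 else "ok"
print(f"kappa={kappa:.2f} -> {flag}")
```

The 0.6 cutoff is the conventional "moderate to substantial agreement" boundary for kappa; below it, we'd rather tell you the engines disagree than hand you a falsely precise score.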
The result is that you can trust a Pondral score the same way you'd trust a poll: it's a probabilistic estimate, with a known error bar, derived from a stated methodology.