Reproducibility status

Last week's max drift: 0.0 pp

Pondral runs the same frozen evaluation suite every week against every production engine. We compare each run to the prior week and publish the per-engine drift. The number above is the largest engine-level change between the two most recent runs.

Per-engine drift, last run vs prior

Engine	Last run mean	Prior week mean	Status
ChatGPT	75.0	75.0	Stable
Claude	75.0	75.0	Stable
Gemini	75.0	75.0	Stable
Perplexity	75.0	75.0	Stable
Grok	75.0	75.0	Stable
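
As a rough illustration of how the headline number could be derived from the table above, here is a minimal sketch. It assumes drift is the absolute difference between an engine's two means, and that an engine missing from either run is skipped. The function names are hypothetical and are not taken from Pondral's code.

```python
from typing import Dict, Optional

# Per-engine means copied from the table above.
last_run: Dict[str, Optional[float]] = {
    "ChatGPT": 75.0, "Claude": 75.0, "Gemini": 75.0,
    "Perplexity": 75.0, "Grok": 75.0,
}
prior_run: Dict[str, Optional[float]] = {
    "ChatGPT": 75.0, "Claude": 75.0, "Gemini": 75.0,
    "Perplexity": 75.0, "Grok": 75.0,
}


def per_engine_drift(
    last: Dict[str, Optional[float]],
    prior: Dict[str, Optional[float]],
) -> Dict[str, float]:
    """Absolute change in mean score (pp) for each engine present in both runs.

    Assumption: an engine with no score in either run is skipped rather
    than treated as zero.
    """
    drifts: Dict[str, float] = {}
    for engine, last_mean in last.items():
        prior_mean = prior.get(engine)
        if last_mean is None or prior_mean is None:
            continue
        drifts[engine] = abs(last_mean - prior_mean)
    return drifts


def max_drift(
    last: Dict[str, Optional[float]],
    prior: Dict[str, Optional[float]],
) -> Optional[float]:
    """Largest engine-level change, or None if no engine is comparable."""
    drifts = per_engine_drift(last, prior)
    return max(drifts.values()) if drifts else None


print(max_drift(last_run, prior_run))  # 0.0
```

With every engine at 75.0 in both runs, each per-engine difference is 0.0, so the maximum is 0.0 pp.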

A delta under 5 percentage points (pp) is within normal run-to-run variance. A delta from 5 pp up to, but not including, 7.5 pp is flagged as moderate. A delta of 7.5 pp or higher is flagged as drift and explained in the next methodology changelog entry.
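
The thresholds can be expressed as a small classifier. This is a sketch only: the cut-offs come from the paragraph above, while the function name and the exact label strings "Moderate" and "Drift" are assumptions ("Stable" is the label the table uses).

```python
def classify_delta(delta_pp: float) -> str:
    """Map a week-over-week delta in percentage points to a status label.

    Assumption: the sign of the change does not matter, so the absolute
    value is compared against the thresholds.
    """
    magnitude = abs(delta_pp)
    if magnitude < 5.0:
        return "Stable"    # within normal run-to-run variance
    if magnitude < 7.5:
        return "Moderate"  # 5 pp up to, but not including, 7.5 pp
    return "Drift"         # 7.5 pp or higher


for delta in (0.0, 5.0, 7.5):
    print(delta, classify_delta(delta))
```

Under these assumptions a delta of exactly 5.0 pp falls in the moderate band and exactly 7.5 pp counts as drift, matching the "7.5 pp or higher" wording.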

Recent weekly runs

Run date	Overall mean	ChatGPT	Claude	Gemini	Perplexity	Grok
May 17, 2026	75.0	—	75.0	75.0	75.0	75.0
May 24, 2026	75.0	75.0	75.0	75.0	75.0	75.0
May 31, 2026	75.0	75.0	—	75.0	75.0	75.0
Jun 7, 2026	75.0	75.0	75.0	75.0	75.0	75.0
Jun 14, 2026	75.0	75.0	75.0	75.0	75.0	75.0
Jun 21, 2026	75.0	75.0	75.0	75.0	75.0	75.0

How this works

Every week a cron job runs a frozen set of eval queries against every production engine. Each (query, engine) response is scored on a deterministic 4-factor subset of the rubric. The remaining factor, Context, is excluded from this eval because it is scored by an LLM judge, whose own non-determinism would pollute the reproducibility signal. Customer scores use the full 5-factor rubric.

For the full rubric, weights, and rater configuration, see the methodology page. Every methodology change is logged in the changelog.

Methodology v2.0.3 · cohort v0-2026-05 · last run Jun 21, 2026 (1 week ago)

Last verified 1 week ago