Pondral
Reproducibility status

Last week's max drift: 0.0 pp

Pondral runs the same frozen evaluation suite every week against every production engine. We compare each run to the prior week's run and publish the per-engine drift. The number above is the largest engine-level change, in percentage points (pp), between the two most recent runs.
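As a rough illustration of the comparison described above, the sketch below computes per-engine deltas and the headline max drift. The function names, the use of absolute value for "largest change", and the rounding to one decimal place are assumptions made for illustration, not Pondral's actual implementation; the input values are the per-engine means published on this page.

```python
# Per-engine means for the two most recent runs, as published on this page.
last_run = {"ChatGPT": 75.0, "Claude": 75.0, "Gemini": 75.0, "Perplexity": 75.0, "Grok": 75.0}
prior_run = {"ChatGPT": 75.0, "Claude": 75.0, "Gemini": 75.0, "Perplexity": 75.0, "Grok": 75.0}


def per_engine_drift(last, prior):
    """Return the last-minus-prior delta (in pp) for engines present in both runs.

    Hypothetical helper: rounding to one decimal is an assumption that
    mirrors the precision shown in the published table.
    """
    return {
        engine: round(last[engine] - prior[engine], 1)
        for engine in last
        if engine in prior
    }


def max_drift(deltas):
    """Return the largest absolute engine-level change, or 0.0 if there are none.

    Assumption: "largest change" means largest magnitude, regardless of sign.
    """
    return max((abs(delta) for delta in deltas.values()), default=0.0)


deltas = per_engine_drift(last_run, prior_run)
print(f"Max drift: {max_drift(deltas):.1f} pp")  # prints "Max drift: 0.0 pp"
```

Taking the absolute value means a 6 pp drop and a 6 pp gain would both surface as a 6.0 pp drift.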

Per-engine drift, last run vs prior

Engine      Last run mean  Prior week mean  Delta   Status
ChatGPT     75.0           75.0             0.0 pp  Stable
Claude      75.0           75.0             0.0 pp  Stable
Gemini      75.0           75.0             0.0 pp  Stable
Perplexity  75.0           75.0             0.0 pp  Stable
Grok        75.0           75.0             0.0 pp  Stable

A delta under 5 percentage points is within normal run-to-run variance. A delta of at least 5 pp but under 7.5 pp is flagged as moderate. Anything 7.5 pp or higher is flagged as drift and surfaced in the next methodology changelog entry with an explanation.
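The thresholds above can be sketched as a small classifier. This is a minimal illustration, not Pondral's code: the function name is hypothetical, the status labels are taken from the wording on this page, and two assumptions are made explicit in the comments (the sign of the delta is ignored, and a delta of exactly 7.5 pp counts as drift because the text says "7.5 pp or higher").

```python
def classify_delta(delta_pp: float) -> str:
    """Map a week-over-week delta (in percentage points) to a status label.

    Assumptions for illustration:
    - the magnitude of the change matters, not its direction;
    - exactly 7.5 pp is classified as drift, not moderate.
    """
    magnitude = abs(delta_pp)
    if magnitude < 5.0:
        return "Stable"    # within normal run-to-run variance
    if magnitude < 7.5:
        return "Moderate"  # flagged as moderate
    return "Drift"         # surfaced in the next changelog entry


for delta in (0.0, 4.9, 5.0, 7.5, -8.0):
    print(f"{delta:+.1f} pp -> {classify_delta(delta)}")
```

Under these assumptions, every engine in the table above (delta 0.0 pp) is classified as Stable.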

Recent weekly runs

Run date      Overall mean  ChatGPT  Claude  Gemini  Perplexity  Grok
May 17, 2026  75.0          75.0     75.0    75.0    75.0
May 24, 2026  75.0          75.0     75.0    75.0    75.0        75.0
May 31, 2026  75.0          75.0     75.0    75.0    75.0
Jun 7, 2026   75.0          75.0     75.0    75.0    75.0        75.0
Jun 14, 2026  75.0          75.0     75.0    75.0    75.0        75.0
Jun 21, 2026  75.0          75.0     75.0    75.0    75.0        75.0

How this works

Every week, a cron job runs a frozen set of eval queries against every production engine. Each (query, engine) response is scored on a deterministic 4-factor subset of the rubric. The remaining factor, Context, is excluded from this eval because it is scored by an LLM judge, whose non-determinism would pollute the reproducibility signal. Customer scores use the full 5-factor rubric.

For the full rubric, weights, and rater configuration, see the methodology page. Every methodology change is logged in the changelog.

Methodology v2.0.3 · cohort v0-2026-05 · last run Jun 21, 2026 (1 week ago)

Last verified 1 week ago