Last week's max drift: 0.0 pp
Pondral runs the same frozen evaluation suite every week against every production engine. We compare each run to the prior week's run and publish the per-engine drift in percentage points (pp). The number above is the largest engine-level change between the two most recent runs.
Per-engine drift, last run vs prior
| Engine | Last run mean | Prior week mean | Delta | Status |
|---|---|---|---|---|
| ChatGPT | 75.0 | 75.0 | 0.0 pp | Stable |
| Claude | 75.0 | 75.0 | 0.0 pp | Stable |
| Gemini | 75.0 | 75.0 | 0.0 pp | Stable |
| Perplexity | 75.0 | 75.0 | 0.0 pp | Stable |
| Grok | 75.0 | 75.0 | 0.0 pp | Stable |
A delta under 5 percentage points is within normal run-to-run variance. A delta of at least 5 pp but under 7.5 pp is flagged as moderate. A delta of 7.5 pp or more is flagged as drift and surfaced in the next methodology changelog entry with an explanation.
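As a rough illustration, the thresholds above could be applied as in the Python sketch below. This is not Pondral's actual code. The function names, the "Moderate" and "Drift" label strings, and the choice to compare the absolute value of the delta (so that a drop counts the same as a rise) are all assumptions made for the example. Only the 5 pp and 7.5 pp cutoffs and the "Stable" label come from this page.

```python
from typing import Dict

# Cutoffs taken from the page; everything else here is a hypothetical sketch.
MODERATE_AT_PP = 5.0
DRIFT_AT_PP = 7.5


def classify_delta(delta_pp: float) -> str:
    """Map a week-over-week delta (in percentage points) to a status label."""
    magnitude = abs(delta_pp)  # assumption: direction of change does not matter
    if magnitude >= DRIFT_AT_PP:
        return "Drift"      # assumed label
    if magnitude >= MODERATE_AT_PP:
        return "Moderate"   # assumed label
    return "Stable"


def per_engine_deltas(last: Dict[str, float], prior: Dict[str, float]) -> Dict[str, float]:
    """Delta per engine, only for engines present in both runs."""
    return {name: last[name] - prior[name] for name in last if name in prior}


def max_drift(deltas: Dict[str, float]) -> float:
    """Largest absolute engine-level change; 0.0 if there is nothing to compare."""
    return max((abs(d) for d in deltas.values()), default=0.0)


# Values from the per-engine table above.
last_run = {"ChatGPT": 75.0, "Claude": 75.0, "Gemini": 75.0, "Perplexity": 75.0, "Grok": 75.0}
prior_run = {"ChatGPT": 75.0, "Claude": 75.0, "Gemini": 75.0, "Perplexity": 75.0, "Grok": 75.0}

deltas = per_engine_deltas(last_run, prior_run)
statuses = {name: classify_delta(d) for name, d in deltas.items()}
headline = max_drift(deltas)  # 0.0 for the table above
```

With the table's values, every engine classifies as Stable and the headline figure is 0.0 pp, matching the number at the top of the page.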
Recent weekly runs
| Run date | Overall mean | ChatGPT | Claude | Gemini | Perplexity | Grok |
|---|---|---|---|---|---|---|
| May 17, 2026 | 75.0 | — | 75.0 | 75.0 | 75.0 | 75.0 |
| May 24, 2026 | 75.0 | 75.0 | 75.0 | 75.0 | 75.0 | 75.0 |
| May 31, 2026 | 75.0 | 75.0 | — | 75.0 | 75.0 | 75.0 |
| Jun 7, 2026 | 75.0 | 75.0 | 75.0 | 75.0 | 75.0 | 75.0 |
| Jun 14, 2026 | 75.0 | 75.0 | 75.0 | 75.0 | 75.0 | 75.0 |
| Jun 21, 2026 | 75.0 | 75.0 | 75.0 | 75.0 | 75.0 | 75.0 |
How this works
Every week, a cron job runs a fixed, frozen set of eval queries against every production engine. Each (query, engine) response is scored on a deterministic 4-factor subset of the rubric. The remaining factor, Context, is excluded from this eval because it is scored by an LLM judge, whose own non-determinism would pollute the reproducibility signal. Customer scores use the full 5-factor rubric.
For the full rubric, weights, and rater configuration, see the methodology page. Every methodology change is logged in the changelog.
Methodology v2.0.3 · cohort v0-2026-05 · last run Jun 21, 2026 (1 week ago)