Pondral

Methodology Changelog

Scoring Methodology Version History

Every change to Pondral's scoring rubric, factor weights, confidence interval calculations, and rater configuration is documented here. We believe reproducibility requires knowing not just how the system works today, but how it has changed over time.

For the current methodology, see the full methodology page. For the design rationale behind our scoring decisions, read How We Built Pondral's Scoring Rubric.


v2.0.3

  • Production audit-pipeline engine update: the ChatGPT adapter moved from OpenAI GPT-4o with the deprecated web_search_preview tool to GPT-5.5 with the current web_search tool, completing the migration the free checker shipped earlier the same day (see v2026-06-10). OpenAI retires the GPT-4o search-preview path on 2026-07-23. The 5-factor rubric and factor weights are unchanged. Audit results produced after 2026-06-10 carry methodology_version "2.0.3" in the audit_results table.
  • Grounding is now enforced on the audit pipeline's ChatGPT engine: tool_choice "required" forces a live web search on every scored call, and a tripwire records any response that did not actually run one as a failed measurement instead of scoring it.
  • Shadow-audit finding, disclosed for transparency: the outgoing GPT-4o adapter ran a live web search on only 25% of the 24-query validation panel, answering evergreen category questions from training data even with the search tool enabled. This is the same structural defect found and fixed in the Grok adapter in May (v2.0.1).
  • Cutover validation record: correlation with the outgoing adapter was r=0.632, below the 0.75 continuity threshold — expected, because an ungrounded baseline is not a valid comparator. Against the web-grounded Claude reference panel (the gate used for the May Grok cutover) the new adapter scored r=0.777, and r=0.956 after excluding one query affected by a known validation-script substring artifact. Grounding rate 100%, zero errors. Both CTO and CAIO signed off 2026-06-10; the full artifact is at scripts/logs/openai-shadow-audit-2026-06-10T19-25-17.json with the adjudication recorded in docs/engine-adapter-health.md.
  • Expected score movement: grounded answers surface more long-tail brands, so ChatGPT-engine scores for niche brands can move up after 2026-06-10. In the validation panel, every old-vs-new score difference occurred on queries the old adapter had not searched, and the reference engine sided with the new adapter on four of six. Established-brand scores were unchanged. A post-cutover drift check against the first five production audits is scheduled by 2026-06-17.
  • Determinism note: GPT-5.5 does not accept temperature pinning (the old adapter requested temperature 0). Run-to-run variance expectations are unchanged from the v2.0.2 findings; repeated sampling remains available on request.
  • Customer-facing engine labels updated from ChatGPT (GPT-4o) to ChatGPT (GPT-5.5) across the dashboard, audit modals, and methodology pages.
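
The cutover gate described in this entry compares per-query scores from two adapters and checks the Pearson correlation against the 0.75 continuity threshold. A minimal sketch of that check, assuming scores arrive as two equal-length lists in the same query order; the function names are hypothetical and not taken from the Pondral codebase:

```python
from math import sqrt

# Continuity threshold stated in this changelog (r >= 0.75).
CONTINUITY_THRESHOLD = 0.75


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def passes_continuity_gate(new_scores, reference_scores,
                           threshold=CONTINUITY_THRESHOLD):
    """True when the new adapter correlates with the reference at or above the threshold."""
    return pearson_r(new_scores, reference_scores) >= threshold
```

Under this sketch, the r=0.632 comparison against the outgoing adapter would fail the gate, while the r=0.777 comparison against the reference panel would pass it.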

v2026-06-10

  • Free-checker engine update (quick-check and /analyze adapters only; the production audit pipeline's engine roster is unchanged): the ChatGPT adapter moved from OpenAI GPT-4o with the deprecated web_search_preview tool to GPT-5.5 with the current web_search tool. OpenAI retires the GPT-4o search-preview path on 2026-07-23.
  • Grounding is now enforced on the free checker's ChatGPT adapter: tool_choice "required" forces a live web search on every call, and a tripwire rejects any response that did not actually run one. Previously the model could answer from training data; an ungrounded response is no longer scored.
  • Honest-coverage disclosure: free-check responses now report questions attempted vs. completed and engine calls attempted vs. succeeded, and the results UI flags partial scores instead of silently averaging whatever survived.
  • No change to the 5-factor rubric, factor weights, or audit-pipeline scoring. METHODOLOGY_VERSION remains 2.0.1.
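
The grounding tripwire and the coverage disclosure above can be illustrated together. This is a sketch under stated assumptions: the `EngineCall` record, its `ran_web_search` flag, and the output keys are hypothetical, and treating an ungrounded response as a call that did not succeed is an interpretation of this entry rather than confirmed behavior:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EngineCall:
    """Hypothetical record of one engine call in a free check."""
    succeeded: bool                 # the call returned a response
    ran_web_search: bool            # the response actually ran a live web search
    score: Optional[float] = None   # score for the response, if one was computed


def summarize_coverage(calls):
    """Report attempted vs. succeeded calls and flag partial scores."""
    # Tripwire: a response that did not run a web search is never scored.
    scored = [
        c for c in calls
        if c.succeeded and c.ran_web_search and c.score is not None
    ]
    return {
        "calls_attempted": len(calls),
        "calls_succeeded": len(scored),
        # Flag the result instead of silently averaging whatever survived.
        "partial": len(scored) < len(calls),
        "score": sum(c.score for c in scored) / len(scored) if scored else None,
    }
```

With three attempted calls of which one is grounded and scored, the summary reports 3 attempted, 1 succeeded, and `partial` set to true.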

v2.0.2

  • Ran the first auditable reproducibility test on the v2.0.x methodology. 5 buyer-intent queries × 5 engines (OpenAI GPT-4o, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Flash, Perplexity Sonar Pro, xAI Grok 4.3) × 3 independent trials. Target brand: Profound (cited in this query topic). Competitors: Otterly, Peec, Pondral. 71 of 75 calls succeeded.
  • Run-level aeo_score variance (mean across every (query, engine) result, the customer-facing number on the dashboard): max absolute deviation 1.74 points across the 3 trials. Within normal run-to-run variance. The fixed ±5-point reproducibility target was subsequently retired; repeated sampling was added as an optional mode.
  • Per-cell (query, engine) result variance: 15 of 23 measurable pairs reproduced within 5 points absolute. 8 of 23 exceeded that band; worst case 38.37 points (Q2 GROK: scores 63.8, 10.0, 71.3 across the 3 trials). Variance is intrinsic to AI engine outputs, not a scoring bug. This finding motivated adding repeated sampling as an available mode for brands that need to measure per-cell consistency.
  • Claim revision: replaced the ambiguous "±5% on re-run" wording across ~22 public surfaces. The fixed-threshold target was retired; standard audits grade one response per (query, engine) pair, and repeated sampling across independent trials is available on request.
  • Methodology v3 roadmap (deferred): multi-run sampling per query, t-distribution confidence intervals on per-cell scores, Cohen's kappa inter-rater statistics. These would tighten per-cell variance toward the run-level bound. Currently unscheduled.
  • Audit artifact: docs/reproducibility-audit-2026-05-19T12-11-51.json (full per-trial scores, deviations, and engine health).
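
The worst-case figure in this entry can be reproduced from the three Q2 GROK trial scores. The sketch below assumes that "max absolute deviation" means the largest distance of any trial from the mean of the trials; that reading matches the reported 38.37 points, but the function name and the band check are illustrative:

```python
def max_abs_deviation(scores):
    """Largest absolute distance of any trial score from the mean of the trials."""
    mean = sum(scores) / len(scores)
    return max(abs(s - mean) for s in scores)


# Q2 GROK trial scores quoted in this entry.
q2_grok = [63.8, 10.0, 71.3]

deviation = max_abs_deviation(q2_grok)
print(round(deviation, 2))   # 38.37
print(deviation <= 5.0)      # False: outside the 5-point band
```

The mean of the three trials is about 48.37, so the 10.0 trial sits about 38.37 points away from it, which is the worst case reported above.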

v2.0.1

  • Correction to v2.0.0 entry: the Grok engine adapter listed as "grok-3" actually runs grok-4.3 in production (migrated 2026-05-04, PR #103). Shadow audit post-correction passed at r=0.866 (threshold r≥0.75); grounding rate 100%; zero errors. Both CTO and CAIO signed off 2026-05-04.
  • Clarified engine-weight semantics: the planned explicit weighting referenced in v2.0.0 (chatgpt 0.35, claude 0.25, gemini 0.20, grok 0.20) was scaffolding for future use. The actual production AI Visibility Score (formerly AEO Score) formula computes the arithmetic mean of per-result methodology_score values across all engines, so engine weights are implicit at 1÷n_engines today. No engine is explicitly upweighted in the code.
  • No customer-facing score change. Audit results produced after 2026-05-10 carry methodology_version "2.0.1" in the audit_results table.
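
The implicit-weight point above follows directly from taking an unweighted mean. A minimal sketch, assuming each result is an (engine, methodology_score) pair; the function names and the sample scores are hypothetical:

```python
from collections import Counter


def ai_visibility_score(results):
    """Arithmetic mean of per-result methodology scores across all engines."""
    if not results:
        raise ValueError("no results to average")
    return sum(score for _, score in results) / len(results)


def implicit_engine_weights(results):
    """Share of the mean contributed by each engine's results."""
    counts = Counter(engine for engine, _ in results)
    return {engine: count / len(results) for engine, count in counts.items()}


# Illustrative data only: one result per engine.
results = [("chatgpt", 80), ("claude", 60), ("gemini", 40), ("grok", 20)]
print(ai_visibility_score(results))        # 50.0
print(implicit_engine_weights(results))    # each engine at 0.25
```

When every engine returns the same number of results, each engine's share is 1÷n_engines, as the entry states. In this sketch an engine with fewer surviving results would carry a proportionally smaller share.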

v2.0.0

  • Reconciled platform scoring with published methodology: the customer-facing AI Visibility Score (formerly AEO Score) is now the arithmetic mean of per-observation 5-factor composite scores (Presence, Prominence, Context, Citation Link, Competitive Presence)
  • Retired the interim 4-component formula (Brand SOV 15% + Generic SOV 45% + Owned Citations 30% + Grounded Mentions 10%) that had been in use since launch
  • Context factor (Factor 3) graded by Claude Haiku via dedicated LLM judge per observation, replacing keyword-based sentiment heuristic
  • Prominence factor uses character-ratio position (brand appearance offset / response length), consistent across all engines
  • Per-result methodology scores stored in audit_results.methodology_score alongside raw data for full auditability
  • SOV metrics (brand-scope, generic-scope, per-theme, per-competitor) remain unchanged and continue to appear on dashboards alongside the reconciled AI Visibility Score
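
The character-ratio position used by the Prominence factor is the brand's first appearance offset divided by the response length. A minimal sketch of that ratio; the case-insensitive match and the `None` return for an absent brand are assumptions, and the mapping from the ratio to a score is not shown because this changelog does not specify it:

```python
def prominence_ratio(response, brand):
    """Offset of the brand's first appearance divided by the response length.

    Returns None when the response is empty or the brand does not appear.
    Lower values mean the brand appears earlier in the response.
    """
    text = response.lower()          # assumption: matching ignores case
    if not text:
        return None
    offset = text.find(brand.lower())
    if offset == -1:
        return None
    return offset / len(text)


print(prominence_ratio("Pondral is one option among several.", "pondral"))  # 0.0
```

Because the ratio is normalized by response length, it is comparable across engines that produce answers of very different sizes.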

v2026-04-16

  • Published initial methodology documentation at /methodology
  • Defined 5-factor scoring rubric: Presence (20%), Prominence (25%), Context (20%), Citation Link (20%), Competitive Presence (15%)
  • Established ordinal bucket scoring (0, 25, 50, 75, 100) for all factors
  • Published initial reproducibility approach. The original fixed-threshold target was subsequently retired; standard audits grade one response per (query, engine) pair, with repeated sampling across independent trials available on request. Drift above expected variance is logged in this changelog with a root-cause note. Per-cell variance is visible in raw evidence.
  • Configured separate LLM rater for Context factor (Factor 3) to avoid self-assessment bias
  • Launched "View raw" transparency feature: every score shows prompt, response, timestamp, and rater model version
  • Methodology roadmap items (not in this release): multi-run sampling per query, t-distribution confidence intervals, IQR outlier detection, Cohen's kappa inter-rater agreement statistics. Targeted for methodology v3.
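
The rubric weights and ordinal buckets defined in this entry combine into a per-observation composite. The sketch below assumes the composite is the weighted sum of the five bucket scores; the dictionary keys and the function name are hypothetical:

```python
# Factor weights in percent, as listed in this entry (they sum to 100).
WEIGHTS_PCT = {
    "presence": 20,
    "prominence": 25,
    "context": 20,
    "citation_link": 20,
    "competitive_presence": 15,
}

# Ordinal buckets allowed for every factor.
BUCKETS = {0, 25, 50, 75, 100}


def composite_score(factors):
    """Weighted sum of the five bucketed factor scores, on a 0-100 scale."""
    missing = WEIGHTS_PCT.keys() - factors.keys()
    if missing:
        raise ValueError(f"missing factors: {sorted(missing)}")
    for name in WEIGHTS_PCT:
        if factors[name] not in BUCKETS:
            raise ValueError(f"{name} must be one of {sorted(BUCKETS)}")
    return sum(WEIGHTS_PCT[name] * factors[name] for name in WEIGHTS_PCT) / 100


# Illustrative observation only.
observation = {
    "presence": 100,
    "prominence": 75,
    "context": 50,
    "citation_link": 0,
    "competitive_presence": 25,
}
print(composite_score(observation))  # 52.5
```

For the sample observation the weighted terms are 20.00 + 18.75 + 10.00 + 0.00 + 3.75, which gives 52.5.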

v2026-04-23

  • Published detailed design rationale at /blog/how-pondral-scoring-works
  • Documented weight selection process: Prominence weighted highest (25%) based on click-through correlation backtests
  • Documented Competitive Presence weight reduction from original 25% to 15% due to volatility concerns
  • Documented scoring bucket expansion from original 3-bucket (0, 50, 100) to 5-bucket (0, 25, 50, 75, 100) system
  • Added AEO glossary at /blog/aeo-glossary with DefinedTermSet schema for all scoring terminology