What did Pondral find when it audited itself?

Pondral discovered that its published five-factor methodology and its production code used two different scoring formulas. The published formula was a weighted sum of Presence, Prominence, Context, Citation Link, and Competitive Presence. The production code used a four-factor SOV-based formula. Both were defensible, but they did not match. Three production bugs also surfaced: a missing database column, an environment variable mismatch, and a stale model reference.

What is Pondral's AI visibility score?

When Pondral audited itself, its branded mention rate (when people ask AI 'what is Pondral?') was 86%. Its generic mention rate (when people ask AI 'best AI visibility platform') was 0%. This is consistent with a new brand with no Wikipedia entry, no G2 listing, and no third-party press coverage.

How reliable is Pondral's Context grader?

Pondral measured its Context grader's inter-rater reliability at Cohen's kappa 0.947 (near-perfect agreement) using Claude Haiku and Claude Sonnet as independent graders on 30 samples. The grader was also calibrated against a 50-sample human-labeled gold set, achieving 88% exact match and 98% within-one-grade agreement (Pearson r = 0.957).

Does Pondral's scoring formula rank brands correctly?

In a construct validation study across 20 brands spanning household names (Stripe, Figma, Notion) to startups, Pondral's formula achieved a Spearman rank correlation of 0.961. Brands in the highest visibility tier scored 5.36 times higher than brands in the lowest tier on average.

We Audited Pondral with Pondral.
Here's What We Found.

Published Jun 21, 2026 | By Pondral Team | Read time: 12 min

We ran our own AI visibility tool against our own brand, then against 20 other brands with known visibility profiles. Three things came out of it.

1. Our published methodology and our production code didn't match. They do now.
2. Our actual AI visibility on category queries is essentially zero. That's not a failure, it's a measurement. Now we can move it.
3. The reconciled formula ranks brands correctly. Across 20 brands spanning household names to startups no one has heard of, the score tracked real-world visibility with a rank correlation of 0.961.

If you're using an AI visibility tool, including ours, and the vendor can't show you data like this, ask why.

Why we did this

Most B2B SaaS tools in our category give customers a number, “your AI visibility score is 47,” without showing how that number was computed, what it correlates with, or whether a rerun would produce the same number. We promised something different on /methodology: a public formula, evidence behind every score, and a versioned changelog. By default we grade one response per query and engine and report the mean, with repeated sampling across independent runs available on request.

In May we ran our own audit against ourselves and discovered we hadn't fully kept that promise. Two formulas existed in our system. The one we published was a 5-factor weighted-sum. The one our code actually used was a 4-factor SOV-based formula. They were both reasonable, but they weren't the same. Customers who read our methodology page would have computed a different number than the one we showed them.

This post is the audit trail: what we found, the math we changed, and the data showing the change is real.

What we found

1. Two formulas, one product

The /methodology page described five factors with specific weights:

Factor	Weight
Presence	20%
Prominence	25%
Context	20%
Citation Link	20%
Competitive Presence	15%

That formula lived in the codebase, correctly implemented. But it wasn't what produced your customer score. The number on your dashboard came from a separate aggregator computing: 37% Brand SOV + 33% Generic SOV + 25% Owned Citations + 5% Grounded Mentions.

Both formulas are defensible. Neither was wrong. But they weren't the same, and the gap between them was invisible to customers, and to most of us internally.
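To make the gap concrete, here is a minimal sketch of the two weighted sums using the weights quoted above. It assumes each factor is normalized to the range 0 to 1 and that the result is scaled to 0 to 100; the post does not say how the production code normalizes inputs, so the function name `weighted_score`, the dictionary keys, and the example inputs are all hypothetical.

```python
# Weights as quoted in this post. Keys are illustrative names, not Pondral's identifiers.
FIVE_FACTOR_WEIGHTS = {
    "presence": 0.20,
    "prominence": 0.25,
    "context": 0.20,
    "citation_link": 0.20,
    "competitive_presence": 0.15,
}
FOUR_FACTOR_WEIGHTS = {
    "brand_sov": 0.37,
    "generic_sov": 0.33,
    "owned_citations": 0.25,
    "grounded_mentions": 0.05,
}


def weighted_score(factors, weights):
    """Return a 0-100 score: the weighted sum of factor values assumed to lie in [0, 1]."""
    missing = set(weights) - set(factors)
    if missing:
        raise KeyError(f"missing factors: {sorted(missing)}")
    return 100.0 * sum(weights[name] * factors[name] for name in weights)


# Hypothetical inputs for one brand, made up purely for illustration.
published_inputs = {
    "presence": 0.6,
    "prominence": 0.4,
    "context": 0.5,
    "citation_link": 0.3,
    "competitive_presence": 0.2,
}
production_inputs = {
    "brand_sov": 0.6,
    "generic_sov": 0.1,
    "owned_citations": 0.3,
    "grounded_mentions": 0.2,
}

print("5-factor:", weighted_score(published_inputs, FIVE_FACTOR_WEIGHTS))
print("4-factor:", weighted_score(production_inputs, FOUR_FACTOR_WEIGHTS))
```

The two formulas do not even take the same inputs, so a customer working from the published weights had no way to reproduce the dashboard number.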

2. Our own visibility was barely measurable

When we ran the existing 4-factor formula against pondral.com on 2026-05-05, here's what we got:

Branded mentions: 86%
Generic mentions: 0%

Not 30%. Not “limited.” Zero. Across 18 attempts spanning ChatGPT, Gemini, Perplexity, and Claude, asking category questions without naming us, we appeared in zero responses.

This is exactly what a brand-new tool with no Wikipedia entry, no G2 listing, and no third-party press should expect. But until we measured it, we were guessing.

3. Three production bugs surfaced during the run

The audit didn't run cleanly the first time. Three real bugs, each of which had been silently breaking production code, came out:

A missing database column. Our audit pipeline tried to write ‘sentiment’ to a column that didn't exist in any migration. Every audit attempt failed at insert time.

An environment variable name mismatch. Our Gemini provider read GOOGLE_API_KEY while the actual variable was GOOGLE_AI_API_KEY. Every Gemini call silently failed.

A stale model reference. Our Anthropic provider referenced a model name that doesn't exist on their API. Every Claude call returned 404.
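The second bug is the kind a startup check can catch before any engine call is made. The sketch below shows one possible fail-fast guard. It is not Pondral's actual fix: `require_env` and `MissingConfigError` are hypothetical names, and only the variable names GOOGLE_API_KEY and GOOGLE_AI_API_KEY come from the audit.

```python
import os


class MissingConfigError(RuntimeError):
    """Raised when a required environment variable is absent or empty."""


def require_env(name, environ=None):
    """Return the value of a required variable, or fail loudly instead of silently."""
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        raise MissingConfigError(f"required environment variable {name} is not set")
    return value


# Illustration with a stand-in environment so the example does not depend on real secrets.
fake_env = {"GOOGLE_AI_API_KEY": "example-key"}
print(require_env("GOOGLE_AI_API_KEY", fake_env))
```

Called once at process start for each provider, a check like this turns a silent per-call failure into a single visible error at boot.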

We're disclosing these because methodology integrity isn't optional and isn't private. If we hide the boring failures, you have no reason to trust the numbers we publish.

What we changed

Reconciled formula (v2.0.3)

We wired the published 5-factor module into the actual production code path. Now when you see a score, it was computed from the formula on /methodology, not from a different formula behind it. The old 4-component weight constants are deprecated in the codebase.

A reliable Context grader

The 5-factor formula has a “Context” component that grades whether the AI's mention of you is accurate, in the right category, and not damaging. That requires an LLM judge. We needed to measure how reliable that judge is.

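We ran Claude Haiku and Claude Sonnet as independent graders on 30 samples, then calibrated the grader against a 50-sample human-labeled gold set (88% exact match, 98% within one grade).

Cohen's kappa: 0.947
Pairwise agreement: 96.7%
Calibration Pearson r: 0.957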

Two caveats. Both graders are from the same provider (Anthropic), which may inflate agreement. And the 30 samples were hand-selected, not randomly drawn. A cross-provider study (Claude Sonnet vs GPT-4o on 50 representative samples) is on the roadmap.
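For readers who want to compute the same kind of statistic on their own grader outputs, here is a minimal sketch of unweighted Cohen's kappa for two raters. The post does not say whether the reported 0.947 is weighted or unweighted, so the unweighted form is an assumption, and the labels in the example are invented, not the audit data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels for illustration only.
grader_one = ["A", "A", "B", "B"]
grader_two = ["A", "B", "B", "B"]
print(cohens_kappa(grader_one, grader_two))
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, which is why it is reported alongside the pairwise agreement figure rather than instead of it.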

Construct validation: does the formula actually rank brands correctly?

We tested 20 brands against the reconciled formula. We grouped them into three tiers based on their real-world AI visibility: T1 (household names that AI regularly recommends), T2 (established but not dominant), and T3 (niche or new, rarely mentioned by AI unprompted).

Spearman rank correlation: 0.961
T1 vs T3 score ratio: 5.36x
Brands tested: 20

Tier	Mean score	Example brands
T1 (high visibility)	53.6	Figma, Stripe, Salesforce, HubSpot, Notion
T2 (moderate)	28.8	Webflow, Brex, Intercom, Airtable
T3 (low/none)	10.0	Cord, Luma AI, Attio, Tercera, Pondral

Two caveats on this run. Context was held at the neutral bucket for all brands because the grader was still in calibration that day. And Gemini was excluded due to quota limits, with Claude partial (rate-limited mid-run). A clean confirmation run with all 5 engines and live Context grading is planned for the next methodology cycle.
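The two headline statistics can be reproduced in a few lines. The sketch below recomputes the T1 vs T3 ratio from the tier means in the table and shows a tie-aware Spearman correlation. The post does not specify exactly which two rankings were correlated, so treating it as expected rank versus score is an assumption, and the function names are illustrative.

```python
from statistics import mean


def average_ranks(values):
    """Rank values ascending (1-based), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Tier means taken from the table above.
tier_means = {"T1": 53.6, "T2": 28.8, "T3": 10.0}
print(round(tier_means["T1"] / tier_means["T3"], 2))
```

Average ranks matter here because a tiered expectation produces many ties: every brand in a tier shares the same expected rank.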

All scores in this study reflect publicly observable AI engine responses and do not use any proprietary data from the named companies. Tier assignments are based on frequency and quality of unprompted AI mentions, measured by the same methodology we apply to every Pondral customer.

What this means for your number

We have a near-zero active customer base as of this writing, so a statistical score-shift analysis would be misleading. Rather than pretend otherwise, here is what applies: if you are an active customer and your score changed, you'll see the before/after on the methodology changelog at /methodology/changelog. If the change is larger than you expected, contact us. We treat it as an incident.

What we still don't know

1. We haven't proven the score predicts business outcomes. Construct validity (the score reflects known visibility) is not the same as predictive validity (the score predicts pipeline). We don't have the panel data to test that yet. Within the next 90 days we're planning a customer cohort study. We'll publish those results regardless of how they look.

2. AI engines change underneath us. The exact same query can return a different answer next week because the model updated, the index refreshed, or the search tool changed behavior. We track engine version on every score. Drift greater than 5% on rerun against the same engine version is treated as a methodology incident.

3. Our competitor set per customer is editable. The Competitive Presence factor depends on who we list as your competitors. That's editable in your dashboard, which means scores are partially shaped by your input. We disclose this in every report.
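The drift rule in item 2 can be written as a simple check. This is a sketch under stated assumptions: the post does not say whether "5%" is relative to the previous score or an absolute difference in points, so the relative interpretation below is a guess, and `is_drift_incident` and its parameters are hypothetical names.

```python
def is_drift_incident(previous_score, rerun_score,
                      previous_engine_version, rerun_engine_version,
                      threshold=0.05):
    """Flag a rerun as an incident if the score moved more than `threshold`
    (relative to the previous score) on the same engine version."""
    if previous_engine_version != rerun_engine_version:
        # The engine itself changed, so a score change is expected, not an incident.
        return False
    if previous_score == 0:
        # Relative drift is undefined at zero; any movement away from zero is flagged.
        return rerun_score != 0
    relative_drift = abs(rerun_score - previous_score) / abs(previous_score)
    return relative_drift > threshold


# Invented scores and version labels for illustration only.
print(is_drift_incident(50.0, 53.0, "engine-v1", "engine-v1"))
print(is_drift_incident(50.0, 53.0, "engine-v1", "engine-v2"))
```

Comparing only within the same engine version is what separates "the engine changed" from "our measurement is unstable", which is why the version is recorded with every score.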

What you can do with this

If you're a Pondral customer: Read the methodology changelog for the v2.0.3 entry. The 20-brand construct validation data and methodology are public. Check them yourself. If your specific score moved more than you expected, contact us.

If you're evaluating AI visibility tools: Ask each vendor for the formula. The exact one. Not “we use a proprietary methodology.” Ask for the inter-rater reliability number. If they can't compute one, the score isn't reproducible. Ask for construct validation data.

If you're another B2B SaaS founder: the cost of writing a post like this is publishing your gaps before someone else finds them. That cost is much smaller than the cost of disclosing methodology problems in crisis mode, after a customer or competitor has found the gap first.

What's next

The cohort outcome study runs through 2026-09. We'll publish results in October regardless of whether they confirm or contradict the formula's predictive validity. The methodology changelog at /methodology/changelog is the canonical source for any future change.

If you want to follow methodology updates, follow Pondral on LinkedIn.

-- Philipp Groubii, Founder, Pondral

Last updated June 2026