We ran our own AI visibility tool against our own brand, then against 20 other brands with known visibility profiles. Three things came out of it.
1. Our published methodology and our production code didn't match. They do now.
2. Our actual AI visibility on category queries is essentially zero. That's not a failure; it's a measurement. Now we can move it.
3. The reconciled formula ranks brands correctly. Across 20 brands spanning household names to startups no one has heard of, the score tracked real-world visibility with a rank correlation of 0.961.
If you're using an AI visibility tool, including ours, and the vendor can't show you data like this, ask why.
Why we did this
Most B2B SaaS tools in our category give customers a number, “your AI visibility score is 47,” without showing how that number was computed, what it correlates with, or whether a rerun would produce the same answer. We promised something different on /methodology: a public formula, evidence behind every score, and a versioned changelog. By default we grade one response per query and engine and report the mean, with repeated sampling across independent runs available on request.
In May we audited ourselves and discovered we hadn't fully kept that promise. Two formulas existed in our system. The one we published was a 5-factor weighted sum. The one our code actually used was a 4-factor SOV-based formula. Both were reasonable, but they weren't the same. Customers who read our methodology page would have computed a different number than the one we showed them.
This post is the audit trail: what we found, the math we changed, and the data showing the change is real.
What we found
1. Two formulas, one product
The /methodology page described five factors with specific weights:
| Factor | Weight |
|---|---|
| Presence | 20% |
| Prominence | 25% |
| Context | 20% |
| Citation Link | 20% |
| Competitive Presence | 15% |
That formula lived in the codebase, correctly implemented. But it wasn't what produced your customer score. The number on your dashboard came from a separate aggregator computing: 37% Brand SOV + 33% Generic SOV + 25% Owned Citations + 5% Grounded Mentions.
Both formulas are defensible. Neither was wrong. But they weren't the same, and the gap between them was invisible to customers, and to most of us internally.
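To make the gap concrete, here is a minimal sketch of both formulas as weighted sums. The weights come from the table and the aggregator description above; the function name, the dictionary keys, and the assumption that every input is already scored on a 0-100 scale are illustrative, not taken from the production code.

```python
# Weights as published on /methodology (5-factor formula).
PUBLISHED_WEIGHTS = {
    "presence": 0.20,
    "prominence": 0.25,
    "context": 0.20,
    "citation_link": 0.20,
    "competitive_presence": 0.15,
}

# Weights used by the separate dashboard aggregator (4-factor formula).
LEGACY_WEIGHTS = {
    "brand_sov": 0.37,
    "generic_sov": 0.33,
    "owned_citations": 0.25,
    "grounded_mentions": 0.05,
}


def weighted_score(factors: dict[str, float], weights: dict[str, float]) -> float:
    """Return the weighted sum of factor scores.

    Assumption: each factor is already on a 0-100 scale, so the result
    is also on a 0-100 scale. How each factor is graded is out of scope.
    """
    if set(factors) != set(weights):
        raise ValueError("factor names must match the weight names exactly")
    return sum(weights[name] * factors[name] for name in weights)
```

Because the two formulas take different inputs, there is no guarantee they agree for any given brand, which is why the mismatch mattered.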
2. Our own visibility was barely measurable
When we ran the existing 4-factor formula against pondral.com on 2026-05-05, here's what we got:
Not 30%. Not “limited.” Zero. Across 18 attempts spanning ChatGPT, Gemini, Perplexity, and Claude, asking category questions without naming us, we appeared in zero responses.
This is exactly what a brand-new tool with no Wikipedia entry, no G2 listing, and no third-party press should expect. But until we measured it, we were guessing.
3. Three production bugs surfaced during the run
The audit didn't run cleanly the first time. Three real bugs, all of which had been silently breaking production code, came out:
A missing database column. Our audit pipeline tried to write ‘sentiment’ to a column that didn't exist in any migration. Every audit attempt failed at insert time.
An environment variable name mismatch. Our Gemini provider read GOOGLE_API_KEY while the actual variable was GOOGLE_AI_API_KEY. Every Gemini call silently failed.
A stale model reference. Our Anthropic provider referenced a model name that doesn't exist on their API. Every Claude call returned 404.
We're disclosing these because methodology integrity isn't optional and isn't private. If we hide the boring failures, you have no reason to trust the numbers we publish.
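The environment variable bug is the kind that a fail-fast check at startup tends to catch. The sketch below is one possible guard, not the fix we shipped; `require_env` is a hypothetical helper, and only the variable name GOOGLE_AI_API_KEY comes from the incident above.

```python
import os
from collections.abc import Mapping


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return a required environment variable or raise immediately.

    Hypothetical helper: raising at startup turns a silent per-call
    failure into a loud one before any audit runs.
    """
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

A provider would call `require_env("GOOGLE_AI_API_KEY")` once during initialization, so a misnamed variable stops the process instead of failing every request quietly.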
What we changed
Reconciled formula (v2.0.0)
We wired the published 5-factor module into the actual production code path. Now when you see a score, it was computed from the formula on /methodology, not from a different formula behind it. The old 4-component weight constants are deprecated in the codebase.
A reliable Context grader
The 5-factor formula has a “Context” component that grades whether the AI's mention of you is accurate, in the right category, and not damaging. That requires an LLM judge. We needed to measure how reliable that judge is.
Two caveats. Both graders are from the same provider (Anthropic), which may inflate agreement. And the 30 samples were hand-selected, not randomly drawn. A cross-provider study (Claude Sonnet vs GPT-4o on 50 representative samples) is on the roadmap.
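For readers who want to check judge agreement on their own samples, here is a minimal sketch of Cohen's kappa, one common statistic for two graders assigning categorical labels. This post does not specify which agreement statistic we used, so treat the choice of kappa, the function name, and any label names as assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two graders labeling the same samples.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both graders must label the same non-empty sample set")
    n = len(labels_a)
    # Fraction of samples where the two graders gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected by chance, given each grader's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both graders used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for chance, which matters when one label dominates the sample. It does not correct for the two caveats above: same-provider graders and hand-selected samples can both inflate it.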
Construct validation: does the formula actually rank brands correctly?
We tested 20 brands against the reconciled formula. We grouped them into three tiers based on their real-world AI visibility: T1 (household names that AI regularly recommends), T2 (established but not dominant), and T3 (niche or new, rarely mentioned by AI unprompted).
| Tier | Mean score | Example brands |
|---|---|---|
| T1 (high visibility) | 53.6 | Figma, Stripe, Salesforce, HubSpot, Notion |
| T2 (moderate) | 28.8 | Webflow, Brex, Intercom, Airtable |
| T3 (low/none) | 10.0 | Cord, Luma AI, Attio, Tercera, Pondral |
Two caveats on this run. Context was held at the neutral bucket for all brands because the grader was still in calibration that day. And Gemini was excluded due to quota limits, with Claude partial (rate-limited mid-run). A clean confirmation run with all 5 engines and live Context grading is in progress.
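If you want to run the same kind of check on your own data, a rank correlation between formula scores and an independent visibility ranking can be computed as below. This is a sketch of Spearman's rank correlation with average ranks for ties; this post does not state which rank statistic produced the 0.961 figure, so the choice of Spearman and the function names are assumptions.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5
```

Here `x` would hold the formula scores and `y` the reference visibility measure (for example, a numeric tier per brand). Tier labels produce many ties, which is why the sketch uses average ranks rather than the shortcut formula that assumes distinct values.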
What this means for your number
We have a near-zero active customer base as of this writing, so a statistical score-shift analysis would be misleading. Rather than pretend otherwise, here is what we can offer: if you are an active customer and your score changed, you'll see the before/after on the methodology changelog at /methodology/changelog. If the change is larger than you expected, contact us. We treat it as an incident.
What we still don't know
1. We haven't proven the score predicts business outcomes. Construct validity (the score reflects known visibility) is not the same as predictive validity (the score predicts pipeline). We don't have the panel data to test that yet. Within the next 90 days we're planning a customer cohort study. We'll publish those results regardless of how they look.
2. AI engines change underneath us. The exact same query can return a different answer next week because the model updated, the index refreshed, or the search tool changed behavior. We track engine version on every score. Drift greater than 5% on rerun against the same engine version is treated as a methodology incident.
3. Our competitor set per customer is editable. The Competitive Presence factor depends on who we list as your competitors. That's editable in your dashboard, which means scores are partially shaped by your input. We disclose this in every report.
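The drift rule in item 2 can be sketched as a simple check. The sketch assumes "5%" means relative change against the baseline score; if it is meant as absolute points, the comparison changes accordingly. The function and parameter names are illustrative.

```python
def is_methodology_incident(
    baseline_score: float,
    rerun_score: float,
    baseline_engine_version: str,
    rerun_engine_version: str,
    threshold: float = 0.05,
) -> bool:
    """Flag a rerun whose score drifted past the threshold on the same engine version.

    Assumption: drift is measured relative to the baseline score.
    """
    # A different engine version is expected to move scores; not an incident.
    if baseline_engine_version != rerun_engine_version:
        return False
    # Relative drift is undefined at zero, so treat any movement as drift.
    if baseline_score == 0:
        return rerun_score != 0
    drift = abs(rerun_score - baseline_score) / abs(baseline_score)
    return drift > threshold
```

The zero-baseline branch matters for brands like ours: with a relative threshold, a score that starts at zero would otherwise never register drift.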
What you can do with this
If you're a Pondral customer: Read the methodology changelog for the v1.3.0 to v2.0.0 entry. The 20-brand construct validation data and methodology are public. Check them yourself. If your specific score moved more than you expected, contact us.
If you're evaluating AI visibility tools: Ask each vendor for the formula. The exact one. Not “we use a proprietary methodology.” Ask for the inter-rater reliability number. If they can't compute one, the score isn't reproducible. Ask for construct validation data.
If you're another B2B SaaS founder: the cost of writing a post like this is publishing your gaps before someone else finds them. That cost is much smaller than the cost of disclosing them in crisis mode after a customer or competitor gets there first.
What's next
The cohort outcome study runs through 2026-09. We'll publish results in October regardless of whether they confirm or contradict the formula's predictive validity. The methodology changelog at /methodology/changelog is the canonical source for any future change.
If you want to follow methodology updates, follow Pondral on LinkedIn.
-- Philipp Groubii, Founder, Pondral