Product · April 2026 · 12 min read

How We Built Pondral's Scoring Rubric: Design Decisions and Trade-offs

Building an AI visibility score that is accurate, reproducible, and transparent required making dozens of design decisions. Here's why we made the ones we did.

The problem with existing approaches

When we started building Pondral, the AI visibility tools that existed used one of two approaches: binary (mentioned/not mentioned) or opaque (a score from an undocumented algorithm). Neither was good enough.

Binary measurement tells you whether you're in the conversation but nothing about the quality of that presence. Being mentioned in a negative context, or mentioned last in a list of ten competitors, registers the same as being the first-recommended brand with a citation link. That is not a useful signal for making optimization decisions.

Opaque scores have the opposite problem. They provide a number, but without knowing what the number measures, you cannot act on it. If your score drops from 72 to 65, you need to know whether that is because you lost citation links, your sentiment changed, or a competitor gained share. A black-box score hides the actionable detail.

We designed Pondral's 5-factor scoring rubric to solve both problems: multi-dimensional (not binary) and fully transparent (not a black box).

Why five factors, and why these five

We tested over a dozen candidate factors before settling on five. The selection criteria were: each factor must be independently measurable, actionable (you can do something to improve it), and meaningful (it correlates with real-world outcomes like click-through or brand perception).

The five factors we selected are Presence (20%), Prominence (25%), Context (20%), Citation Link (20%), and Competitive Share (15%). Each measures a distinct dimension of AI visibility, and together they capture the full picture of how well a brand is represented in an AI-generated response. For a plain-language walkthrough of each, see the five-factor rubric, explained slowly.
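In code, the composite is easy to picture. Here is a minimal sketch, assuming the overall score is a straight weighted sum of the five factor scores; the names are illustrative, and the methodology page documents the exact formula.

```python
# Minimal sketch of the weighted composite. Assumes a straight linear
# combination of the five factor scores (each 0-100); names are illustrative.
WEIGHTS = {
    "presence": 0.20,
    "prominence": 0.25,
    "context": 0.20,
    "citation_link": 0.20,
    "competitive_share": 0.15,
}

def composite_score(factors: dict[str, float]) -> float:
    """Combine per-factor scores (0-100) into a single 0-100 visibility score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 100%
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)

# Example: strong presence and citation, middling prominence, weak competitive share
print(composite_score({
    "presence": 100, "prominence": 75, "context": 75,
    "citation_link": 100, "competitive_share": 25,
}))  # 77.5
```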

Factors we considered and rejected include response length (too noisy, varies by engine), source count (measures the AI's behavior, not your brand's presence), and query-level relevance (already implicit in query selection, not a per-response metric).

Why Prominence gets the highest weight

Prominence — where in the response your brand appears — is weighted at 25%, more than any other factor. This was a deliberate decision based on backtesting.

When we analyzed click-through data from AI responses, we found that brands mentioned in the first sentence of a Perplexity answer drive approximately 3.5x the click-through of brands mentioned in the last sentence. For ChatGPT, the ratio is approximately 2.8x. Position in the response is the single strongest predictor of whether a user actually engages with the cited brand.

This mirrors research on traditional search results: the first organic result gets 10x the clicks of the tenth. In AI responses, the effect is less extreme but still significant. Being mentioned first is measurably more valuable than being mentioned at all.

Ordinal buckets, not continuous floats

Each factor is scored into one of five buckets: 0, 25, 50, 75, 100. We chose ordinal buckets over continuous floats because of how we measure: an LLM rater evaluates each response against the rubric.

LLM judges are reliable at ordinal grading (“this is better than that”) but unreliable at fine-grained regression (“this deserves a 73, not a 76”). Using continuous floats would create false precision — the rater would generate numbers that look precise but aren't reproducible. Ordinal buckets match the resolution the underlying signal actually supports.

This decision also makes scores more interpretable. A Prominence score of 75 means the brand appeared early in the response. A score of 25 means it appeared late. There are no arbitrary decimal points to interpret or explain.
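As a sketch, constraining grades to the rubric's buckets is a one-liner. The snapping rule below is illustrative only, not how our rater is prompted in production:

```python
# Illustrative only: snap a raw grade to the nearest of the five rubric buckets
# rather than accepting an arbitrary float with false precision.
BUCKETS = (0, 25, 50, 75, 100)

def to_bucket(raw: float) -> int:
    """Return the rubric bucket closest to a raw grade."""
    return min(BUCKETS, key=lambda b: abs(b - raw))

print(to_bucket(73))  # 75 -- "a 73, not a 76" is precision the signal can't support
```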

t-distribution, not z-scores

Every score in Pondral is the average of at least three independent runs. We report 95% confidence intervals using the t-distribution at n−1 degrees of freedom.

This is a deliberate deviation from tools that use z-scores (the normal distribution). At small sample sizes (n = 3 to 5), the normal distribution understates uncertainty by roughly 12%, per the statistical literature. The t-distribution produces wider, more honest confidence intervals that accurately reflect how much you should trust the number.
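Here is a minimal sketch of the interval computation, assuming a standard t-based confidence interval over the run scores and using scipy for the t quantile; this is illustrative code, not our production pipeline.

```python
# Sketch: 95% confidence interval for a score averaged over n >= 3 runs,
# using the t-distribution with n - 1 degrees of freedom.
from statistics import mean, stdev
from scipy.stats import t

def score_with_ci(runs: list[float], confidence: float = 0.95) -> tuple[float, float]:
    """Return (mean score, half-width of the confidence interval)."""
    n = len(runs)
    if n < 3:
        raise ValueError("scores are reported from at least three independent runs")
    m = mean(runs)
    s = stdev(runs)                        # sample standard deviation (n - 1 denominator)
    t_crit = t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return m, t_crit * s / n ** 0.5        # wider than a z-interval at small n

m, hw = score_with_ci([62.0, 70.0, 72.0])  # three runs of the same query
print(f"{m:.0f} ± {hw:.0f}")               # 68 ± 13
```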

We also chose to display the confidence interval directly on the dashboard rather than hiding it behind a hover tooltip. When your score is 68 ± 14, you should see the ± 14 before you act on it. If the interval is wide, wait for more data. If it is tight, act with confidence.

Separate rater for Context

The Context factor (positive, neutral, or negative sentiment) is graded by a separate LLM rater — not the same engine that generated the original response. This avoids self-assessment bias, where an engine might evaluate its own output more favorably than an independent judge would.

We use inter-rater reliability checks: when two rater LLMs disagree by more than one bucket (e.g., one says 25 and the other says 75), we surface the disagreement on the dashboard and flag the cell for human review. We do not silently average away the conflict. Disagreements often signal edge cases — responses where sentiment is genuinely ambiguous — and those edge cases are valuable information.
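The disagreement check itself is simple. A sketch on the same bucket scale, with illustrative names:

```python
# Sketch of the inter-rater check: a gap of more than one bucket between two
# Context grades is surfaced for human review rather than silently averaged.
BUCKETS = (0, 25, 50, 75, 100)

def needs_review(grade_a: int, grade_b: int) -> bool:
    """True when two rater grades differ by more than one rubric bucket."""
    return abs(BUCKETS.index(grade_a) - BUCKETS.index(grade_b)) > 1

print(needs_review(25, 50))  # False: adjacent buckets, safe to combine
print(needs_review(25, 75))  # True: flagged on the dashboard for human review
```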

Transparency as a design principle

Every score on the Pondral dashboard has a “View raw” button. It returns the exact prompt sent to the AI engine, the full response text, the timestamp, the rater model and version, and the factor-by-factor breakdown. You can replay the query externally and verify our scoring yourself.
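A hypothetical shape for that raw payload, based only on the fields listed above; the field names are illustrative, not our actual API schema.

```python
# Hypothetical shape of the "View raw" payload; field names are illustrative.
from typing import TypedDict

class RawScoreRecord(TypedDict):
    prompt: str                    # exact prompt sent to the AI engine
    response_text: str             # full response as returned by the engine
    timestamp: str                 # when the run executed (ISO 8601 assumed)
    rater_model: str               # rater model and version used for grading
    factor_scores: dict[str, int]  # factor-by-factor bucket breakdown (0-100)
```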

This was a non-negotiable design decision. In a market where most tools produce scores from undocumented algorithms, we believe full reproducibility is the only way to earn trust. If our score doesn't match your replay, that is a bug in our system, not a feature.

For the full technical specification, read our methodology page, which documents every formula, weight, and threshold in the scoring system.

What we got wrong and changed

Building the scoring rubric was iterative. Some decisions we changed along the way:

Original Competitive Share weight: 25%. We initially weighted Competitive Share higher, but found it was too volatile: a competitor's absence from a single run inflated a brand's score without the brand doing anything. We reduced it to 15% and shifted weight to Prominence, which proved more stable and more predictive.

Three-bucket scoring. Our first version used three buckets (0, 50, 100). This was too coarse — there was no way to distinguish between a brand mentioned in the second sentence and one mentioned in the second-to-last paragraph. Five buckets provided the resolution raters could support reliably.

Single rater for all factors. Using a single rater introduced self-assessment bias on the Context factor. Splitting Context out to a separate rater improved inter-rater agreement from ~72% to ~89% in our internal testing.

Try it yourself

The best way to understand the scoring rubric is to see it applied to your brand. Run a free visibility check on Pondral's homepage and see your factor-by-factor breakdown across every major AI engine.


Philipp Groubii, Founder, Pondral

Philipp builds tools that help brands understand and improve their AI visibility. Background in SEO strategy, digital marketing, and SaaS product development. LinkedIn →