Original data

We asked four AI engines to recommend products 96 times. They barely agreed.

An original test across 8 buyer categories and 4 AI engines. The headline: there is no single answer to hold a position in, and the same engine often contradicts itself minutes apart.

Harsh Rana · June 23, 2026 · 7 min read

The short answer

Across 8 product categories, ChatGPT, Claude, Gemini, and Perplexity agreed on a single best pick in only one. The same engine, asked the same question three times, changed its top recommendation in 47% of cases.

7 of 8 categories where the four AI engines could not agree on one best pick

Everyone in AEO talks about getting recommended by ChatGPT as if there is one answer to win, the way there was one blue link at position one. So we tested that assumption directly. We asked four AI engines to recommend products in eight common categories, three separate times each, and we wrote down who they named and in what order.

The short version: there is no single answer to win. The engines disagree with each other, and they disagree with themselves. That changes what visibility even means, and it is the reason a one-time check of a single model tells you almost nothing.

How we ran it

We kept the method deliberately plain so the results are easy to trust and easy to repeat. No prompt engineering tricks, just the kind of question a real buyer types.
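The numbers in this piece imply a simple grid: 4 engines, 8 categories, and 3 runs of each question, for 96 answers in total. A minimal sketch of that run plan is below. The engine and category names come from this article; the prompt wording and the `build_run_plan` helper are hypothetical placeholders, not the exact prompts used in the test.

```python
from itertools import product

# Engines and categories are taken from the article.
ENGINES = ["ChatGPT", "Claude", "Gemini", "Perplexity"]
CATEGORIES = [
    "project management",
    "email marketing",
    "small-business CRM",
    "web analytics",
    "AI writing assistants",
    "help desk",
    "team password managers",
    "social scheduling",
]
RUNS_PER_PAIR = 3


def build_run_plan():
    """Return one (engine, category, run_number, prompt) tuple per answer to collect."""
    plan = []
    for engine, category, run in product(ENGINES, CATEGORIES, range(1, RUNS_PER_PAIR + 1)):
        # Hypothetical wording: an assumption, not the study's actual prompt.
        prompt = f"What is the best {category} tool for a small team?"
        plan.append((engine, category, run, prompt))
    return plan


plan = build_run_plan()
print(len(plan))  # 4 engines x 8 categories x 3 runs = 96
```

Writing the plan down first makes it harder to skip a repeat run, which is the step a manual check tends to drop.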

Finding one: the engines rarely agree on a winner

We looked at each engine's most common first pick per category, then checked whether all four lined up. In seven of the eight categories, they did not. The only clean agreement was team password managers, where all four led with 1Password.

| Category | ChatGPT | Claude | Gemini | Perplexity |
|---|---|---|---|---|
| Project management | Teamwork | monday.com | Asana | monday.com |
| Email marketing | ActiveCampaign | ActiveCampaign | ActiveCampaign | Klaviyo |
| Small-business CRM | Bigin (Zoho) | Bigin (Zoho) | Bigin (Zoho) | monday.com |
| Web analytics | Google Analytics | Contentsquare | Google Analytics | Google Analytics |
| AI writing assistants | Claude | Claude | Claude | Grammarly |
| Help desk | HaloITSM | Zendesk | HaloITSM | Zoho Desk |
| Team password managers | 1Password | 1Password | 1Password | 1Password |
| Social scheduling | Hootsuite | Pallyy | Pallyy | Buffer |

Even ChatGPT and Claude, the two most used engines, agreed on the top pick in only four of the eight categories. If your AEO plan is built around one model, you are optimizing for one opinion out of at least four, and the other three are sending buyers somewhere else.
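If you want to check the agreement counts yourself, the table above is enough. The sketch below copies each engine's first pick per category from that table and counts unanimous categories and pairwise matches. The function names are ours, for illustration only.

```python
# First picks per category, copied from the table above.
# Column order: ChatGPT, Claude, Gemini, Perplexity.
ENGINES = ["ChatGPT", "Claude", "Gemini", "Perplexity"]
TOP_PICKS = {
    "Project management": ["Teamwork", "monday.com", "Asana", "monday.com"],
    "Email marketing": ["ActiveCampaign", "ActiveCampaign", "ActiveCampaign", "Klaviyo"],
    "Small-business CRM": ["Bigin (Zoho)", "Bigin (Zoho)", "Bigin (Zoho)", "monday.com"],
    "Web analytics": ["Google Analytics", "Contentsquare", "Google Analytics", "Google Analytics"],
    "AI writing assistants": ["Claude", "Claude", "Claude", "Grammarly"],
    "Help desk": ["HaloITSM", "Zendesk", "HaloITSM", "Zoho Desk"],
    "Team password managers": ["1Password", "1Password", "1Password", "1Password"],
    "Social scheduling": ["Hootsuite", "Pallyy", "Pallyy", "Buffer"],
}


def unanimous_categories(picks):
    """Categories where every engine has the same first pick."""
    return [cat for cat, row in picks.items() if len(set(row)) == 1]


def pair_agreement(picks, engine_a, engine_b):
    """Number of categories where two engines share a first pick."""
    a, b = ENGINES.index(engine_a), ENGINES.index(engine_b)
    return sum(1 for row in picks.values() if row[a] == row[b])


print(unanimous_categories(TOP_PICKS))                 # ['Team password managers']
print(pair_agreement(TOP_PICKS, "ChatGPT", "Claude"))  # 4
```

Swap in any other pair of engines to see how often they line up.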

There is no position one to hold. There are four different answers, and they are all live at the same time.

Finding two: ask the same engine twice, get two answers

This is the part that should retire the one-time screenshot. We asked each engine the identical question three times. In 47% of engine and category pairs, the top recommendation changed at least once across those three runs. Same model, same prompt, a different favorite a minute later.

47% of the time, an engine changed its number one pick across three identical runs

So when someone says they checked and ChatGPT recommends them, the honest follow-up is: in which of the runs? A single look is a coin flip dressed up as a finding. You need repeats across engines to see a real pattern, which is exactly the part a manual check almost never does.
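The volatility number is easy to reproduce on your own runs. A minimal sketch: record the first pick from each repeat, then count the engine and category pairs where the pick was not the same every time. The data below is a hypothetical toy example, not the study's raw results. With 4 engines and 8 categories there are 32 pairs, so a rate of 47% would correspond to roughly 15 of them, assuming the rate is taken over all 32.

```python
def changed_top_pick(runs):
    """True if the first pick differs in at least one of the repeated runs."""
    return len(set(runs)) > 1


def volatility(results):
    """Share of (engine, category) pairs whose first pick changed across runs."""
    changed = sum(1 for runs in results.values() if changed_top_pick(runs))
    return changed / len(results)


# Hypothetical toy data: first pick from each of three identical runs.
toy_results = {
    ("ChatGPT", "help desk"): ["HaloITSM", "HaloITSM", "Zendesk"],
    ("Claude", "help desk"): ["Zendesk", "Zendesk", "Zendesk"],
    ("Gemini", "help desk"): ["HaloITSM", "Zoho Desk", "HaloITSM"],
    ("Perplexity", "help desk"): ["Zoho Desk", "Zoho Desk", "Zoho Desk"],
}
print(volatility(toy_results))  # 0.5
```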

Finding three: getting mentioned is not getting recommended

The most useful split in the data is between brands that get named a lot and brands that actually get picked. They are not the same list, and the gap is where most companies lose.

In the small-business CRM category, HubSpot was named in 11 of 12 answers, more than almost any other brand. It was the top pick for none of the four engines. They led with Bigin and monday.com instead. Being everywhere in the answer is not the same as being the recommendation.

The same shape showed up elsewhere. ClickUp and Asana were named in every single project-management answer, yet neither held the top spot consistently across engines. Matomo appeared in all twelve web-analytics answers and still was not any engine's lead pick. Across the study, engines named about nineteen distinct brands per category, so a mention is cheap. The first slot is the scarce one, and it is the one buyers act on.
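To see this gap in your own category, count two things separately: how often a brand is named anywhere in an answer, and how often it is named first. The sketch below does that for a list of ranked answers. The answers shown are hypothetical toy data, and `mention_and_top_counts` is an illustrative helper, not part of any tool.

```python
from collections import Counter


def mention_and_top_counts(answers):
    """Count how often each brand is named at all and how often it is named first.

    `answers` is a list of ranked brand lists, one list per answer.
    """
    mentions = Counter()
    top_picks = Counter()
    for ranked in answers:
        if not ranked:
            continue
        mentions.update(set(ranked))  # count a brand once per answer
        top_picks[ranked[0]] += 1
    return mentions, top_picks


# Hypothetical toy answers, not the study's raw data.
toy_answers = [
    ["Bigin", "HubSpot", "Pipedrive"],
    ["monday.com", "HubSpot"],
    ["Bigin", "Pipedrive"],
    ["Bigin", "HubSpot", "monday.com"],
]
mentions, top_picks = mention_and_top_counts(toy_answers)
print(mentions["HubSpot"], top_picks["HubSpot"])  # 3 0
```

A brand with a high first number and a zero second number is in the HubSpot position described above: present in the answer, absent from the recommendation.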

What this means if you sell something

Three practical takeaways fall out of this, and none of them is "buy a monthly dashboard."

  1. Stop optimizing for one model. You need to know what all four say, because they disagree often enough that any single engine is a misleading sample.
  2. Stop trusting a single check. Volatility is real and large. Look across repeat runs or you are reading noise.
  3. Aim for the top pick, not just a mention. Showing up in the list is table stakes. The category leaders own the first slot, and that is a different and harder job than getting named.

If you want to run a rough version of this yourself, the free prompt pack generator builds the buyer questions to test, grouped by intent. Paste them into each engine, run each a few times, and note where you appear and where you do not. It will not be as clean as a controlled study, but it will tell you in ten minutes whether you have a problem worth taking seriously.

And when you want the controlled version, that is what the audit is. We run a tuned set of prompts across ChatGPT, Claude, and Gemini, score who gets recommended over you, and hand back the fixes ranked by impact. This study is the method, pointed at your business instead of eight generic categories.

Questions

Is this study big enough to be definitive?

No, and we will not pretend otherwise. Eight categories and 96 answers is a probe, not a census. But the effects are large and consistent enough that the direction is hard to argue with: the engines disagree with each other and with themselves, a lot. A bigger sample would sharpen the percentages, not reverse them.

Why do the engines disagree so much?

They draw on different training data, different live web sources, and different ranking logic, and they generate probabilistically rather than deterministically. That last part is why the same prompt can yield a different favorite twice in a row. It is a property of how these models work, not a glitch.
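A toy calculation shows why probabilistic generation alone can produce this much churn. Assume, purely for illustration, that an engine picks its top brand by drawing independently from a fixed set of weights on each run. Real engines are far more complicated and their sources also shift between runs, so treat this as intuition, not a model of any actual product. Under that assumption, the chance that three runs do not all agree is one minus the sum of each weight cubed.

```python
def chance_pick_changes(probabilities, runs=3):
    """Probability that independent draws do not all land on the same brand."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    all_same = sum(p ** runs for p in probabilities)
    return 1.0 - all_same


# Hypothetical preference weights, chosen only to illustrate the arithmetic.
print(round(chance_pick_changes([0.6, 0.3, 0.1]), 3))  # 0.756
print(round(chance_pick_changes([0.9, 0.1]), 3))       # 0.27
```

Even a brand favored 90% of the time on a single draw would, in this toy setup, lose the top slot at least once in about a quarter of three-run checks.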

Does this mean AI visibility is hopeless to influence?

The opposite. It means the game is winnable but noisy, so you measure it properly instead of eyeballing one answer. The brands that consistently took the top slot were not random. They had strong, well-structured presence across the sources these engines read. That is influenceable.

How is this different from a brand mention tracker?

Most trackers count mentions and share of voice. This looks at recommendation: who gets named first when a buyer asks what to use. As the HubSpot example shows, those are different questions, and the recommendation one is the one tied to a purchase.

Harsh Rana

I build Ron at 617 Software Studio, a small Boston shop. I run real AI visibility audits by hand and pour what I learn into how Ron works. These notes come from the actual reports, not a content brief.
