Is this study big enough to be definitive?

No, and we will not pretend otherwise. Eight categories and 96 answers is a probe, not a census. But the effects are large and consistent enough that the direction is hard to argue with: the engines disagree with each other and with themselves, a lot. A bigger sample would sharpen the percentages, not reverse them.

Why do the engines disagree so much?

They draw on different training data, different live web sources, and different ranking logic, and they generate probabilistically rather than deterministically. That last part is why the same prompt can yield a different favorite twice in a row. It is a property of how these models work, not a glitch.
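
To make the probabilistic point concrete, here is a toy calculation, not a model of any real engine. It assumes each run independently samples its top pick from a fixed set of probabilities, which is a simplification we are introducing for illustration. The function name and the 80/20 split are hypothetical.

```python
def prob_top_pick_changes(pick_probs, runs=3):
    """Chance that the top pick is not identical across `runs` independent draws.

    pick_probs: probabilities of each candidate being named first (should sum to 1).
    Assumes runs are independent, which real engines may not satisfy.
    """
    # Probability that every run lands on the same candidate.
    all_same = sum(p ** runs for p in pick_probs)
    return 1 - all_same


# Hypothetical engine that names its favorite first 80% of the time.
print(round(prob_top_pick_changes([0.8, 0.2]), 2))  # 0.48
```

Under those assumptions, even an engine with a clear 80% favorite would show a changed top pick in roughly half of all three-run checks, so a single answer says little about the underlying preference.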

Does this mean AI visibility is hopeless to influence?

The opposite. It means the game is winnable but noisy, so you measure it properly instead of eyeballing one answer. The brands that consistently took the top slot were not random. They had strong, well-structured presence across the sources these engines read. That is influenceable.

How is this different from a brand mention tracker?

Most trackers count mentions and share of voice. This looks at recommendation: who gets named first when a buyer asks what to use. As the HubSpot example shows, those are different questions, and the recommendation one is the one tied to a purchase.

What AI Engines Recommend: A 96-Answer Study

An original test across 8 buyer categories and 4 AI engines. The headline: there is no single answer to hold a position in, and the same engine often contradicts itself minutes apart.

Everyone in AEO (answer engine optimization) talks about getting recommended by ChatGPT as if there is one answer to win, the way there was one blue link at position one. So we tested that assumption directly. We asked four AI engines to recommend products in eight common categories, three separate times each, and we wrote down who they named and in what order.

The short version: there is no single answer to win. The engines disagree with each other, and they disagree with themselves. That changes what visibility even means, and it is the reason a one-time check of a single model tells you almost nothing.

How we ran it

We kept the method deliberately plain so the results are easy to trust and easy to repeat. No prompt engineering tricks, just the kind of question a real buyer types.

✓ Eight categories a normal business actually shops for: project management, email marketing, small-business CRM, web analytics, AI writing assistants, help desk, team password managers, and social scheduling.
✓ Four web-connected engines: ChatGPT (GPT-4o), Claude (Sonnet 4.5), Gemini (2.5 Flash), and Perplexity (Sonar).
✓ One prompt per category, asked three times on each engine: "What are the best [category]? Give me your top recommendations."
✓ That is 96 answers in total. We extracted the named products from each, in order, and compared the top pick across engines and across repeat runs.
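
If you want to reproduce the grid, a minimal sketch of the run plan looks like this. The category and engine names come from the list above; the exact prompt wording per category and the function name `build_run_plan` are assumptions for illustration, and the sketch only builds the plan, it does not call any engine.

```python
from itertools import product

CATEGORIES = [
    "project management",
    "email marketing",
    "small-business CRM",
    "web analytics",
    "AI writing assistants",
    "help desk",
    "team password managers",
    "social scheduling",
]
ENGINES = ["ChatGPT", "Claude", "Gemini", "Perplexity"]
RUNS_PER_PROMPT = 3

# Assumed phrasing; the study's literal per-category wording is not published here.
PROMPT_TEMPLATE = "What are the best {category}? Give me your top recommendations."


def build_run_plan():
    """Return one (engine, category, run_number, prompt) tuple per answer to collect."""
    return [
        (engine, category, run, PROMPT_TEMPLATE.format(category=category))
        for engine, category, run in product(
            ENGINES, CATEGORIES, range(1, RUNS_PER_PROMPT + 1)
        )
    ]


plan = build_run_plan()
print(len(plan))  # 96 = 4 engines x 8 categories x 3 runs
```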

Finding one: the engines rarely agree on a winner

We looked at each engine's most common first pick per category, then checked whether all four lined up. In seven of the eight categories, they did not. The only clean agreement was team password managers, where all four led with 1Password.

Category	ChatGPT	Claude	Gemini	Perplexity
Project management	Teamwork	monday.com	Asana	monday.com
Email marketing	ActiveCampaign	ActiveCampaign	ActiveCampaign	Klaviyo
Small-business CRM	Bigin (Zoho)	Bigin (Zoho)	Bigin (Zoho)	monday.com
Web analytics	Google Analytics	Contentsquare	Google Analytics	Google Analytics
AI writing assistants	Claude	Claude	Claude	Grammarly
Help desk	HaloITSM	Zendesk	HaloITSM	Zoho Desk
Team password managers	1Password	1Password	1Password	1Password
Social scheduling	Hootsuite	Pallyy	Pallyy	Buffer

Even ChatGPT and Claude, the two most used engines, agreed on the top pick in only four of the eight categories. If your AEO plan is built around one model, you are optimizing for one opinion out of at least four, and the other three are sending buyers somewhere else.
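
The agreement counts can be checked directly from the table. The sketch below copies the table rows as-is; the function names are ours, chosen for illustration.

```python
from itertools import combinations

ENGINES = ["ChatGPT", "Claude", "Gemini", "Perplexity"]

# Most common first pick per category, in ENGINES order, copied from the table.
TOP_PICKS = {
    "Project management": ["Teamwork", "monday.com", "Asana", "monday.com"],
    "Email marketing": ["ActiveCampaign", "ActiveCampaign", "ActiveCampaign", "Klaviyo"],
    "Small-business CRM": ["Bigin (Zoho)", "Bigin (Zoho)", "Bigin (Zoho)", "monday.com"],
    "Web analytics": ["Google Analytics", "Contentsquare", "Google Analytics", "Google Analytics"],
    "AI writing assistants": ["Claude", "Claude", "Claude", "Grammarly"],
    "Help desk": ["HaloITSM", "Zendesk", "HaloITSM", "Zoho Desk"],
    "Team password managers": ["1Password", "1Password", "1Password", "1Password"],
    "Social scheduling": ["Hootsuite", "Pallyy", "Pallyy", "Buffer"],
}


def unanimous_categories(picks):
    """Categories where every engine leads with the same product."""
    return [category for category, row in picks.items() if len(set(row)) == 1]


def pairwise_agreement(picks, engines):
    """For each pair of engines, count categories where their top picks match."""
    counts = {}
    for (i, a), (j, b) in combinations(enumerate(engines), 2):
        counts[(a, b)] = sum(1 for row in picks.values() if row[i] == row[j])
    return counts


print(unanimous_categories(TOP_PICKS))  # ['Team password managers']
print(pairwise_agreement(TOP_PICKS, ENGINES)[("ChatGPT", "Claude")])  # 4
```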

There is no position one to hold. There are four different answers, and they are all live at the same time.

Finding two: ask the same engine twice, get two answers

This is the part that should retire the one-time screenshot. We asked each engine the identical question three times. In 47% of engine and category pairs, the top recommendation changed at least once across those three runs. Same model, same prompt, a different favorite a minute later.

47% of the time, an engine changed its number one pick across three identical runs

So when someone says they checked and ChatGPT recommends them, the honest follow-up is: in which of the runs? A single look is a coin flip dressed up as a finding. You need repeats across engines to see a real pattern, which is exactly the part a manual check almost never does.
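
The volatility figure is simple to compute once you have repeat runs. Here is a minimal sketch, assuming each answer has been reduced to an ordered list of product names with at least one entry. The sample data is a hypothetical placeholder, not the study's raw answers, and the function names are ours. With 4 engines and 8 categories there are 32 engine and category pairs, so 47% corresponds to about 15 of them.

```python
def top_pick_changed(runs):
    """True if the first-named product differs in any of the repeat runs."""
    return len({answer[0] for answer in runs}) > 1


def volatility_rate(results):
    """Share of (engine, category) pairs whose top pick changed at least once."""
    changed = sum(1 for runs in results.values() if top_pick_changed(runs))
    return changed / len(results)


# Hypothetical placeholder data: two pairs, three runs each.
sample = {
    ("Engine A", "help desk"): [
        ["Brand X", "Brand Y"],
        ["Brand X", "Brand Z"],
        ["Brand X", "Brand Y"],
    ],
    ("Engine B", "help desk"): [
        ["Brand Y", "Brand X"],
        ["Brand X", "Brand Y"],
        ["Brand Y", "Brand Z"],
    ],
}

print(volatility_rate(sample))  # 0.5: Engine B changed its top pick, Engine A did not
```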

Finding three: getting mentioned is not getting recommended

The most useful split in the data is between brands that get named a lot and brands that actually get picked. They are not the same list, and the gap is where most companies lose.

In the small-business CRM category, HubSpot was named in 11 of 12 answers, more than almost any other brand. It was the top pick for none of the four engines. They led with Bigin and monday.com instead. Being everywhere in the answer is not the same as being the recommendation.

The same shape showed up elsewhere. ClickUp and Asana were named in every single project-management answer, yet neither held the top spot consistently across engines. Matomo appeared in all twelve web-analytics answers and still was not any engine's lead pick. Across the study, engines named about nineteen distinct brands per category, so a mention is cheap. The first slot is the scarce one, and it is the one buyers act on.
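
Separating the two counts takes only a few lines. This sketch assumes each answer is an ordered list of brand names; the brands and answers below are hypothetical placeholders that mimic the HubSpot pattern, not data from the study.

```python
from collections import Counter


def mention_and_top_counts(answers):
    """Count how many answers name each brand, and how many lead with it."""
    mentions = Counter()
    top_picks = Counter()
    for ranked in answers:
        mentions.update(set(ranked))  # count a brand once per answer
        if ranked:
            top_picks[ranked[0]] += 1
    return mentions, top_picks


# Hypothetical answers: Brand A is named every time but never named first.
sample_answers = [
    ["Brand B", "Brand A", "Brand C"],
    ["Brand B", "Brand A"],
    ["Brand C", "Brand A", "Brand B"],
    ["Brand B", "Brand C", "Brand A"],
]

mentions, top_picks = mention_and_top_counts(sample_answers)
print(mentions["Brand A"], top_picks["Brand A"])  # 4 0
print(mentions["Brand B"], top_picks["Brand B"])  # 4 3
```

Tracking both numbers side by side is what exposes a brand that is always present and never chosen.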

What this means if you sell something

Three practical takeaways fall out of this, and none of them is "buy a monthly dashboard."

Stop optimizing for one model. You need to know what all four say, because they disagree often enough that one of them is the wrong sample.
Stop trusting a single check. Volatility is real and large. Look across repeat runs or you are reading noise.
Aim for the top pick, not just a mention. Showing up in the list is table stakes. The category leaders own the first slot, and that is a different and harder job than getting named.

If you want to run a rough version of this yourself, the free prompt pack generator builds the buyer questions to test, grouped by intent. Paste them into each engine, run each a few times, and note where you appear and where you do not. It will not be as clean as a controlled study, but it will tell you in ten minutes whether you have a problem worth taking seriously.

And when you want the controlled version, that is what the audit is. We run a tuned set of prompts across ChatGPT, Claude, and Gemini, score who gets recommended over you, and hand back the fixes ranked by impact. This study is the method, pointed at your business instead of eight generic categories.
