Everyone in AEO (answer engine optimization) talks about getting recommended by ChatGPT as if there is one answer to win, the way there was one blue link at position one. So we tested that assumption directly. We asked four AI engines to recommend products in eight common categories, three separate times each, and we wrote down who they named and in what order.
The short version: there is no single answer to win. The engines disagree with each other, and they disagree with themselves. That changes what visibility even means, and it is the reason a one-time check of a single model tells you almost nothing.
How we ran it
We kept the method deliberately plain so the results are easy to trust and easy to repeat. No prompt engineering tricks, just the kind of question a real buyer types.
- Eight categories a normal business actually shops for: project management, email marketing, small-business CRM, web analytics, AI writing assistants, help desk, team password managers, and social scheduling.
- Four web-connected engines: ChatGPT (GPT-4o), Claude (Sonnet 4.5), Gemini (2.5 Flash), and Perplexity (Sonar).
- One prompt per category, asked three times on each engine: "what are the best [category], give me your top recommendations."
- That is 96 answers in total. We extracted the named products from each, in order, and compared the top pick across engines and across repeat runs.
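For anyone who wants to repeat the setup, the run grid can be sketched in a few lines of Python. This is an illustration only, not the tooling we used: the function names `build_prompt` and `build_run_grid` are made up for the example, and it only enumerates the runs, it does not call any engine.

```python
from itertools import product

# Categories and engines as listed above; RUNS is the number of repeats per pair.
CATEGORIES = [
    "project management", "email marketing", "small-business CRM", "web analytics",
    "AI writing assistants", "help desk", "team password managers", "social scheduling",
]
ENGINES = ["ChatGPT", "Claude", "Gemini", "Perplexity"]
RUNS = 3

def build_prompt(category: str) -> str:
    # Hypothetical helper: fills the category into the prompt template quoted above.
    return f"what are the best {category}, give me your top recommendations."

def build_run_grid():
    """Return one record per (category, engine, run) combination."""
    return [
        {"category": c, "engine": e, "run": r, "prompt": build_prompt(c)}
        for c, e, r in product(CATEGORIES, ENGINES, range(1, RUNS + 1))
    ]

grid = build_run_grid()
print(len(grid))  # 8 categories x 4 engines x 3 runs = 96
```

Each record is one answer to collect, so the grid doubles as a checklist while you paste prompts by hand.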
Finding one: the engines rarely agree on a winner
We looked at each engine's most common first pick per category, then checked whether all four lined up. In seven of the eight categories, they did not. The only clean agreement was team password managers, where all four led with 1Password.
| Category | ChatGPT | Claude | Gemini | Perplexity |
|---|---|---|---|---|
| Project management | Teamwork | monday.com | Asana | monday.com |
| Email marketing | ActiveCampaign | ActiveCampaign | ActiveCampaign | Klaviyo |
| Small-business CRM | Bigin (Zoho) | Bigin (Zoho) | Bigin (Zoho) | monday.com |
| Web analytics | Google Analytics | Contentsquare | Google Analytics | Google Analytics |
| AI writing assistants | Claude | Claude | Claude | Grammarly |
| Help desk | HaloITSM | Zendesk | HaloITSM | Zoho Desk |
| Team password managers | 1Password | 1Password | 1Password | 1Password |
| Social scheduling | Hootsuite | Pallyy | Pallyy | Buffer |
Even ChatGPT and Claude, the two most used engines, agreed on the top pick in only four of the eight categories. If your AEO plan is built around one model, you are optimizing for one opinion out of at least four, and the other three are sending buyers somewhere else.
There is no position one to hold. There are four different answers, and they are all live at the same time.
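The agreement counts above can be checked straight from the table. The sketch below copies the table into a dictionary and counts the categories where all four engines match, plus the agreement for every pair of engines. The function names are ours, invented for the example.

```python
from itertools import combinations

ENGINES = ["ChatGPT", "Claude", "Gemini", "Perplexity"]

# Most common first pick per engine, copied from the table above (same column order).
TOP_PICKS = {
    "Project management": ["Teamwork", "monday.com", "Asana", "monday.com"],
    "Email marketing": ["ActiveCampaign", "ActiveCampaign", "ActiveCampaign", "Klaviyo"],
    "Small-business CRM": ["Bigin (Zoho)", "Bigin (Zoho)", "Bigin (Zoho)", "monday.com"],
    "Web analytics": ["Google Analytics", "Contentsquare", "Google Analytics", "Google Analytics"],
    "AI writing assistants": ["Claude", "Claude", "Claude", "Grammarly"],
    "Help desk": ["HaloITSM", "Zendesk", "HaloITSM", "Zoho Desk"],
    "Team password managers": ["1Password", "1Password", "1Password", "1Password"],
    "Social scheduling": ["Hootsuite", "Pallyy", "Pallyy", "Buffer"],
}

def unanimous_categories(picks):
    """Categories where every engine has the same top pick."""
    return [cat for cat, row in picks.items() if len(set(row)) == 1]

def pairwise_agreement(picks, engines):
    """Number of categories in which each pair of engines shares a top pick."""
    counts = {}
    for i, j in combinations(range(len(engines)), 2):
        counts[(engines[i], engines[j])] = sum(
            1 for row in picks.values() if row[i] == row[j]
        )
    return counts

unanimous = unanimous_categories(TOP_PICKS)
pairs = pairwise_agreement(TOP_PICKS, ENGINES)
print(unanimous)                     # ['Team password managers']
print(pairs[("ChatGPT", "Claude")])  # 4
```

Swap in your own category rows and the same two functions tell you how far apart the engines are in your market.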
Finding two: ask the same engine twice, get two answers
This is the part that should retire the one-time screenshot. We asked each engine the identical question three times. In 47% of the 32 engine-category pairs, the top recommendation changed at least once across those three runs. Same model, same prompt, a different favorite a minute later.
47% of the time, an engine changed its number one pick across three identical runs.
So when someone says they checked and ChatGPT recommends them, the honest follow-up is: in which of the runs? A single look is a coin flip dressed up as a finding. You need repeats across engines to see a real pattern, which is exactly the part a manual check almost never does.
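If you log the top pick from each repeat run, the volatility figure is simple to compute: count the engine and category pairs where the top pick was not identical across runs, then divide by the number of pairs (32 in our setup, four engines times eight categories). Here is a minimal sketch, assuming you store one list of top picks per pair. The data below is made up to show the mechanics and is not the study's data.

```python
def top_pick_changed(top_picks_per_run):
    """True if the first-named product differs in at least one run."""
    return len(set(top_picks_per_run)) > 1

def volatility_rate(runs_by_pair):
    """Share of (engine, category) pairs whose top pick changed across runs."""
    changed = sum(1 for picks in runs_by_pair.values() if top_pick_changed(picks))
    return changed / len(runs_by_pair)

# Made-up toy data, NOT the study's results: the top pick from each of three runs.
toy_runs = {
    ("EngineA", "crm"): ["BrandX", "BrandX", "BrandX"],
    ("EngineA", "help desk"): ["BrandY", "BrandZ", "BrandY"],
    ("EngineB", "crm"): ["BrandX", "BrandW", "BrandV"],
    ("EngineB", "help desk"): ["BrandZ", "BrandZ", "BrandZ"],
}

rate = volatility_rate(toy_runs)
print(f"{rate:.0%}")  # 2 of the 4 toy pairs changed, so this prints 50%
```

Note that a pair counts as changed even if only one of the three runs broke from the others, which matches the "at least once" definition used above.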
Finding three: getting mentioned is not getting recommended
The most useful split in the data is between brands that get named a lot and brands that actually get picked. They are not the same list, and the gap is where most companies lose.
In the small-business CRM category, HubSpot was named in 11 of 12 answers, more than almost any other brand. It was the top pick for none of the four engines. They led with Bigin and monday.com instead. Being everywhere in the answer is not the same as being the recommendation.
The same shape showed up elsewhere. ClickUp and Asana were named in every single project-management answer, yet neither held the top spot consistently across engines. Matomo appeared in all twelve web-analytics answers and still was not any engine's lead pick. Across the study, engines named about nineteen distinct brands per category, so a mention is cheap. The first slot is the scarce one, and it is the one buyers act on.
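The mention versus top-pick split is easy to measure once each answer is stored as an ordered list of product names. A rough sketch follows, with a hypothetical helper name and made-up brands rather than the study's answers:

```python
from collections import Counter

def mention_and_top_counts(answers):
    """answers: list of ordered product lists, one list per answer.

    Returns two Counters: how many answers named each brand at all,
    and how many answers listed it first.
    """
    mentions = Counter()
    tops = Counter()
    for ranked in answers:
        mentions.update(set(ranked))  # count each brand once per answer
        if ranked:
            tops[ranked[0]] += 1
    return mentions, tops

# Made-up toy answers, NOT the study's data.
toy_answers = [
    ["BrandA", "BrandB", "BrandC"],
    ["BrandC", "BrandB"],
    ["BrandA", "BrandB", "BrandD"],
    ["BrandA", "BrandC", "BrandB"],
]

mentions, tops = mention_and_top_counts(toy_answers)
for brand in sorted(mentions):
    print(brand, mentions[brand], tops[brand])
```

In this toy set BrandB is named in all four answers and leads none of them, which is the same shape as the HubSpot result: high mention count, zero top picks.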
What this means if you sell something
Three practical takeaways fall out of this, and none of them is "buy a monthly dashboard."
- Stop optimizing for one model. You need to know what all four say, because they disagree often enough that any single one is a misleading sample.
- Stop trusting a single check. Volatility is real and large. Look across repeat runs or you are reading noise.
- Aim for the top pick, not just a mention. Showing up in the list is table stakes. The category leaders own the first slot, and that is a different and harder job than getting named.
If you want to run a rough version of this yourself, the free prompt pack generator builds the buyer questions to test, grouped by intent. Paste them into each engine, run each a few times, and note where you appear and where you do not. It will not be as clean as a controlled study, but it will tell you in ten minutes whether you have a problem worth taking seriously.
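If you keep notes while you do this by hand, a small script can turn them into a per-engine summary for your own brand. The sketch below assumes you record each answer as an engine name plus the products in the order they were named. `summarize_brand` and "YourBrand" are placeholders for the example, and the log entries are invented.

```python
from collections import defaultdict

def summarize_brand(log, brand):
    """log: list of (engine, ordered product list) tuples from manual checks.

    Returns, per engine, how many runs were logged, how many named the brand,
    and how many listed it first.
    """
    summary = defaultdict(lambda: {"runs": 0, "named": 0, "top": 0})
    for engine, ranked in log:
        row = summary[engine]
        row["runs"] += 1
        if brand in ranked:
            row["named"] += 1
        if ranked and ranked[0] == brand:
            row["top"] += 1
    return dict(summary)

# Made-up log entries, for illustration only.
toy_log = [
    ("ChatGPT", ["BrandA", "YourBrand", "BrandB"]),
    ("ChatGPT", ["YourBrand", "BrandA"]),
    ("ChatGPT", ["BrandA", "BrandB"]),
    ("Gemini", ["BrandB", "BrandA"]),
    ("Gemini", ["BrandB", "YourBrand"]),
]

summary = summarize_brand(toy_log, "YourBrand")
for engine, row in summary.items():
    print(engine, row)
```

An engine where "named" is well above "top" is the mentioned-but-not-recommended gap described above, and an engine where "named" is zero is the bigger problem.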
And when you want the controlled version, that is what the audit is. We run a tuned set of prompts across ChatGPT, Claude, and Gemini, score who gets recommended over you, and hand back the fixes ranked by impact. This study is the method, pointed at your business instead of eight generic categories.