Here is a situation that plays out constantly. A developer blocks all unknown bots after a traffic spike. Or a WordPress security plugin adds a blanket Disallow rule. Or someone copies a robots.txt template from 2019 that predates AI crawlers entirely. The site looks fine in Google Search Console. Organic traffic holds steady. But quietly, the site has vanished from every AI-generated answer.
The robots.txt file is small, easy to overlook, and now more consequential than it has been in years. This post walks through every major AI user-agent you need to know, explains the difference between training crawlers and search-and-answer crawlers (the distinction almost everyone gets wrong), and shows you the exact fix.
The complete AI crawler list, grouped by vendor
Before you can audit your robots.txt, you need to know which user-agents to look for. The list has grown fast. Here is the full set worth tracking as of mid-2026.
| User-Agent | Vendor | Purpose | Type |
|---|---|---|---|
| GPTBot | OpenAI | Crawls pages to train future GPT models | Training |
| OAI-SearchBot | OpenAI | Builds the live search index powering ChatGPT search | Search / Answer |
| ChatGPT-User | OpenAI | Fetches URLs in real time when a user pastes a link in ChatGPT | Live Fetch / Answer |
| ClaudeBot | Anthropic | Crawls pages to train Claude models | Training |
| Claude-SearchBot | Anthropic | Crawls for Claude's live web-grounded answers | Search / Answer |
| anthropic-ai | Anthropic | Legacy Anthropic crawler identifier (older crawl runs) | Training |
| Google-Extended | Google | Gemini and Vertex AI training opt-out token | Training |
| PerplexityBot | Perplexity | Indexes pages for Perplexity's answer engine | Search / Answer |
| Perplexity-User | Perplexity | Fetches URLs live during a Perplexity session | Live Fetch / Answer |
| CCBot | Common Crawl | Open web crawl used as training data by many AI labs | Training |
| Bytespider | ByteDance | Crawls for TikTok and ByteDance AI products | Training / Mixed |
| Amazonbot | Amazon | Crawls for Alexa and Amazon AI features | Training / Mixed |
| Applebot-Extended | Apple | Apple Intelligence and Siri training opt-out token | Training |
| Meta-ExternalAgent | Meta | Crawls for Meta AI training and product features | Training / Mixed |
The one distinction that actually matters: training vs. search-and-answer
Most of the confusion around AI crawler blocking comes from treating all AI bots as the same thing. They are not. There are two fundamentally different jobs these crawlers do, and blocking them has opposite consequences.
Training crawlers
GPTBot, Google-Extended, ClaudeBot, CCBot, and most of the others in the list above are training crawlers. They collect your content so it can be used to train or fine-tune a model. Blocking them is a legitimate, widely accepted choice. You opt your content out of the training corpus. The model does not learn from your pages. That is the full consequence. The model still exists, still answers questions, and still cites other sources. You just are not in the training data.
Search and answer crawlers
OAI-SearchBot, ChatGPT-User, Claude-SearchBot, PerplexityBot, and Perplexity-User work differently. These bots build or refresh a live retrieval index. When a user asks ChatGPT a question and the answer cites a source, that source was found by OAI-SearchBot. When Perplexity quotes your article, PerplexityBot crawled it. Block these bots and you do not just opt out of training. You disappear from live answers entirely.
Blocking GPTBot is a training opt-out. Blocking OAI-SearchBot is a visibility opt-out. Most people do not realize those are two different robots.
A Hostinger study cited by Search Engine Journal found that OpenAI's OAI-SearchBot had already surpassed 55% crawl coverage across tracked sites. That is a substantial footprint for a crawler that did not exist a couple of years ago. It means the bot is actively visiting a majority of the sites in that sample, and any Disallow rule it hits produces a real, immediate gap in ChatGPT's search results.
How to check your robots.txt right now
Your robots.txt lives at yourdomain.com/robots.txt. Open it in a browser and check four things.
- A wildcard block that catches everything: a User-agent: * line followed by Disallow: / or a broad path. If the file has no separate group naming a specific bot, that bot falls under the wildcard, so every crawler, AI bots included, is blocked.
- Named bot blocks: search the file for GPTBot, OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeBot, and Claude-SearchBot. Any Disallow: / under one of these names blocks that specific bot.
- Specificity, not order: a group that names a bot overrides the wildcard for that bot, wherever it sits in the file. So even if User-agent: * allows everything, an explicit User-agent: OAI-SearchBot with Disallow: / underneath blocks that crawler.
- Empty or missing file: a missing robots.txt means no restrictions at all, which is generally fine. The problems come from files with rules.
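The checks above can be sketched as a short script. This is a minimal sketch using Python's standard-library urllib.robotparser, not a definitive audit. The function name audit_robots and the two bot lists are illustrative assumptions. The standard-library parser applies the first matching rule in file order, so for files that mix Allow and Disallow paths its verdict may differ from a given crawler's own matching.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the table in this post, grouped by role.
# The grouping and the names below are illustrative, not an official list.
SEARCH_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
               "PerplexityBot", "Perplexity-User"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]


def audit_robots(robots_text: str, path: str = "/") -> dict[str, bool]:
    """Return {user_agent: allowed} for each AI bot, judged on one path."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {bot: parser.can_fetch(bot, path)
            for bot in SEARCH_BOTS + TRAINING_BOTS}


if __name__ == "__main__":
    # Sample file: one named block plus a harmless wildcard group.
    sample = (
        "User-agent: OAI-SearchBot\n"
        "Disallow: /\n"
        "\n"
        "User-agent: *\n"
        "Disallow: /wp-admin/\n"
    )
    for bot, allowed in audit_robots(sample).items():
        print(f"{bot:18} {'allowed' if allowed else 'BLOCKED'}")
```

To audit a live site, paste the contents of your own robots.txt into audit_robots. The sketch only judges a single path, so a block on a subdirectory needs a second call with that path.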
Common ways sites end up accidentally blocked
- Security plugins like Wordfence or iThemes Security on WordPress sometimes add aggressive Disallow rules for unknown bots by default.
- CDN or WAF configurations (Cloudflare, Sucuri) that block user-agents not on an allowlist. AI crawlers are new enough that many default lists do not include them.
- A template robots.txt copied from a high-traffic site that deliberately blocks all third-party crawlers.
- A developer testing a staging block who published the robots.txt to production by mistake.
- Rate-limiting rules that send 429s to any bot hitting more than a few pages per minute. Robots.txt technically allows the crawl but the server refuses it in practice.
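Several of these failures happen outside robots.txt, so it can help to compare the robots.txt verdict with the HTTP status a bot-like request actually receives. The sketch below is one hedged way to do that. The helper names probe_status and diagnose are hypothetical, and the status groupings are a rough assumption rather than a rule. Sending a bot's user-agent token from your own machine only exercises user-agent filtering; a WAF that verifies crawler IP ranges may treat the real bot differently.

```python
import urllib.error
import urllib.request


def probe_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Fetch url with the given User-Agent header and return the HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions; the code is what we want.
        return err.code


def diagnose(robots_allows: bool, http_status: int) -> str:
    """Combine the robots.txt verdict with the status a bot-like request got."""
    if not robots_allows:
        return "blocked by robots.txt"
    if http_status == 429:
        return "allowed by robots.txt, but rate-limited by the server"
    if http_status in (401, 403):
        return "allowed by robots.txt, but refused by server, WAF, or plugin"
    if 200 <= http_status < 300:
        return "reachable"
    return f"allowed by robots.txt, unexpected status {http_status}"
```

A call such as diagnose(True, probe_status("https://example.com/", "OAI-SearchBot")) would report whether the server answers that token at all. Here example.com is a placeholder, and the bare token stands in for the longer user-agent string a real crawler sends.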
The corrected robots.txt snippet
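If you want to block training crawlers but stay visible in AI-generated answers, here is the explicit configuration. Add these groups to your robots.txt alongside your wildcard block. Crawlers that follow the Robots Exclusion Protocol (RFC 9309) use the group that names them in preference to the wildcard, regardless of position; putting the named groups first simply keeps the file easy to read.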
robots.txt: allow search-and-answer bots, block training bots

    # Allow the search/answer bots that power live AI responses
    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Perplexity-User
    Allow: /

    User-agent: Claude-SearchBot
    Allow: /

    # Block training-only crawlers (optional, legitimate choice)
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # Your existing wildcard rule below this line
    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php
If you want full AI visibility with no restrictions, simply remove all the named bot blocks above and let the wildcard rule handle everything. The explicit Allow rules are only needed when you have a Disallow somewhere that would otherwise catch these bots.
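If you maintain this file across several sites, generating it from two lists keeps the allow and block decisions explicit. The following is a minimal sketch under that assumption. The function build_robots_txt and its parameters are hypothetical names, and the output simply mirrors the group layout of the snippet in this post.

```python
def build_robots_txt(allow: list[str], block: list[str],
                     wildcard_rules: list[str]) -> str:
    """Render one robots.txt group per named bot, then the wildcard group."""
    groups = []
    for bot in allow:
        groups.append(f"User-agent: {bot}\nAllow: /")
    for bot in block:
        groups.append(f"User-agent: {bot}\nDisallow: /")
    # The wildcard group carries the site's existing rules unchanged.
    groups.append("User-agent: *\n" + "\n".join(wildcard_rules))
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(
        allow=["OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
               "Perplexity-User", "Claude-SearchBot"],
        block=["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"],
        wildcard_rules=["Disallow: /wp-admin/",
                        "Allow: /wp-admin/admin-ajax.php"],
    ))
```

Passing an empty block list produces the full-visibility variant while keeping the search-and-answer groups explicit.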
The trade-off worth thinking through
Blocking training crawlers is not a neutral act, but it is not a harmful one either. You are choosing not to contribute your content to someone else's training corpus. Your site stays out of the model weights. That is a reasonable preference for many publishers.
The argument for allowing training crawlers is less about direct visibility and more about indirect influence. If your content shapes the model, the model's general knowledge reflects your framing, your terminology, your perspective. Research on generative engine optimization (Aggarwal et al., arXiv:2311.09735) focuses on how content gets surfaced in generated answers; the further idea that being part of training data influences how a model discusses related topics, even without citing you directly, is plausible but much harder to measure. Whether that is worth it for your site is a judgment call.
The argument against blocking search-and-answer crawlers is much simpler: there is no upside. These bots fetch publicly available pages and surface them in answer results, the same way Google's crawler does for web search. Blocking them removes a distribution channel with no compensating benefit.
55%+ crawl coverage already reached by OAI-SearchBot in a Hostinger study (Search Engine Journal)
One silent failure worth knowing about
Unlike a Google penalty, which eventually shows up in Search Console as a manual action or a traffic drop you can investigate, a robots.txt block for AI crawlers produces no alert. ChatGPT does not send you an email. Perplexity does not log a warning in any dashboard you can access. Your pages simply stop appearing in AI-generated answers, and unless you are actively testing your own site in those tools, you will not notice.
That is the core reason to run an audit now rather than assuming your current configuration is fine. The defaults from 2022 or 2023 almost certainly did not account for OAI-SearchBot or PerplexityBot. They did not exist yet.
What to do next
Start by pulling up your robots.txt. Scan it for the user-agents in the table above. If you find a Disallow on any of the search-and-answer bots, fix it using the snippet above. If the file looks clean at the robots.txt level, check your server-side WAF rules and any security plugins for bot filtering logic that might be operating separately.
If you want a faster read without digging through files manually, the free AI Crawler Checker tool linked below fetches your robots.txt and highlights any rules affecting known AI crawlers. It takes about ten seconds and flags training-only blocks separately from search-and-answer blocks so you know exactly what you are dealing with.