Is Your Site Blocking GPTBot? How to Check robots.txt for AI Crawlers

A single accidental Disallow can erase you from ChatGPT, Perplexity, and Gemini answers without any warning.

Harsh Rana · May 21, 2026 · 8 min read

The short answer

To check whether your site blocks AI crawlers, fetch your robots.txt and look for Disallow rules targeting GPTBot, OAI-SearchBot, PerplexityBot, or ClaudeBot. Pay particular attention to the search-and-answer bots (OAI-SearchBot and PerplexityBot): blocking those removes your pages from live AI responses, whereas blocking GPTBot or ClaudeBot only opts you out of training.

55%+ of websites were crawled by OpenAI's OAI-SearchBot in a Hostinger study, a sign of how aggressively search-and-answer bots now index the web.

Here is a situation that plays out constantly. A developer blocks all unknown bots after a traffic spike. Or a WordPress security plugin adds a blanket Disallow rule. Or someone copies a robots.txt template from 2019 that predates AI crawlers entirely. The site looks fine in Google Search Console. Organic traffic holds steady. But quietly, the site has vanished from every AI-generated answer.

The robots.txt file is small, easy to overlook, and now more consequential than it has been in years. This post walks through every major AI user-agent you need to know, explains the difference between training crawlers and search-and-answer crawlers (the distinction almost everyone gets wrong), and shows you the exact fix.

The complete AI crawler list, grouped by vendor

Before you can audit your robots.txt, you need to know which user-agents to look for. The list has grown fast. Here is the full set worth tracking as of mid-2026.

| User-Agent | Vendor | Purpose | Type |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Crawls pages to train future GPT models | Training |
| OAI-SearchBot | OpenAI | Builds the live search index powering ChatGPT search | Search / Answer |
| ChatGPT-User | OpenAI | Fetches URLs in real time when a user pastes a link in ChatGPT | Live Fetch / Answer |
| ClaudeBot | Anthropic | Crawls pages to train Claude models | Training |
| Claude-SearchBot | Anthropic | Crawls for Claude's live web-grounded answers | Search / Answer |
| anthropic-ai | Anthropic | Legacy Anthropic crawler identifier (older crawl runs) | Training |
| Google-Extended | Google | Gemini and Vertex AI training opt-out token | Training |
| PerplexityBot | Perplexity | Indexes pages for Perplexity's answer engine | Search / Answer |
| Perplexity-User | Perplexity | Fetches URLs live during a Perplexity session | Live Fetch / Answer |
| CCBot | Common Crawl | Open web crawl used as training data by many AI labs | Training |
| Bytespider | ByteDance | Crawls for TikTok and ByteDance AI products | Training / Mixed |
| Amazonbot | Amazon | Crawls for Alexa and Amazon AI features | Training / Mixed |
| Applebot-Extended | Apple | Apple Intelligence and Siri training opt-out token | Training |
| Meta-ExternalAgent | Meta | Crawls for Meta AI training and product features | Training / Mixed |

The one distinction that actually matters: training vs. search-and-answer

Most of the confusion around AI crawler blocking comes from treating all AI bots as the same thing. They are not. There are two fundamentally different jobs these crawlers do, and blocking them has opposite consequences.

Training crawlers

GPTBot, Google-Extended, ClaudeBot, CCBot, and most of the others in the list above are training crawlers. They collect your content so it can be used to train or fine-tune a model. Blocking them is a legitimate, widely accepted choice. You opt your content out of the training corpus. The model does not learn from your pages. That is the full consequence. The model still exists, still answers questions, and still cites other sources. You just are not in the training data.

Search and answer crawlers

OAI-SearchBot, ChatGPT-User, Claude-SearchBot, PerplexityBot, and Perplexity-User work differently. These bots build or refresh a live retrieval index. When a user asks ChatGPT a question and the answer cites a source, that source was found by OAI-SearchBot. When Perplexity quotes your article, PerplexityBot crawled it. Block these bots and you do not just opt out of training. You disappear from live answers entirely.

Blocking GPTBot is a training opt-out. Blocking OAI-SearchBot is a visibility opt-out. Most people do not realize those are two different robots.
Ron

A Hostinger study cited by Search Engine Journal found that OpenAI's OAI-SearchBot had already surpassed 55% crawl coverage across tracked sites. That is a substantial footprint for a crawler that did not exist a couple of years ago. It means the bot is actively visiting a majority of sites, and any Disallow rule it hits produces a real, immediate gap in ChatGPT's search results.
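
The two consequences can be written down as a small lookup. This is an illustrative sketch only: the `CRAWLER_TYPES` mapping restates a subset of the table in this post (it is not an official registry), and `blocking_consequences` is a hypothetical helper name, not part of any vendor tooling.

```python
# Crawler roles as described in this post (illustrative subset, not an official registry).
CRAWLER_TYPES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Google-Extended": "training",
    "CCBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    "Perplexity-User": "search",
}


def blocking_consequences(blocked_agents):
    """Group blocked user-agents by what blocking them costs you."""
    result = {"training_opt_out": [], "visibility_loss": [], "unknown": []}
    for agent in blocked_agents:
        role = CRAWLER_TYPES.get(agent)
        if role == "training":
            result["training_opt_out"].append(agent)
        elif role == "search":
            result["visibility_loss"].append(agent)
        else:
            # Not in the mapping above; look the bot up before deciding.
            result["unknown"].append(agent)
    return result


if __name__ == "__main__":
    print(blocking_consequences(["GPTBot", "OAI-SearchBot"]))
```

Anything that lands in `visibility_loss` is worth fixing first; entries under `training_opt_out` are a policy choice rather than a bug.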

How to check your robots.txt right now

Your robots.txt lives at yourdomain.com/robots.txt. Open it in a browser. You are looking for four things.

  1. A wildcard block that catches everything: a User-agent: * line followed by Disallow: / or a broad path. If this exists and no group names a specific bot with its own Allow rule, every crawler that obeys robots.txt, AI bots included, is blocked.
  2. Named bot blocks: search the file for GPTBot, OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeBot, and Claude-SearchBot. Any Disallow: / under one of these names blocks that specific bot.
  3. Specificity, not order: a group that names a bot overrides the wildcard for that bot, wherever it sits in the file. So even if User-agent: * allows everything, an explicit User-agent: OAI-SearchBot with Disallow: / underneath blocks that crawler.
  4. Empty or missing file: a missing robots.txt means no restrictions at all, which is generally fine. The problems come from files with rules.
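
The same checks can be scripted. The sketch below uses Python's standard-library `urllib.robotparser`; the `audit_robots` helper, the `AI_AGENTS` list, and the `SAMPLE` file are assumptions made for illustration. One caveat: Python's parser applies the first matching rule within a group rather than the longest match, so for files with intricate Allow/Disallow mixes its verdict can differ from what a given crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Bots named in the checklist above.
AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User",
             "PerplexityBot", "ClaudeBot", "Claude-SearchBot"]


def audit_robots(robots_text, agents=AI_AGENTS, path="/"):
    """Return {agent: bool}: whether each agent may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


# Hypothetical file: wildcard blocks everything, one named group re-allows a bot.
SAMPLE = """\
User-agent: *
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

if __name__ == "__main__":
    for agent, allowed in audit_robots(SAMPLE).items():
        print(f"{agent:18} {'allowed' if allowed else 'BLOCKED'}")
```

To audit a live site instead of a pasted string, `RobotFileParser` also provides `set_url()` and `read()` to fetch the file over HTTP.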

Common ways sites end up accidentally blocked

The usual causes are the ones described at the top of this post: a blanket bot block added after a traffic spike, a security plugin that writes a broad Disallow rule, or a robots.txt template copied from before AI crawlers existed.

The corrected robots.txt snippet

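If you want to block training crawlers but stay visible in AI-generated answers, here is the explicit configuration. Add these named groups to your robots.txt. Crawlers that follow the Robots Exclusion Protocol (RFC 9309) obey the most specific matching User-agent group regardless of where it appears, so placing the named groups above your wildcard block is a readability convention rather than a requirement.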

robots.txt: allow search-and-answer bots, block training bots

# Allow the search/answer bots that power live AI responses
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Block training-only crawlers (optional, legitimate choice)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Your existing wildcard rule below this line
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

If you want full AI visibility with no restrictions, simply remove all the named bot blocks above and let the wildcard rule handle everything. The explicit Allow rules are only needed when you have a Disallow somewhere that would otherwise catch these bots.
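
If you maintain robots.txt for several sites, generating the named groups from two lists can keep the allow and block decisions in one place. A minimal sketch, assuming a hypothetical helper called `build_robots_txt`; the bot lists and wildcard rules are whatever you pass in, and nothing here is specific to any CMS.

```python
def build_robots_txt(allow_agents, block_agents, wildcard_rules):
    """Render robots.txt text: one group per named agent, then the wildcard group."""
    overlap = set(allow_agents) & set(block_agents)
    if overlap:
        # A bot listed on both sides is almost certainly a mistake.
        raise ValueError(f"agents both allowed and blocked: {sorted(overlap)}")

    lines = []
    for agent in allow_agents:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    for agent in block_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]

    # Existing wildcard rules go last, unchanged.
    lines.append("User-agent: *")
    lines += list(wildcard_rules)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(
        allow_agents=["OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
                      "Perplexity-User", "Claude-SearchBot"],
        block_agents=["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"],
        wildcard_rules=["Disallow: /wp-admin/", "Allow: /wp-admin/admin-ajax.php"],
    ))
```

The overlap check is the main design choice: it turns the most likely copy-paste error into a loud failure instead of a silent block.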

The trade-off worth thinking through

Blocking training crawlers is not a neutral act, but it is not a harmful one either. You are choosing not to contribute your content to someone else's training corpus. Your site stays out of the model weights. That is a reasonable preference for many publishers.

The argument for allowing training crawlers is less about direct visibility and more about indirect influence. If your content shapes the model, the model's general knowledge reflects your framing, your terminology, your perspective. Research on generative engine optimization (Aggarwal et al., arXiv:2311.09735) shows that how content is written affects its visibility in generative engine responses, though it studies retrieval-based answers rather than training data, so the indirect-influence case remains plausible rather than proven. Whether that is worth it for your site is a judgment call.

The argument against blocking search-and-answer crawlers is much simpler: there is no upside. These bots fetch publicly available pages and surface them in answer results, the same way Google's crawler does for web search. Blocking them removes a distribution channel with no compensating benefit.

One silent failure worth knowing about

Unlike a Google penalty, which eventually shows up in Search Console as a manual action or a traffic drop you can investigate, a robots.txt block for AI crawlers produces no alert. ChatGPT does not send you an email. Perplexity does not log a warning in any dashboard you can access. Your pages simply stop appearing in AI-generated answers, and unless you are actively testing your own site in those tools, you will not notice.

That is the core reason to run an audit now rather than assuming your current configuration is fine. The defaults from 2022 or 2023 almost certainly did not account for OAI-SearchBot or PerplexityBot. They did not exist yet.
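
Since nothing alerts you, your own access logs are one place to look for evidence. The sketch below tallies HTTP status codes per AI crawler. It assumes logs in the common "combined" format with the user-agent as the last quoted field; the `tally_ai_hits` helper and the sample lines are made up for illustration. Requests dropped at a CDN or WAF edge may never reach these logs, so a bot that is missing entirely is also a signal.

```python
import re
from collections import Counter, defaultdict

AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User",
             "PerplexityBot", "Perplexity-User", "ClaudeBot", "Claude-SearchBot"]

# Combined Log Format tail: "REQUEST" STATUS SIZE "REFERER" "USER-AGENT"
LOG_LINE = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')


def tally_ai_hits(log_lines, agents=AI_AGENTS):
    """Count HTTP status codes per AI crawler seen in access-log lines."""
    counts = defaultdict(Counter)
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        ua = match.group("ua").lower()
        for agent in agents:
            if agent.lower() in ua:
                counts[agent][match.group("status")] += 1
    return {agent: dict(c) for agent, c in counts.items()}


# Fabricated sample lines (documentation IP range, simplified user-agents).
SAMPLE_LOG = [
    '203.0.113.5 - - [21/May/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.6 - - [21/May/2026:10:00:05 +0000] "GET /post HTTP/1.1" 403 162 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.7 - - [21/May/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"',
]

if __name__ == "__main__":
    print(tally_ai_hits(SAMPLE_LOG))
```

A run of 403 responses for a search-and-answer bot points at server-side filtering rather than robots.txt.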

What to do next

Start by pulling up your robots.txt. Scan it for the user-agents in the table above. If you find a Disallow on any of the search-and-answer bots, fix it using the snippet above. If the file looks clean at the robots.txt level, check your server-side WAF rules and any security plugins for bot filtering logic that might be operating separately.

If you want a faster read without digging through files manually, the free AI Crawler Checker tool linked below fetches your robots.txt and highlights any rules affecting known AI crawlers. It takes about ten seconds and flags training-only blocks separately from search-and-answer blocks so you know exactly what you are dealing with.

Questions

Does blocking GPTBot hurt my SEO?

No. GPTBot is a training crawler with no connection to Google's search ranking system. Blocking it has zero effect on your Google rankings or organic traffic. It only opts your content out of OpenAI's training data.

If I block Google-Extended, will my site still appear in Google Search?

Yes. Google-Extended is specifically the opt-out signal for Gemini and Vertex AI training. Googlebot, which handles web search indexing, is a separate user-agent and is not affected by a Google-Extended Disallow rule.

How often should I check my robots.txt for new AI crawlers?

Every six months is a reasonable minimum right now given how fast the space is moving. New products from established AI labs often ship a new crawler user-agent with little fanfare. Checking once a quarter is not excessive if you are serious about AI search visibility.

What if my robots.txt allows these bots but a firewall blocks them at the IP level?

The robots.txt rule is irrelevant if the request never reaches your server. Firewall or WAF blocks that drop traffic from known AI crawler IP ranges override robots.txt entirely. You would need to allowlist those IP ranges in your firewall settings, or check whether your WAF has a named rule for AI bots that can be toggled off.

Harsh Rana

I build Ron at 617 Software Studio, a small Boston shop. I run real AI visibility audits by hand and pour what I learn into how Ron works. These notes come from the actual reports, not a content brief.
