Does blocking GPTBot hurt my SEO?

No. GPTBot is a training crawler with no connection to Google's search ranking system. Blocking it has zero effect on your Google rankings or organic traffic. It only opts your content out of OpenAI's training data.

If I block Google-Extended, will my site still appear in Google Search?

Yes. Google-Extended is specifically the opt-out signal for Gemini and Vertex AI training. Googlebot, which handles web search indexing, is a separate user-agent and is not affected by a Google-Extended Disallow rule.

How often should I check my robots.txt for new AI crawlers?

At least every six months, given how fast the space is moving. New products from established AI labs often ship a new crawler user-agent with little fanfare, so checking once a quarter is not excessive if you are serious about AI search visibility.

What if my robots.txt allows these bots but a firewall blocks them at the IP level?

The robots.txt rule is irrelevant if the request never reaches your server. Firewall or WAF blocks that drop traffic from known AI crawler IP ranges override robots.txt entirely. You would need to allowlist those IP ranges in your firewall settings, or check whether your WAF has a named rule for AI bots that can be toggled off.

Is Your Site Blocking GPTBot? Check robots.txt for AI Crawlers

A single accidental Disallow can erase you from ChatGPT, Perplexity, and Gemini answers without any warning.

Here is a situation that plays out constantly. A developer blocks all unknown bots after a traffic spike. Or a WordPress security plugin adds a blanket Disallow rule. Or someone copies a robots.txt template from 2019 that predates AI crawlers entirely. The site looks fine in Google Search Console. Organic traffic holds steady. But quietly, the site has vanished from every AI-generated answer.

The robots.txt file is small, easy to overlook, and now more consequential than it has been in years. This post walks through every major AI user-agent you need to know, explains the difference between training crawlers and search-and-answer crawlers (the distinction almost everyone gets wrong), and shows you the exact fix.

The complete AI crawler list, grouped by vendor

Before you can audit your robots.txt, you need to know which user-agents to look for. The list has grown fast. Here is the full set worth tracking as of mid-2026.

User-Agent	Vendor	Purpose	Type
GPTBot	OpenAI	Crawls pages to train future GPT models	Training
OAI-SearchBot	OpenAI	Builds the live search index powering ChatGPT search	Search / Answer
ChatGPT-User	OpenAI	Fetches URLs in real time when a user pastes a link in ChatGPT	Live Fetch / Answer
ClaudeBot	Anthropic	Crawls pages to train Claude models	Training
Claude-SearchBot	Anthropic	Crawls for Claude's live web-grounded answers	Search / Answer
anthropic-ai	Anthropic	Legacy Anthropic crawler identifier (older crawl runs)	Training
Google-Extended	Google	Gemini and Vertex AI training opt-out token	Training
PerplexityBot	Perplexity	Indexes pages for Perplexity's answer engine	Search / Answer
Perplexity-User	Perplexity	Fetches URLs live during a Perplexity session	Live Fetch / Answer
CCBot	Common Crawl	Open web crawl used as training data by many AI labs	Training
Bytespider	ByteDance	Crawls for TikTok and ByteDance AI products	Training / Mixed
Amazonbot	Amazon	Crawls for Alexa and Amazon AI features	Training / Mixed
Applebot-Extended	Apple	Apple Intelligence and Siri training opt-out token	Training
Meta-ExternalAgent	Meta	Crawls for Meta AI training and product features	Training / Mixed

The one distinction that actually matters: training vs. search-and-answer

Most of the confusion around AI crawler blocking comes from treating all AI bots as the same thing. They are not. There are two fundamentally different jobs these crawlers do, and blocking them has opposite consequences.

Training crawlers

GPTBot, Google-Extended, ClaudeBot, CCBot, and most of the others in the list above are training crawlers. They collect your content so it can be used to train or fine-tune a model. Blocking them is a legitimate, widely accepted choice. You opt your content out of the training corpus, and the model does not learn from your pages. That is the full consequence. The model still exists, still answers questions, and still cites other sources. You just are not in the training data.

Search and answer crawlers

OAI-SearchBot, ChatGPT-User, Claude-SearchBot, PerplexityBot, and Perplexity-User work differently. OAI-SearchBot, Claude-SearchBot, and PerplexityBot build or refresh a live retrieval index, while ChatGPT-User and Perplexity-User fetch individual URLs on demand during a user's session. When a user asks ChatGPT a question and the answer cites a source, that source was typically found through the index OAI-SearchBot built. When Perplexity quotes your article, PerplexityBot crawled it. Block these bots and you do not just opt out of training. You disappear from live answers entirely.

Blocking GPTBot is a training opt-out. Blocking OAI-SearchBot is a visibility opt-out. Most people do not realize those are two different robots.

Ron

A Hostinger study cited by Search Engine Journal found that OpenAI's OAI-SearchBot had already surpassed 55% crawl coverage across tracked sites. That is a substantial footprint for a crawler that did not exist a couple of years ago. It means the bot is actively visiting a majority of sites, and any Disallow rule it hits produces a real, immediate gap in ChatGPT's search results.

How to check your robots.txt right now

Your robots.txt lives at yourdomain.com/robots.txt. Open it in a browser. You are looking for four things.

A wildcard block that catches everything: a User-agent: * line followed by Disallow: / or a broad path. Any crawler without its own named group falls back to this group, so unless a bot is named elsewhere in the file, it is blocked along with everything else.
Named bot blocks: search the file for GPTBot, OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeBot, and Claude-SearchBot. Any Disallow: / under one of these names blocks that specific bot.
Specificity over order: a crawler obeys the most specific group that matches its name and ignores the wildcard group, wherever either appears in the file. So even if User-agent: * allows everything, an explicit User-agent: OAI-SearchBot with Disallow: / underneath blocks that crawler.
Empty or missing file: a missing robots.txt means no restrictions at all, which is generally fine. The problems come from files with rules.
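
The checks above can be roughly automated. What follows is a minimal sketch using Python's standard-library urllib.robotparser. The function name audit_ai_crawlers and the two agent lists (copied from the table earlier in this post) are illustrative choices, not part of any official tool. Python's parser applies the first matching rule within a group rather than the longest match, so treat the output as an approximation of how each vendor's crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Agent names taken from the table in this post; extend as new crawlers appear.
SEARCH_ANSWER_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
                      "PerplexityBot", "Perplexity-User"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]


def audit_ai_crawlers(robots_txt: str, path: str = "/") -> dict[str, list[str]]:
    """Return the AI crawlers that may not fetch `path`, grouped by type."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "search_answer_blocked": [bot for bot in SEARCH_ANSWER_BOTS
                                  if not parser.can_fetch(bot, path)],
        "training_blocked": [bot for bot in TRAINING_BOTS
                             if not parser.can_fetch(bot, path)],
    }


if __name__ == "__main__":
    # A blanket wildcard block with a single named exception.
    sample = """\
User-agent: *
Disallow: /

User-agent: PerplexityBot
Allow: /
"""
    for group, bots in audit_ai_crawlers(sample).items():
        print(group, bots)
```

With the sample file, every listed bot except PerplexityBot is reported as blocked, because only PerplexityBot has a named group that keeps it from falling back to the wildcard group.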

Common ways sites end up accidentally blocked

✓ Security plugins like Wordfence or iThemes Security on WordPress sometimes add aggressive Disallow rules for unknown bots by default.
✓ CDN or WAF configurations (Cloudflare, Sucuri) that block user-agents not on an allowlist. AI crawlers are new enough that many default lists do not include them.
✓ A template robots.txt copied from a high-traffic site that deliberately blocks all third-party crawlers.
✓ A developer testing a staging block who published the robots.txt to production by mistake.
✓ Rate-limiting rules that send 429s to any bot hitting more than a few pages per minute. Robots.txt technically allows the crawl but the server refuses it in practice.
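
Server-side refusals such as a 403 or 429 never show up in robots.txt, so it can help to request a page while identifying as a crawler and look at the status code. The sketch below is one way to do that with the standard library. The function names are hypothetical, and sending only the bare token (for example "GPTBot") as the User-Agent header is a simplifying assumption: real crawlers send longer strings. It can only reveal user-agent filtering, not IP-range blocks, because the request still comes from your own address.

```python
import urllib.error
import urllib.request


def classify_status(status: int) -> str:
    """Map an HTTP status code to a rough verdict for a crawler request."""
    if 200 <= status < 300:
        return "served"
    if status in (401, 403):
        return "blocked"
    if status == 429:
        return "rate-limited"
    return "inconclusive"


def probe(url: str, bot_token: str, timeout: float = 10.0) -> str:
    """Request `url` while identifying as `bot_token` and classify the response."""
    request = urllib.request.Request(url, headers={"User-Agent": bot_token})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still what we want.
        return classify_status(err.code)
    except OSError:
        # DNS failure, refused connection, timeout, or a silently dropped request.
        return "inconclusive"
```

Calling probe("https://yourdomain.com/", "OAI-SearchBot") and then repeating the call with an ordinary browser user-agent gives a quick comparison. A "blocked" result for the bot alongside "served" for the browser string points to a user-agent rule in a plugin or WAF.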

The corrected robots.txt snippet

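If you want to block training crawlers but stay visible in AI-generated answers, here is the explicit configuration. Crawlers that follow the Robots Exclusion Protocol (RFC 9309) pick the most specific user-agent group that matches them regardless of where it sits in the file, so the named groups take priority on their own. Placing them above your wildcard block is still a sensible habit for readability and for any simpler parser that reads top to bottom.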

robots.txt: allow search-and-answer bots, block training bots

# Allow the search/answer bots that power live AI responses
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Block training-only crawlers (optional, legitimate choice)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Your existing wildcard rule below this line
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

If you want full AI visibility with no restrictions, simply remove all the named bot blocks above and let the wildcard rule handle everything. The explicit Allow rules are only needed when you have a Disallow somewhere that would otherwise catch these bots.
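
As a sanity check before deploying a configuration like the block-training, allow-search one, you can feed it to a parser and ask about specific agents and paths. This is a minimal sketch with Python's urllib.robotparser, using an abbreviated copy of the snippet. build_parser is a hypothetical helper name and SomeOtherBot is a made-up agent standing in for any crawler without a named group.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the configuration shown in this post.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    checks = [("OAI-SearchBot", "/blog/"), ("GPTBot", "/blog/"),
              ("OAI-SearchBot", "/wp-admin/"), ("SomeOtherBot", "/wp-admin/")]
    for agent, path in checks:
        verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
        print(f"{agent:15} {path:12} {verdict}")
```

One detail this surfaces: a bot with its own group ignores the wildcard group entirely, so OAI-SearchBot is allowed into /wp-admin/ here while the unnamed bot is not. If the admin exclusion should apply to the named bots too, repeat the Disallow: /wp-admin/ line inside each of their groups.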

The trade-off worth thinking through

Blocking training crawlers is not a neutral act, but it is not a harmful one either. You are choosing not to contribute your content to someone else's training corpus. Your site stays out of the model weights. That is a reasonable preference for many publishers.

The argument for allowing training crawlers is less about direct visibility and more about indirect influence. If your content shapes the model, the model's general knowledge reflects your framing, your terminology, your perspective. Research on generative engine optimization (Aggarwal et al., arXiv:2311.09735) shows that how content is written affects how often generative engines surface it, though that work measures visibility in generated answers rather than training-data effects. The idea that being part of training data influences how a model discusses related topics, even without citing you directly, is plausible but harder to measure. Whether that is worth it for your site is a judgment call.

The argument against blocking search-and-answer crawlers is much simpler: there is no upside. These bots fetch publicly available pages and surface them in answer results, the same way Google's crawler does for web search. Blocking them removes a distribution channel with no compensating benefit.

One silent failure worth knowing about

Unlike a Google penalty, which eventually shows up in Search Console as a manual action or a traffic drop you can investigate, a robots.txt block for AI crawlers produces no alert. ChatGPT does not send you an email. Perplexity does not log a warning in any dashboard you can access. Your pages simply stop appearing in AI-generated answers, and unless you are actively testing your own site in those tools, you will not notice.

That is the core reason to run an audit now rather than assuming your current configuration is fine. The defaults from 2022 or 2023 almost certainly did not account for OAI-SearchBot or PerplexityBot. They did not exist yet.

What to do next

Start by pulling up your robots.txt. Scan it for the user-agents in the table above. If you find a Disallow on any of the search-and-answer bots, fix it using the snippet above. If the file looks clean at the robots.txt level, check your server-side WAF rules and any security plugins for bot filtering logic that might be operating separately.

If you want a faster read without digging through files manually, the free AI Crawler Checker tool linked below fetches your robots.txt and highlights any rules affecting known AI crawlers. It takes about ten seconds and flags training-only blocks separately from search-and-answer blocks so you know exactly what you are dealing with.
