Should You Allow or Block AI Crawlers? (GPTBot, ClaudeBot, PerplexityBot, Google-Extended)

AI crawlers do three different jobs: train models, build the citation index, and fetch pages on demand. Here is which to allow, which to block, and a ready-to-copy robots.txt.

Rankry Team

Short answer: allow the bots that read your site to answer questions and cite you, and make a deliberate choice about the bots that only collect content to train models. Those are two different jobs, and treating them as one is the mistake most sites make.

Blocking the wrong bot can quietly remove you from ChatGPT, Claude, and Perplexity answers within hours. Blocking the right one costs you nothing you were ever going to get back. This guide explains which is which, gives you a ready robots.txt to copy, and shows you how to confirm it actually worked.

The AI crawler landscape

There is no single “AI bot.” There is a cast of them, and each major provider runs more than one. Here are the ones that matter in 2026, grouped by who runs them.

OpenAI runs three. GPTBot collects content to train OpenAI’s models. OAI-SearchBot builds the index behind ChatGPT Search and is used for retrieval and citation, not training. ChatGPT-User fetches a single page on demand when someone inside ChatGPT asks the model to open a URL.

Anthropic runs three. ClaudeBot collects content for training. Claude-SearchBot powers Claude’s search and retrieval, and it is controllable separately from ClaudeBot. Claude-User fetches a page on demand when a user asks Claude to look at a link.

Perplexity runs two. PerplexityBot indexes pages so they can be cited in Perplexity answers. Perplexity-User fetches a page live to answer a specific question.

Google and Apple work differently. Google-Extended is not a separate crawler. It is a token in robots.txt that controls whether your content is used for Gemini and other Google AI products. Blocking it does not affect your Google Search ranking at all, because crawling for Search is handled by Googlebot. Apple uses the same pattern: Applebot-Extended opts you out of Apple’s AI training, while regular Applebot still powers Siri and Spotlight.

Then there is the rest: CCBot from Common Crawl, whose dataset feeds many models, plus Amazonbot, Meta-ExternalAgent, and a long tail of scrapers. Most of these are training collectors.

Here is the whole cast in one view, sorted by the job each bot does:

| Bot | Provider | Job | What it does | Our call |
|---|---|---|---|---|
| GPTBot | OpenAI | Training | Collects content to train OpenAI’s models | Decide |
| ClaudeBot | Anthropic | Training | Collects content to train Anthropic’s models | Decide |
| CCBot | Common Crawl | Training | Open dataset that feeds many models | Decide |
| Google-Extended | Google | Training (token) | Opts your content in or out of Gemini and Google AI | Decide |
| Applebot-Extended | Apple | Training (token) | Opts your content in or out of Apple AI training | Decide |
| Meta-ExternalAgent | Meta | Training | Collects content for Meta’s AI | Decide |
| OAI-SearchBot | OpenAI | Citation and retrieval | Builds the index behind ChatGPT Search | Always allow |
| Claude-SearchBot | Anthropic | Citation and retrieval | Powers Claude’s search and retrieval | Always allow |
| PerplexityBot | Perplexity | Citation and retrieval | Indexes pages to cite in Perplexity answers | Always allow |
| ChatGPT-User | OpenAI | On-demand | Fetches a page a user opens inside ChatGPT | Allow |
| Claude-User | Anthropic | On-demand | Fetches a page a user shares with Claude | Allow |
| Perplexity-User | Perplexity | On-demand | Fetches a page live to answer a question | Allow |

Citation bots vs training crawlers

This is the distinction the whole decision rests on, so it is worth being precise.

Training crawlers (GPTBot, ClaudeBot, CCBot, and the Google-Extended and Applebot-Extended tokens) fetch your content to fold into a model’s training data. Two things follow from that. First, they drive no referral traffic, ever. Second, blocking them is not retroactive. Once your content is already in a training set, the model knows it whether you block the bot tomorrow or not. Blocking only affects future training runs.

Citation and retrieval bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) are the opposite. They build the live index that AI engines pull from when they answer a question and name their sources. These are visibility infrastructure. Block one and your pages stop showing up in that engine’s answers, often within hours.

On-demand user bots (ChatGPT-User, Claude-User, Perplexity-User) sit in between. They fetch one page when a specific user pastes your link and asks the AI to read it. Block these and people can no longer bring your URLs into their AI conversations.

So the honest framing is not “should I allow AI bots.” It is “I want the citation and retrieval bots, I have to decide about the training crawlers, and I should ignore the junk.”
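The three categories can be written down as a small lookup, which is handy when auditing a robots.txt file. The sketch below is illustrative Python and not something published by any provider: the names BOT_JOBS, DEFAULT_STANCE, and stance_for are invented for this example, and the stances simply restate the table above.

```python
# Job each bot does, restating the table in this article.
# Google-Extended and Applebot-Extended are robots.txt tokens rather than
# crawlers, but they are configured the same way, so they are listed too.
BOT_JOBS = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "Google-Extended": "training",
    "Applebot-Extended": "training",
    "Meta-ExternalAgent": "training",
    "OAI-SearchBot": "citation",
    "Claude-SearchBot": "citation",
    "PerplexityBot": "citation",
    "ChatGPT-User": "on-demand",
    "Claude-User": "on-demand",
    "Perplexity-User": "on-demand",
}

# Default stance per job: citation and on-demand bots are allowed,
# training bots need a deliberate decision.
DEFAULT_STANCE = {"training": "decide", "citation": "allow", "on-demand": "allow"}


def stance_for(bot_name: str) -> str:
    """Return 'allow', 'decide', or 'unknown' for a robots.txt user-agent token."""
    lookup = {name.lower(): job for name, job in BOT_JOBS.items()}
    job = lookup.get(bot_name.strip().lower())
    return DEFAULT_STANCE.get(job, "unknown")
```

Anything that comes back as "unknown" is a bot this article does not cover, which is a prompt to look it up before deciding.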

The strategic trade-off

For most businesses, the right move is to allow every bot that can cite you, and then make one clean decision about training.

The reason to allow citation and retrieval is simple. AI search is now a real discovery channel, and the only way to appear in it is to let the engines read you. If you block PerplexityBot, you are not in Perplexity. If you block OAI-SearchBot, you are not in ChatGPT Search. Your competitors will be, and their information will answer the questions your customers are asking.

Training is the genuine judgment call, and it splits into two reasonable camps.

  1. Allow training too. If your goal is for AI to know your brand exists and recommend it, being in the training data helps. The model carries that knowledge even when it is not actively browsing. Most brands chasing AI visibility land here.

  2. Block training, allow citation. If you have content you do not want absorbed into a foundation model, or you object to uncompensated training use on principle, you can opt out of training while staying fully visible in AI answers. OpenAI’s own documentation supports this: you can disallow GPTBot while allowing OAI-SearchBot. Anthropic’s Claude-SearchBot is independent from ClaudeBot for the same reason. The decision is per category, not all or nothing.

There is now a cleaner way to express that intent than block rules alone. In late 2025 Cloudflare introduced the Content Signals Policy, a small addition to robots.txt that states how your content may be used after it is fetched, not just whether it can be fetched. It defines three signals: search, for building a search index and showing links and snippets; ai-input, for using your content in real-time AI answers such as retrieval and grounding; and ai-train, for training or fine-tuning models. You set each to yes or no. A site that wants to be cited but not trained on would set ai-train=no; a site that wants both, like ours, sets ai-train=yes. The policy has already been added to millions of domains, and restrictions expressed this way are worded as a reservation of rights under EU copyright law, which is intended to give them legal weight.
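To make the syntax concrete, here is a minimal sketch of reading a Content-Signal line into yes/no flags. It assumes the comma-separated key=value form shown in this article; parse_content_signal is a hypothetical helper name, not part of any Cloudflare tooling.

```python
def parse_content_signal(line: str) -> dict[str, bool]:
    """Parse 'Content-Signal: search=yes, ai-train=no' into {'search': True, ...}."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-signal":
        raise ValueError("not a Content-Signal line")

    signals: dict[str, bool] = {}
    for part in value.split(","):
        key, sep, setting = part.partition("=")
        if not sep:
            continue  # skip fragments without '=', such as trailing commas
        signals[key.strip().lower()] = setting.strip().lower() == "yes"
    return signals
```

A signal that is missing from the line is simply absent from the result, so the caller can tell "not stated" apart from an explicit no.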

Two warnings before you rely on any of this.

First, robots.txt and content signals are requests, not walls. Well-behaved bots from OpenAI, Anthropic, Perplexity, and Google honor them. Aggressive scrapers ignore them or spoof their user agent. If you truly need to stop a bot, enforce it at the edge with a firewall or bot management rule. A CDN or WAF rule is applied to every request regardless of what robots.txt says.

Second, do not confuse the AI tokens with your search ranking. Blocking GPTBot has no effect on Google. Blocking Google-Extended only opts you out of Gemini and Google’s AI products, never your Google Search ranking.
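On the first warning, the difference between asking and enforcing is easier to see in code. The sketch below is an application-level stand-in for an edge rule, written as a plain WSGI middleware in Python. BlockBotsMiddleware and BLOCKED_AGENTS are hypothetical names, and the approach is deliberately simple: it matches on the User-Agent header only, so it does nothing against a scraper that spoofs its user agent. A real CDN or bot management product would be the better place for this.

```python
# User-agent substrings to refuse outright (lowercase). Illustrative list only.
BLOCKED_AGENTS = ("ccbot", "bytespider")


class BlockBotsMiddleware:
    """WSGI middleware that answers 403 to requests from listed user agents."""

    def __init__(self, app, blocked=BLOCKED_AGENTS):
        self.app = app
        self.blocked = tuple(token.lower() for token in blocked)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in self.blocked):
            body = b"Forbidden"
            headers = [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ]
            start_response("403 Forbidden", headers)
            return [body]
        # Everyone else is passed through to the wrapped application.
        return self.app(environ, start_response)
```

Unlike a Disallow line, this refuses the request whether or not the bot chooses to cooperate.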

Here is the setup we run on rankry.ai. We made the visibility-maximizing choice: allow every bot that can cite us or learn from us, and block only the bulk scrapers that give nothing back. We want the models to recommend us today and know us tomorrow. The live file lists each bot in its own group and closes a few private paths. This is the same configuration, grouped here for readability. Copy it, swap in your domain, and flip the training stance if your priorities differ.

# Updated: 2026-05-01
# ----------------------------------------------------------------------
# Content usage signals (Cloudflare Content Signals Policy)
# search   = yes  -> may index and show links/snippets
# ai-input = yes  -> may use our content in live AI answers (citation)
# ai-train = yes  -> may use our content to train models (we want to be known)
# ----------------------------------------------------------------------
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=yes
Allow: /
Disallow: /api/
Disallow: /app/
Disallow: /admin/
Disallow: /sign_up
Disallow: /sign_in

# ----- Every bot that can cite OR learn us: ALLOW the public site, close private paths -----
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Claude-Web
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: Bingbot
Allow: /
Disallow: /api/
Disallow: /app/
Disallow: /admin/
Disallow: /sign_up
Disallow: /sign_in

# ----- Bulk training scrapers with no referral value: DISALLOW entirely -----
User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

Sitemap: https://rankry.ai/sitemap.xml
Sitemap: https://rankry.ai/blog/sitemap-index.xml

A few notes on why it is built this way. The Content-Signal line states our intent for every bot up front: index us, ground answers on us, and train on us. Every bot that can put us in front of a user, whether by citing us live or by learning about us during training, is allowed across the public site, with only our private and account paths closed. Bots that are not named in the file, such as Claude-SearchBot and Claude-User, fall back to the User-agent: * group, which applies the same rules. CCBot and Bytespider are blocked outright because they are bulk training scrapers that send no traffic back and only cost bandwidth. If your priorities are different and you would rather stay out of training data, set ai-train=no and move the training crawlers like GPTBot and ClaudeBot into the disallow block. The citation and retrieval bots should stay allowed either way, because that is your presence in AI answers.
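If you take the other stance, block training and allow citation, it is worth checking the file before you deploy it. The sketch below is one way to do that with Python's standard urllib.robotparser. The robots.txt text is an illustrative variant written for this example, not our live file, and example.com, TRAINING_BLOCKED_ROBOTS, and access_report are placeholder names.

```python
from urllib.robotparser import RobotFileParser

# Illustrative "block training, allow citation" variant (not the live file).
TRAINING_BLOCKED_ROBOTS = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Disallow: /api/
Disallow: /admin/

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Perplexity-User
Disallow: /api/
Disallow: /admin/
"""


def access_report(robots_text: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return {agent: allowed?} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    bots = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
    for bot, allowed in access_report(
        TRAINING_BLOCKED_ROBOTS, bots, "https://example.com/blog/post"
    ).items():
        print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

One design note: the variant uses Disallow lines only. Python's standard parser applies rules in file order rather than by longest match, so a leading Allow: / would make it report every path as allowed, which is not how the major crawlers read the same file.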

How to verify it worked

Editing the file is the easy part. Confirming it took effect is where sites get a false sense of safety.

  1. Read your own server access logs. Look for each bot by name and confirm the response codes. Requests from allowed bots should get 200 responses, and disallowed bots should stop requesting anything except robots.txt itself. The log is the only proof that matters.

  2. Check your CDN and firewall first. Cloudflare, Fastly, and similar tools can block or challenge bots before a request ever reaches your server. If a bot you allowed is missing from your logs, the block is almost always there, not in the text file.

  3. Expect a harmless warning. Google Search Console may flag the Content-Signal line as syntax it does not understand. That is fine. It is a newer directive, and Cloudflare reports no crawl impact from the warning.

  4. Re-check quarterly. New AI crawlers launch constantly, and providers rename and split their agents. A robots.txt that was correct in January can be incomplete by spring.
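The log check in step 1 can be scripted. Below is a minimal sketch that tallies response codes per bot, assuming your server writes the common "combined" access log format (request line, status, size, referrer, user agent). WATCHED_BOTS, LOG_PATTERN, and bot_status_counts are names invented for this example, and the bot list should be adjusted to whatever you allow or block.

```python
import re
from collections import Counter

# Bots to look for in the User-Agent field (matched case-insensitively).
WATCHED_BOTS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
    "CCBot", "Bytespider",
)

# Matches: "GET /path HTTP/1.1" 200 1234 "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def bot_status_counts(lines, bots=WATCHED_BOTS):
    """Return {bot: Counter({status: hits})}, ignoring fetches of /robots.txt."""
    counts = {}
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # not in the expected log format
        if match.group("path") == "/robots.txt":
            continue  # even disallowed bots are expected to fetch this file
        agent = match.group("agent").lower()
        for bot in bots:
            if bot.lower() in agent:
                counts.setdefault(bot, Counter())[match.group("status")] += 1
                break
    return counts
```

Called on an open log file, for example bot_status_counts(open("access.log")) with your own log path, it returns one counter per bot seen. An allowed bot with mostly 403 or 503 entries points at the CDN or firewall, and a disallowed bot that still shows page hits is ignoring your robots.txt.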

The bottom line

Allowing or blocking AI crawlers is not one switch. Allow the bots that read and cite you, because that is your presence in AI search. Decide deliberately about the training crawlers, and state your intent cleanly with content signals. Then go read your logs and prove it.

If you want to know whether AI engines can actually crawl, parse, and cite your site today, that is exactly what our AI Readiness check measures, across crawl access, rendering, structure, and citation-readiness. And once the bots can reach you, the next job is giving them something clean to lift, which we cover in optimizing your content for AI search.
