Below is a comprehensive table of the major AI answer engine bots that website owners may choose to allow or disallow in their robots.txt files. For each bot it lists the AI engine/company, the User-agent token, and a recommended robots.txt command.
5 key things you need to know about robots.txt files:
- Controls Search Engine Access: The robots.txt file tells web crawlers (like Googlebot) which parts of a website they are allowed or disallowed to access and index.
- Located at Root Directory: It must be placed in the root directory of a website (e.g., https://example.com/robots.txt) to be recognized by search engines.
- Uses Simple Directives: It uses directives like `User-agent`, `Disallow`, and `Allow` to specify crawler behavior for specific bots or all bots (`User-agent: *`); see the example after this list.
- Does Not Enforce Security: It only provides guidelines to bots; it doesn’t physically prevent access. Sensitive data should never rely solely on robots.txt for protection.
- Can Affect SEO: Incorrectly configured robots.txt files can block important pages from being indexed, which may hurt a site’s visibility in search results.
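To make the third point concrete, here is a minimal sketch of a robots.txt file; the folder names are placeholders, not recommendations for any particular site:

```
# Rules for all crawlers
User-agent: *
# Example only: ask bots to skip a hypothetical private folder
Disallow: /private/
# But permit one sub-path within it
Allow: /private/help/
```

Most major crawlers apply the most specific matching rule, so `Allow: /private/help/` overrides the broader `Disallow` for that sub-path.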
List of most common AI answer engine crawl bots:
| AI Engine / Company | User-agent (Crawl Bot) | robots.txt Command Example | Purpose / Notes |
| --- | --- | --- | --- |
| OpenAI (ChatGPT, GPTBot) | `GPTBot` | `User-agent: GPTBot\nAllow: /` | Used by OpenAI to crawl for training ChatGPT |
| OpenAI (ChatGPT, Web Crawler) | `ChatGPT-User` | `User-agent: ChatGPT-User\nAllow: /` | Used by OpenAI to browse for real-time answers |
| Anthropic (Claude) | `ClaudeBot` | `User-agent: ClaudeBot\nAllow: /` | Anthropic’s Claude web crawler |
| Google (Bard, Gemini) | `Google-Extended` | `User-agent: Google-Extended\nAllow: /` | Controls whether Bard/Gemini can use content from Google’s index |
| Perplexity AI | `PerplexityBot` | `User-agent: PerplexityBot\nAllow: /` | Actively crawls for Perplexity’s answer engine |
| Meta (Facebook AI) | `FacebookExternalHit` | `User-agent: FacebookExternalHit\nAllow: /` | Used by Meta products, sometimes for AI purposes |
| CCBot (Common Crawl) | `CCBot` | `User-agent: CCBot\nAllow: /` | Common Crawl data is used by many AI models as a base dataset |
| Amazon (Alexa AI, legacy) | `Amazonbot` | `User-agent: Amazonbot\nAllow: /` | Used by Amazon’s AI teams, though usage is currently limited |
| You.com | `YouBot` | `User-agent: YouBot\nAllow: /` | AI search engine that uses its own crawler |
| NeevaAI (acquired by Snowflake) | `NeevaBot` | `User-agent: NeevaBot\nAllow: /` | Formerly used for AI search, now part of Snowflake (legacy bot) |
| DuckDuckGo (AI Answers) | `DuckDuckBot` | `User-agent: DuckDuckBot\nAllow: /` | Crawler for DuckDuckGo and its AI engine |
| Brave Search (AI Summarizer) | `BraveBot` | `User-agent: BraveBot\nAllow: /` | Powers Brave’s AI search and summarisation tools |
| Andi Search AI | `andi-bot` | `User-agent: andi-bot\nAllow: /` | Crawler for Andi, an AI-powered search assistant |
| Phind | `PhindBot` (not confirmed, rarely declared) | `User-agent: *\nAllow: /` (if unsure) | Likely uses common indexes or 3rd-party bots |
| Apple (Applebot) | `Applebot` | `User-agent: Applebot\nAllow: /` | May be used for Siri and Apple’s upcoming AI tools |
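For example, a site owner who wants to opt out of AI training crawlers while leaving the site open to everything else could combine several of the user-agents above. This is only an illustrative sketch; which bots to block is a policy decision for each site:

```
# Example only: block common AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# All other crawlers remain allowed
User-agent: *
Allow: /
```

Each group starts with a `User-agent` line followed by the rules for that bot; `Disallow: /` blocks the whole site for that agent, while `Allow: /` (or an empty `Disallow:`) leaves it open.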
Why are robots.txt files important?
The robots.txt file is a plain text file located in the root directory of a website that provides instructions to web crawlers (also known as robots or spiders) about which pages or sections of the site they are allowed or disallowed to access. It uses a specific syntax defined by the Robots Exclusion Protocol, with directives like `User-agent`, `Disallow`, `Allow`, and `Sitemap` to communicate these rules. For example, a site might block crawlers from indexing internal folders or duplicate content that isn’t relevant for search engines, helping ensure that only the most important and relevant content is crawled and indexed.
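As an illustrative sketch of that kind of setup (the folder names are hypothetical), a robots.txt that shields internal and duplicate areas while advertising the sitemap might look like this:

```
User-agent: *
# Example only: internal and duplicate-content areas
Disallow: /internal/
Disallow: /staging/
Disallow: /print-versions/

# Point crawlers at the canonical sitemap
Sitemap: https://example.com/sitemap.xml
```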
This file plays a crucial role in search engine optimization (SEO) and server efficiency. By managing crawler behavior, it helps prevent the server from being overloaded with unnecessary requests and helps keep sensitive or non-public content out of search engine results.
While it doesn’t guarantee that pages will be excluded from indexing (especially if those pages are linked elsewhere), it serves as the first line of defense in controlling how a website is explored and represented online.
How to edit your robots.txt file
Editing your robots.txt file involves accessing the root directory of your website, typically via your content management system or hosting provider, and updating the plain text file to control how search engine crawlers interact with your site. You can use directives like `Disallow`, `Allow`, and `User-agent` to restrict or permit crawler access to specific pages or sections.
This file must be formatted correctly and saved at yourdomain.com/robots.txt to function properly. Responsibility for managing and updating the robots.txt file typically falls to the SEO specialist, technical webmaster, or super-admin, as improper configuration can impact site visibility and indexing.
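After saving the file, it is worth confirming that it behaves as intended. One way, sketched below with Python’s standard urllib.robotparser module, is to fetch the live file and test which bots may access a given URL (the domain and path are placeholders):

```python
from urllib import robotparser

# Load and parse the live robots.txt (placeholder domain)
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches the file over HTTP and parses it

# Check whether specific crawlers may fetch a given URL
for agent in ("GPTBot", "PerplexityBot", "*"):
    allowed = rp.can_fetch(agent, "https://example.com/some-page/")
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```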
If you have any questions, feel free to reach out to the team today. We’re happy to help.