We analyzed the top 1000 websites in the world to identify which sites are already blocking GPTBot. The latest update to this study was performed in August 2024 in response to OpenAI’s July 2024 announcement of SearchGPT and additional OpenAI crawlers.
Previously, OpenAI shared details on how to block its GPTBot on August 7, 2023, which prompted this study to review the response of the top 1000 websites.
As an AI detection company, one of our concerns is that LLMs will continue to scrape content and power AI writing tools that become undetectable to AI checkers. This study analyzes how publishers are responding.
Historical updates: August 2024, September 22, 2023, and August 29, 2023.
First published: August 22, 2023.
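For each site, the check behind this study amounts to retrieving its robots.txt and asking whether GPTBot is disallowed from the root. The sketch below uses Python's standard-library `robotparser`; the helper name and the sample robots.txt text are illustrative, and the URL passed to `can_fetch` is just a placeholder root path, not an actual fetch.

```python
from urllib import robotparser


def blocks_gptbot(robots_txt: str) -> bool:
    """True if this robots.txt text fully disallows the GPTBot user agent."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    # can_fetch applies the parsed rules; no network request is made here.
    return not rp.can_fetch("GPTBot", "https://example.com/")


# A robots.txt that blocks GPTBot, in the form OpenAI documented.
sample = "User-agent: GPTBot\nDisallow: /\n"
print(blocks_gptbot(sample))                       # True
print(blocks_gptbot("User-agent: *\nAllow: /\n"))  # False
```

Running this check against each site's live robots.txt, and counting the results, yields the percentages reported below.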
After spending months cozying up to publishers and signing multiple partnerships, OpenAI announced SearchGPT on July 25, 2024.
SearchGPT is a search engine that’s meant to "give you fast and timely answers with clear and relevant sources."
Of the top 1000 website publishers in this study, 14, including NYTimes, Wired, NewYorker, and Vogue, are already blocking it from including their websites in SearchGPT results.
Along with the announcement, they released details on additional OpenAI Crawlers:
"OAI-SearchBot is used to link to and surface websites in search results in the SearchGPT prototype. It is not used to crawl content to train OpenAI’s generative AI foundation models. To help ensure your site appears in search results, we recommend allowing OAI-Searchbot in your site’s robots.txt file and allowing requests from our published IP ranges below."
Despite OpenAI’s assurance that OAI-SearchBot is "not used to crawl content to train OpenAI’s generative AI foundation models," 14 leading publishers have decided to block it, essentially excluding themselves from being linked to in SearchGPT results.
Considering the recency of OAI-SearchBot’s release, additional web publishers may begin blocking the OAI-SearchBot in the coming weeks. We’ll review and update this article to provide you with a live dashboard of the latest data.
Google has also provided more control over how its AI products use the content on your website.
On September 28, Google announced the Google-Extended token for this purpose: https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
The first two websites to block it were abc.net.au and francebleu.fr.
Differences from one month ago: websites are rushing to block GPTBot, but other crawlers continue to scrape their content.
It is not clear why some sites would block one crawler bot while allowing others.
See Updated Results Here
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
It is not clear whether the "anthropic-ai" and "Claude-Web" tokens are effective, as Anthropic has published no documentation for them.
If you have any questions about this study please contact us.
Download Results Here
Since the launch of GPTBot, 26.15% of the top 1000 websites are now blocking it.
OpenAI launched GPTBot on August 7, and shortly after, the first “Top 100” website to block it was Reuters.com.
This can be verified on Archive.org by inspecting the timestamp of Reuters’ robots.txt page: Click Here
Reuters is also the only website that appears to be attempting to block Anthropic and Claude 2.
Within the first two weeks of GPTBot’s launch, these are the biggest websites in the world that had blocked it:
26% of the top 1000 websites are blocking GPTBot, 14% are blocking CCBot, 7% are blocking ChatGPT-User, and only 0.2% are attempting to block Anthropic’s crawlers.
Since GPTBot launched on August 7, more and more sites have blocked it, and interest in blocking other LLM bots has also increased.
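The percentages above can be reproduced from a corpus of saved robots.txt files by counting how many fully disallow a given user agent. A minimal sketch, assuming the robots.txt texts have already been collected; the two-site corpus, the site names, and the helper name are illustrative, not the study's actual data:

```python
from urllib import robotparser


def blocked_share(robots_by_site: dict[str, str], agent: str) -> float:
    """Percentage of sites whose robots.txt disallows `agent` from the root."""
    blocked = 0
    for site, txt in robots_by_site.items():
        rp = robotparser.RobotFileParser()
        rp.parse(txt.splitlines())
        if not rp.can_fetch(agent, f"https://{site}/"):
            blocked += 1
    return 100 * blocked / len(robots_by_site)


# Illustrative two-site corpus: one blocker, one open site.
corpus = {
    "blocks-gptbot.example": "User-agent: GPTBot\nDisallow: /\n",
    "allows-all.example": "User-agent: *\nAllow: /\n",
}
print(blocked_share(corpus, "GPTBot"))  # 50.0
print(blocked_share(corpus, "CCBot"))   # 0.0
```

The same counting can be repeated per user agent (GPTBot, CCBot, ChatGPT-User, and so on) to produce a table like the one in this study.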
Updated Sep 20
See Most Up To Date List Here
The table below is presented as...
To block the OpenAI GPTBot from your site, add the following directives to your robots.txt file:
# Disallow GPTBot
User-agent: GPTBot
Disallow: /
For additional details see OpenAI’s documentation: https://platform.openai.com/docs/gptbot
Given that OpenAI has already consumed much of the internet through datasets like Common Crawl, why launch GPTBot?
There are several possible reasons.
Note that blocking GPTBot does not remove the knowledge that LLMs have already gained from training on web content they previously accessed.
Web crawling or scraping is a process where automated software visits websites to gather specific information from their pages. This is commonly used by search engines to index content. While useful, this practice can sometimes be contentious, especially if done without the website owner's consent.
In a 2019 case between LinkedIn and hiQ Labs, the ability to scrape publicly available websites was upheld. See the TechCrunch article or the court’s decision.
However, some of the current lawsuits against OpenAI seem to be challenging this.
While it may seem premature to compare GPTBot to giants like Google, the analogy isn't without merit. The most significant concern for websites considering blocking GPTBot is the potential missed opportunity. As ChatGPT evolves and integrates with the internet more intimately, it could serve a role similar to that of a search engine. By providing users with direct links or references from web sources, ChatGPT can direct significant traffic to those sites. If GPTBot is blocked, that site's content may not be among the recommended sources, essentially sidelining potential visitors. In essence, just as blocking Google would prevent a website from appearing in one of the world's most popular search engines, blocking GPTBot might mean missing out on a burgeoning channel of web traffic.
Hopefully, this study was helpful.
If you have any questions please don’t hesitate to contact us.
Related studies:
- Over 30% of Reviews on Capterra are AI Generated
- List of Companies Banning ChatGPT
- Up to Date List of OpenAI and ChatGPT Lawsuits
- 400% Increase in AI Generated Amazon Reviews
- OpenAI Papers and Publications List