
An analysis by Cloudflare shows that Bytespider, Amazonbot, and ClaudeBot are among the most active AI crawlers on the web.


Over the past year, Cloudflare has analyzed which AI crawlers with known user agent strings have the highest request volume. Bytedance's Bytespider crawler tops the list of most active AI web crawlers, followed by Amazonbot, ClaudeBot, and OpenAI's GPTBot.

Bytespider likely collects training data for Doubao, Bytedance's Chinese ChatGPT competitor, while Amazonbot primarily indexes content for Alexa's answers, according to Cloudflare. ClaudeBot collects training data for Anthropic's Claude models.

Bytedance and Anthropic currently seem to be crawling particularly hard for AI training data. | Image: Cloudflare

OpenAI's GPTBot, which collects training data for products such as ChatGPT, is the second most blocked AI bot and also crawls the second-largest number of websites. Bytespider tops both rankings.

Image: Cloudflare

Cloudflare's analysis also shows that many website operators are unaware of the extent of AI crawler activity. According to Cloudflare, AI bots crawled approximately 39% of the top 1 million domains using Cloudflare in June, but only 2.98% of those sites blocked or filtered the requests.

Higher-ranked sites are more likely to be targeted by AI bots and are more likely to block them, as their content is often part of their core business, and they have the resources to implement technical measures.

Image: Cloudflare

OpenAI's GPTBot ranks second in the number of websites crawled, indicating that OpenAI continues to collect a substantial amount of data despite a lower crawling frequency compared to Bytedance's and Anthropic's bots. This could be due to more efficient or selective data collection by GPTBot, as well as OpenAI's existing large dataset from previous crawling processes.

OpenAI CEO Sam Altman recently stated that the focus is now on learning more from high-quality data rather than on accumulating more of it. Altman has also said that his company has enough material to train the next generation of AI models.

GPTBot is the AI bot most often explicitly blocked via robots.txt. This is likely because OpenAI communicates this option transparently and because ChatGPT is the best-known AI platform and the most controversial one in terms of privacy and copyright.
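To make the robots.txt mechanism concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, and the example.com URL are assumptions for illustration; only the crawler names come from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site. The user agent tokens are the
# crawler names mentioned in the article; the rules themselves are made up.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if ROBOTS_TXT permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Bytespider", "ClaudeBot"):
        print(agent, is_allowed(agent, "https://example.com/article"))
```

In this sketch only the crawlers named in the file are disallowed; ClaudeBot remains allowed because no rule mentions it. Note that robots.txt is advisory: it only has an effect if the crawler chooses to honor it.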

OpenAI's AI crawlers are by far the most specifically blocked using robots.txt. | Image: Cloudflare

Cloudflare has also observed that AI bots are increasingly disguising themselves as regular browsers to gain access to content. Perplexity was recently criticized for this practice.

To do so, the crawlers change their user agent string. However, according to the company's analysis, Cloudflare's global machine learning models reliably detect such crawlers based on patterns, without the need for manual training.
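The following sketch illustrates why filtering on the user agent string alone stops working once a crawler changes it. It is not Cloudflare's detection method, which the article describes only as pattern-based machine learning. The token list, the `declared_ai_crawler` helper, and both header values are assumptions made up for this example.

```python
from typing import Optional

# Crawler names taken from the article; a real deployment would need a longer,
# maintained list.
KNOWN_AI_CRAWLER_TOKENS = ("bytespider", "amazonbot", "claudebot", "gptbot")


def declared_ai_crawler(user_agent: str) -> Optional[str]:
    """Return the matching crawler token if the header declares one, else None."""
    lowered = user_agent.lower()
    for token in KNOWN_AI_CRAWLER_TOKENS:
        if token in lowered:
            return token
    return None


# Illustrative header values, not captured from real traffic.
HONEST_UA = "Mozilla/5.0 (compatible; GPTBot/1.0)"
SPOOFED_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

if __name__ == "__main__":
    print(declared_ai_crawler(HONEST_UA))
    print(declared_ai_crawler(SPOOFED_UA))
```

A crawler that sends the second, browser-like header is not matched by any token, which is why a filter of this kind has to be combined with other signals to catch disguised bots.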

To support website operators, Cloudflare has introduced a new feature for all customers that allows all AI bots to be blocked with one click in the dashboard. The feature will be continuously expanded with new fingerprints as Cloudflare identifies more crawlers.

Cloudflare also offers a reporting tool that can be used to report AI crawlers to the company so that they can be analyzed and automatically blocked in the future.

Summary
  • Cloudflare analyzed the most active AI web crawlers on the Internet based on their request volume. Bytedance's Bytespider tops the list, followed by Amazonbot and Anthropic's ClaudeBot. Bytespider and OpenAI's GPTBot crawl the largest number of websites.
  • Few site owners are aware of the extent of AI crawler activity, and even fewer actively block it. In June, according to Cloudflare, AI bots crawled about 39 percent of the top 1 million domains, but only 2.98 percent of those sites blocked or filtered the requests.
  • AI bots are increasingly disguising themselves as regular browsers to gain access to content. Cloudflare now offers the ability to block all AI bots with a single click in the dashboard, as well as a reporting tool to report new crawlers.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.