
By Dan Moren

Excluding your website from Apple’s AI crawler

One of the more contentious announcements from Apple this week is that it trained the foundation models underlying its forthcoming Apple Intelligence features on, among other sources, content from the open web.

Obviously that raised a lot of eyebrows among those of us who publish content to the web. Whether using copyrighted material to train AI falls under fair use is a question still being hotly debated, and one that may ultimately be highly dependent on the exact circumstances. It’s also a behavior that people are justifiably uncomfortable with.

Setting aside feelings on that issue for just a moment, it’s worth looking at the mechanics behind this. Apple also said during its announcement that it’s providing a way for publishers to exclude their sites from being used to train its AI models, via a long-established system originally built for search engines: robots.txt.

Robots.txt or not

If you’re not familiar with robots.txt, it’s a text file placed at the root of a web server that gives automated web crawlers instructions about how they’re allowed to interact with your site. The system lets publishers not only block their sites from crawlers entirely, but also allow or disallow access to specific parts of a site.1
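As a quick illustration, here’s what a robots.txt with both kinds of rules might look like (the crawler name and path here are made up for the example, not anything a real crawler uses):

User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /

The first entry lets all crawlers in but keeps them out of the /private/ directory; the second blocks a single hypothetical crawler, ExampleBot, from the whole site.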

Apple’s own Applebot web crawler has existed for some time; it’s used to power search features in Spotlight, Siri, and Safari. For example, whenever you see a Top Hit in Safari or Spotlight, that information is coming from a search index created using Applebot. Podcasts are also fed by Applebot, though in those cases only via specific URLs registered with Apple Podcasts.

If you’re worried that blocking Applebot from crawling your site might keep it from showing up in traditional search results, good news: the AI training element of Applebot uses a separate user agent, Applebot-Extended, allowing you to block only that functionality without affecting your site’s appearance in Apple’s search features.

How to stop Apple from using your content in AI training

Apple provides a detailed support document about Applebot and the various directives to control how it interacts with your site.

To specifically exclude your whole site from being used for Apple’s AI training features, you can add the following to your robots.txt file:

User-agent: Applebot-Extended
Disallow: /

To test this out, I’ve added those directives to my personal site. This turned out to be slightly more involved, given that my site runs on WordPress2, which generates its robots.txt file dynamically rather than storing it on disk. Instead of editing a file directly, you have to add the following snippet of code to your theme’s functions.php file by going to the administration interface, choosing Appearance > Theme File Editor, and selecting functions.php from the sidebar. (You can also do this via a plugin like Code Snippets, which I use.)

add_filter('robots_txt', 'my_robots_commands', 99, 2); // hook into WordPress's generated robots.txt

function my_robots_commands($output, $public) {
  // Append a rule blocking Apple's AI-training crawler from the entire site.
  $output .= "\nUser-agent: Applebot-Extended\nDisallow: /\n";
  return $output;
}
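Once the snippet is in place, a quick way to confirm it’s working is to fetch the generated file and make sure the new rules show up (substituting your own domain here, naturally):

curl https://example.com/robots.txt

If everything is wired up correctly, the Applebot-Extended entry appended by the filter should appear at the end of the output.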

If you want to go beyond Apple, this same general idea works for other AI crawling tools as well. For example, to block ChatGPT from crawling your site, you would add a similarly formatted entry to the robots.txt file, swapping in “GPTBot” for “Applebot-Extended.”3
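As footnote 3 notes, blocking both crawlers just means listing a separate directive for each, so a combined robots.txt would look something like this:

User-agent: Applebot-Extended
Disallow: /

User-agent: GPTBot
Disallow: /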

Google’s situation is more complex: while the company does have a Google-Extended token that controls whether your content is used to train some of its AI tools, like Gemini (née Bard), blocking that won’t necessarily prevent your site’s content from being crawled for use in Google’s AI search features. To do that, you’d need to block Googlebot entirely, which would have the unfortunate effect of removing your site from Google’s search indexes as well.
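For reference (this is Google’s mechanism, not anything Apple documents), the Gemini opt-out uses the same directive format as the others:

User-agent: Google-Extended
Disallow: /

As Google describes it, Google-Extended isn’t a separate crawler but a token that its existing crawlers check, which is why blocking it doesn’t affect ordinary search indexing.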

Fait accompli

One challenge with these AI tools, however, is that some of the damage may already be done. Many of these AI models have, of course, already been trained, and it’s not as though you can remove training data from them by blocking these crawlers now; it’s very much closing the barn door after the horses have gotten out.

However, blocking the crawlers does mean that you can prevent your content from being used for training going forward, so any material you publish from this point on will be excluded. If you feel like that’s cold comfort, you’re not alone.

Update: An earlier version of this article misstated the user agent for ChatGPT’s crawler: it’s GPTBot, not ChatGPT-User.


  1. It’s worth noting that robots.txt has no legal or technological basis for enforcement: it’s essentially a convention that companies have agreed to abide by. 
  2. I used this method to block crawlers on the Six Colors WordPress instance.—Jason 
  3. To block both crawlers, you’d just add separate directives in the robots.txt for each of them. 

[Dan Moren is the East Coast Bureau Chief of Six Colors. You can find him on Mastodon at @dmoren@zeppelin.flights or reach him by email at dan@sixcolors.com. His latest novel, the supernatural detective story All Souls Lost, is out now.]

If you appreciate articles like this one, support us by becoming a Six Colors subscriber. Subscribers get access to an exclusive podcast, members-only stories, and a special community.

