Across the internet, there is a new wave of well-intentioned filtering, often in the form of the excellent Anubis or its kin. These filters help many websites manage costs and enforce administrator preferences. And as much as I genuinely love to see Anubis weighing my soul, I worry when I see sites applying extremely broad user-agent blocking.

As an example, Anubis1 ships with a default list of “AI Robots” that contains many questionable entries; blocking all of them only makes sense if you hold the absolute position that everything related to AI/LLMs, including their human users, should be blocked.

Some examples of “user agents” blocked by these lists that may not be appropriate to block for many (or in some cases any) sites:

Human-Directed User Agents

These user agents are operated by AI systems at the explicit direction of a human, typically someone searching the web or asking a question about recent events. The companies behind them state that the data retrieved is not used for training, only for presentation to the user.

Search Indexing User Agents

These user agents exist for indexing, akin to Googlebot. To the extent they are AI related, they typically have separate flags that enable/disable inclusion in training datasets.

  • Applebot: Search indexing bot, AI usage is gated by Applebot-Extended, see below.
  • Claude-SearchBot: Search indexing bot; block model training by targeting the ClaudeBot user agent instead.
  • DuckAssistBot: This one is an odd duck, but isn’t used for training regardless.
  • OAI-SearchBot: Search indexing bot, block model training using the GPTBot user agent.
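In other words, the training opt-out for these crawlers lives in robots.txt, not in a request blocklist. A minimal sketch of what that looks like, using the opt-out tokens named above (verify the exact token names against each vendor's current documentation before relying on this):

```
# Allow Apple's search indexer to crawl
User-agent: Applebot
Allow: /

# Opt out of use in Apple's AI training (robots.txt-only token)
User-agent: Applebot-Extended
Disallow: /

# Allow OpenAI's search indexer, but block its training crawler
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
```

This keeps your site discoverable in these search products while declining to feed model training.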

robots.txt exclusive User Agents

These user agents make no sense in a filter list, as they are only relevant as “flags” in robots.txt. No request will ever carry them, and they have no business in a request blocklist.
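To see why such entries are inert, consider a sketch of how a typical filter matches a blocklist against an incoming request's User-Agent header (the matching logic and token list here are my own illustration, not Anubis's actual implementation):

```python
# Hypothetical blocklist mixing real crawler tokens (GPTBot, ClaudeBot)
# with robots.txt-only tokens (Google-Extended, Applebot-Extended).
BLOCKLIST = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended"]

def is_blocked(user_agent_header: str) -> bool:
    """Case-insensitive substring match, the common approach in UA filters."""
    ua = user_agent_header.lower()
    return any(token.lower() in ua for token in BLOCKLIST)

# A real training crawler identifies itself in the header and is caught:
print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"))  # True

# But no HTTP request ever sends "Google-Extended" or "Applebot-Extended"
# as its User-Agent, so those two entries can never match anything.
print(is_blocked("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/128.0"))  # False
```

The robots.txt-only tokens sit in the list doing nothing; the only place they have an effect is as a `User-agent:` line in robots.txt.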

But Why Care?

If your goal is to block all AI/LLM tools, and their human users, then these blocklists are for you!

But if you have a more specific concern, such as your content being used for training, why block all of these other user agents? Doing so keeps your site from appearing in people’s searches, research, and more.

And when only Google and Bing are allowed to access your site, especially if you aren’t setting the corresponding robots.txt flags, we end up in a world where the big players grow even larger. Maybe we should encourage a bit of competition…
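For what it’s worth, the same robots.txt pattern applies to Google too. A sketch, assuming Google-Extended is the training opt-out token (it is documented by Google, but check the current docs before copying this):

```
# Stay in Google Search results
User-agent: Googlebot
Allow: /

# Opt out of use in Google's AI training (robots.txt-only token)
User-agent: Google-Extended
Disallow: /
```

So even for the big players, blocking training does not require blocking search.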


  1. I genuinely love Anubis, and don’t mean to pick on it here. I use it as an example only because of its recent popularity and freshly updated block list. There are innumerable other block lists available on the internet, and I suspect Anubis contributors drew from those. ↩︎