Thoughts on AI Crawling

A while ago I noticed a push within my community to block all AI web crawlers:

I think these posts bring up valid concerns about the control that creators have over their work, and how little information is disclosed by many of the groups gathering this information. Data is becoming extremely valuable in this industry and scrapers can also end up eating bandwidth or other resources at the expense of the creators.

I’m still working through how I feel about these for my own site, so this is a kind of thinking-out-loud post.

Different Kinds of Bots

One of the services I use has started blocking requests from specific user-agents using this list. I don’t have any huge objections with this list, these agents are all at least tangentially related to AI. Blocking these agents will definitely save the operator some bandwidth, and it will block very few direct user requests. However, there are a few distinct types of bots on the list:

Each of these levels could of course be broken down even more (and there is often a blurring of lines between them) but I think these three tend to cover most “AI scrapers” that people tend to block.

When evaluating if you would want to allow one of these scrapers to access your site, you could use something like the following rubric:

Evaluating Scrapers

Now let’s try to evaluate the different types of scrapers from above:

I think this pretty closely matches my thoughts: all requests by an AI or a human will take resources, and the traditional way to offset those resources is to serve ads or build a following/collect influence over time. When the content is used out of context it still takes some resources to serve (plus whatever it took to produce) but the creator no longer gets those traditional benefits. Depending on how much each request costs you, how much you make from ads/impressions, and how much you value making your content accessible for other people, this will change where the line is for you.

Comparing to Other Technologies

While the comparisons are not perfect, I think it’s valuable to compare these scrapers to other alternative web browsing applications:

Personally, I would categorize these three as more impactful to most web creators than any of the AI scrapers. These haven’t caused any significant disruption to me or my community, so I wouldn’t expect AI crawlers to do so either.

AI Scrapers on My Site

I looked into a few ways of blocking AI scrapers on my site/content:

Personally I don’t think any of these are great options. I think a license is a good standard to move towards, while the Cloudflare block has the best chance of succeeding at actually stopping anything. I expect that the Cloudflare blocking will add relatively few false positives, but I would like to increase accessibility, not decrease it. For now I am keeping my content copyrighted as CC BY-NC-SA, though I would like to move towards CC BY or CC0. I value my writing but I also think that more open copyright licenses are generally preferable and have appreciated it when others have freer licenses.