Notifications
I don’t maintain a list of blog subscribers or send newsletters, so if you’d like to be notified when I post articles, there are three options:
- Join My Discord Server
This is my main hangout. I post daily content relevant to artists there – articles (not just my own), inspiration, art-related memes, and art resources – and I also chat socially with those on the Discord.
- Follow me on Substack
I post summaries of my blog posts on Substack, and I also restack articles I find interesting.
- Follow my Flipboard Magazine
I have a magazine called RoxyLovesArt where I collate articles I find interesting. These are typically art-related and anti-AI.
Tools to Protect Your Art from AI
“Do Not Train” robots.txt, Image Metadata, and Other Art Protection
I recently watched the Senate Judiciary Committee hearing on AI and copyright, where Ben Brooks (Stability AI) said that they “respect robots.txt wishes”, implying that you can prevent your work/content from being scraped by setting a ‘donottrain’ directive in your website’s robots.txt file.
If you’re interested in watching the hearing, you can do so here, but basically Ben Brooks was responding to being questioned about the problem of artists not wanting their work scraped by bots for datasets that are used for training generative AI.
As a web designer of two decades, I thought I’d take a minute to say that there is no way, via robots.txt, to tell robots that you don’t want your work used for training purposes. Arguably, you could do this:
User-agent: *
Disallow: /
…but what that code does is tell all bots (not just data scrapers) that you don’t want them accessing any data on your site. You very likely don’t want to do that, though…
Why banning all bots is not a viable solution…
If you want people to be able to find your site on search engines like Google, Bing and Duck Duck Go, to name a few, then you definitely don’t want to disallow all bots. There are a lot of benign bots, like webcrawlers, that need to be able to scan your website’s content in order to know what it’s about. If they can’t, your site won’t be indexed by search engines, and it won’t be recommended or appear in search results when people look for your content.
Banning all bots except specific ones
If you wanted to ban all bots, except for those known as search engine webcrawlers, you could do something like this:
User-agent: *
Disallow: /

User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: YandexBot
Allow: /

So in the code above, first we ban all bots, and then we make exceptions. (Note that robots.txt doesn’t accept comma-separated lists of bots: each bot gets its own User-agent line, and several User-agent lines can share one rule group.) In this case we are specifically allowing Google, Bing, Duck Duck Go and Yandex’s webcrawlers. There are many others we could add, but I’m just naming some big ones as an example.
There are two problems with this method:
- Some webcrawlers may have a dual purpose. Googlebot and Bingbot are very likely also being used to scrape data to train their companies’ AI systems (Bard and Bing Chat, respectively).
- Besides webcrawlers, there are other benign bots that provide security or functionality to your website, so banning all bots can have unforeseen disadvantages.
Banning specific bots
We could go the other route of allowing all bots except specific ones. If we knew the exact names of all the bots that are specifically scraping data to train AI, then we could write the code like this:
User-agent: BotName
User-agent: AnotherBotName
Disallow: /
The problem is of course that companies like StabilityAI have not publicly disclosed the names of the bots that scrape data for their services.
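That said, a few crawlers tied to AI training have since published the user-agent tokens they claim to honour. Assuming the publicly documented tokens GPTBot (OpenAI), CCBot (Common Crawl, whose scrapes the LAION dataset was built from), Google-Extended (Google’s AI-training control) and anthropic-ai (Anthropic), a partial blocklist might look like this, though it only works against bots that identify themselves honestly:

User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: anthropic-ai
Disallow: /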
So ultimately, we can see that the claims by Ben Brooks (Stability AI’s head of public policy) about them “respecting robots.txt” are completely vacuous. There is nothing in what he said that we can put into action in any meaningful way.
Can you protect your images via metadata?
Even if there were a protocol for specifically denying bots from scraping your content via robots.txt, it would only hold true on your own site. As soon as you post to social media or an external portfolio site, your data is out of your hands. This is where you might hear the AI bros claiming to respect “Do Not Train” directives in image metadata. So how would we do that?
On Windows, for example:
- Right-click on your JPG image
- Click ‘Properties’ from the context menu
- Click the ‘Details’ tab
- In the ‘Tags’ field, add some tags
- Click OK
The AI companies haven’t been very forthcoming about exactly what the tags should be, but I assume donottrain and noai would be safe bets.
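If you’d rather script this than click through dialogs, a tool like ExifTool can write the same kind of tags in bulk. Here’s a minimal sketch, assuming you have ExifTool installed; the file and folder names are just placeholders, and the tag values (noai, donottrain) are my guesses rather than any published standard:

# add 'noai' and 'donottrain' keywords to one image's IPTC metadata
exiftool -IPTC:Keywords+="noai" -IPTC:Keywords+="donottrain" artwork.jpg

# or tag every JPG in a folder at once
exiftool -IPTC:Keywords+="noai" -IPTC:Keywords+="donottrain" -ext jpg ./my-portfolio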
So you can do that, but just be aware that when you upload to external sites, those sites often re-save your image internally and wipe your metadata. So again, the AI guys singing songs about how responsible they are for respecting metadata is just as hollow and useless.
What can we do to protect our art?
Kudurru.AI
A very promising option right now is kudurru.ai, which is currently in its infancy. I don’t know very much about it yet, as I’ve only just applied for beta access, but it’s a plugin for your site that actively blocks AI scrapers.
AI Scraper Tarpits
I’m editing this article to add this section on AI scraper tarpits like Nepenthes, Iocaine, Quixotic, and whatever others spring up from them. I first became aware of these from an Ars Technica article explaining how webmasters, tired of paying high bandwidth penalties because AI scrapers ignore robots.txt and mass-scrape their sites, came up with tools to catch the scrapers in an endless loop of gibberish. This potentially poisons the dataset, but at the very least it wastes their resources. The basic idea is sketched below.
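None of these projects work exactly alike, but here is a tiny Python sketch of the core trick: a server that answers every request with a slow drip of nonsense and links that lead only deeper into the maze. This is just the concept, not how Nepenthes or the others are actually implemented:

# Concept-only tarpit: every URL "exists" and serves endless, slow gibberish,
# so a scraper that ignores robots.txt wastes time and bandwidth here.
import random
import string
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

def babble(words=10):
    # build pseudo-words of random length
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 9)))
        for _ in range(words)
    )

class Tarpit(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        try:
            self.wfile.write(b"<html><body>")
            for _ in range(100):
                # drip-feed gibberish plus a link deeper into the maze
                chunk = f"<p>{babble()} <a href='/{babble(1)}'>{babble(2)}</a></p>"
                self.wfile.write(chunk.encode())
                self.wfile.flush()
                time.sleep(2)  # slow responses tie up the scraper
            self.wfile.write(b"</body></html>")
        except BrokenPipeError:
            pass  # the scraper gave up early; mission accomplished

HTTPServer(("", 8080), Tarpit).serve_forever()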
Glaze (Anti AI) – Protect your Art Style
Another option is Glaze, a free tool that overlays a distortion on your images to fool AI into seeing them as something different. If you’re interested, I have a Glaze Review with comparisons of art before and after glazing:

Nightshade (Anti AI) – Poison the Dataset
The same team that created the defensive Glaze tool also created an offensive tool called Nightshade, which adds data to your image that will poison any dataset it gets scraped into. I also have a Nightshade Review with before-and-afters:

Have I Been Trained Tool
HaveIBeenTrained.com is a tool you can search to see whether your work has been used for training, and then mark what you find as ‘Do Not Train’. This apparently signals scrapers to stay away.
Other Tools in the Fight Against AI
Other than that, you can support organizations that are actively fighting for our rights. I’ve made an extensive video about the dangers of generative art, along with links to all the relevant NGOs, legal precedents, etc., which you can see here.