AI-Shunning robots.txt

44 points by glynnormington 2 years ago | 90 comments

nerdjon 2 years ago |

I am curious, do we have any evidence that AI is adhering to robots.txt and isn’t ignoring it since they are not technically crawling in the traditional sense?

Even if they are right now, it would be a quick switch for them to just ignore it.

omoikane 2 years ago | |

I have examples in my logs of GPTBot fetching only /robots.txt, and nothing from the same /24 block fetched anything else after that, so it seems at least that bot respects robots.txt.
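A log check like the one described above could be sketched roughly as follows. This assumes the server writes the common "combined" access log format. The sample lines, IP addresses, and User-Agent strings are made up for illustration, and the function names are hypothetical. It groups by exact IP rather than by /24 block, and it trusts the User-Agent header, so a crawler that does not identify itself would go unnoticed.

```python
import re
from collections import defaultdict

# Combined Log Format (assumed):
# ip - - [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Made-up sample entries; a real check would read the server's access log.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2024:00:00:01 +0000] "GET /robots.txt HTTP/1.1" '
    '200 120 "-" "Mozilla/5.0; compatible; GPTBot/1.0"',
    '198.51.100.7 - - [01/Jan/2024:00:00:05 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]


def paths_by_bot(lines, bot_token="GPTBot"):
    """Map each client IP whose User-Agent contains bot_token to the paths it requested."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match and bot_token.lower() in match["agent"].lower():
            seen[match["ip"]].add(match["path"])
    return dict(seen)


def fetched_only_robots(lines, bot_token="GPTBot"):
    """True if the bot appears in the log and requested nothing except /robots.txt."""
    requested = set()
    for paths in paths_by_bot(lines, bot_token).values():
        requested |= paths
    return requested == {"/robots.txt"}


if __name__ == "__main__":
    print(fetched_only_robots(SAMPLE_LOG))
```

Replacing `SAMPLE_LOG` with the lines of a real access log would answer the narrower question of whether that user agent ever requested anything beyond /robots.txt.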

Maybe your question is "how do we know whether whatever system GPTBot feeds downstream didn't just get your content via something else that crawls your site?" I am not sure we have anything to defend against that, other than signalling via robots.txt that our content is not intended for AI use.

mrkramer 2 years ago | |

Internet Archive's crawler is not respecting robots.txt because they want to archive everything, not just parts of the Web. But if you are actively breaking robots.txt then your crawler will have a bad reputation and you will have an army of webmasters trying to block your crawler by any means. You can see crawling requests in your server logs; that's how you know if they are respecting it or not.

Imo, the best solution would be to license your content so crawlers pay a fee for crawling and using your content.

nerdjon 2 years ago | | |

Well TIL that IA does not respect robots.txt.

Does IA themselves block crawlers? It doesn't look like it according to their robots.txt, even going so far as to say "Please crawl our files."

What would stop an actor from maliciously complying with a robots.txt file by just going to the Internet Archive instead?

andybak 2 years ago | |

This is about crawling for training data by the look of things. Not sure if the ChatGPT browsing mode uses a different user-agent, but most of the entries in that list look like crawlers.

nerdjon 2 years ago | | |

I had assumed this is related to sites like ChatGPT going out and searching with a specific request.

Regardless, my original question is still valid. The companies have already shown a lack of care about the data they train on. So if ethics have already gone out the window, what is to stop them from ignoring this file, if they are not already?

vouaobrasil 2 years ago |

Nice. Let's all contribute to this... Ideally, web hosts should provide this sort of thing by default so we can starve AI companies of training data and combine it with other strategies to put them out of business for good.

andybak 2 years ago | |

How about AI from non-companies? Or genuinely non-profit or open projects?

Also - out of curiosity - do you use any AI yourself?

vouaobrasil 2 years ago | | |

> How about AI from non-companies? Or genuinely non-profit or open projects?

AI from any project will allow AI to be used commercially, and thus I oppose it. Moreover, I oppose AI on various other principles even independent of this: it further isolates people and can be used to develop other technologies that are too powerful for us to handle. In short, I believe human beings en masse are too stupid to use AI.

> Also - out of curiosity - do you use any AI yourself?

I do not, or at least I try my best not to. In fact, I hate AI with a passion. Obviously, there may be products here and there that have used AI that I in turn use. What can you do? But I attempt to minimize any contact I have with AI: I don't use Grammarly or any form of auto-suggest, I use an ancient phone (and I RARELY use it, I hate smartphones), I don't use AI features in software such as AI noise reduction, and I turn off all automatic features in software that may have some AI behind it.

If I find out a website uses AI for content generation, I ban it and never visit again.

The other day I downloaded a text editor that looked cool but I deleted it because I realized it has an AI-console (even though I never used it).

I also work for a business and I convinced them not to use AI. We're an online magazine and it turns out the vast majority of our readers supported that decision.

In short, I am against AI because I believe it provides virtually no benefits to humanity, only detriments.

jddj 2 years ago |

The named source, https://darkvisitors.com, is interesting.

gavinhking 2 years ago | |

I made this, let me know if you have questions or feedback.

tbeseda 2 years ago | | |

Thanks for the work on this!

I automated my site's robots.txt[0] by scraping your site. It would be extra nice if darkvisitors.com exposed a plain text version or JSON representation of the list.

[0] https://tbeseda.com/blog/automating-my-robots-txt-to-block-a...
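For anyone wanting to try something similar, here is a minimal sketch of generating a robots.txt from a list of agent names. It is not the approach from the linked post. The list is hard-coded rather than scraped (GPTBot and FriendlyCrawler are simply names mentioned in this thread), and `build_robots_txt` is a hypothetical helper name.

```python
# Hypothetical list for illustration; GPTBot and FriendlyCrawler are names
# mentioned in this thread. A real list would come from a maintained source.
AI_USER_AGENTS = ["GPTBot", "FriendlyCrawler"]


def build_robots_txt(user_agents, disallow="/"):
    """Return robots.txt text with one group that disallows `disallow` for every listed agent."""
    # Several User-agent lines followed by shared rules form a single group.
    lines = [f"User-agent: {agent}" for agent in user_agents]
    lines.append(f"Disallow: {disallow}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(AI_USER_AGENTS), end="")
```

The output can be written to the site's robots.txt as part of a build step. Agents not in the list are unaffected because the file contains no wildcard group.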

cabirum 2 years ago |

The crawlers can simply stop identifying themselves via custom user agent, can't they?

Also, why are "AI" crawlers worse than "normal" crawlers?

Either way, this is an exercise in futility.

vouaobrasil 2 years ago | |

> Either way, this is an exercise in futility.

Is it really? Every drop of opposition towards AI in my book is a good thing. This robots.txt thing is a small drop maybe, but over time public hatred for AI can build and it might in fact be taken down. Especially outside the tech bubble, many people are ambivalent towards AI.

Yes, in modern society we are taught to value innovation and ignore its downsides, but the more vocal its opponents are, the more those downsides will become apparent. Hopefully, it will bring the ruin of all AI companies and research.

cabirum 2 years ago | | |

I'm kind of out of the loop as to why we need to hate on AI. The bubble will burst given time, like all the other bubbles before.

What's needed is indifference, not hate.

Wissenschafter 2 years ago | | |

Only on Hacker News do you get the most ironic of takes. Someone who is supposedly educated and technologically literate to a high degree thinks opposition to AI is a good thing.

Crazy world.

karaterobot 2 years ago | |

> Also, why are "AI" crawlers worse than "normal" crawlers?

A search engine will index your content to bring people to it through search. An AI crawler will take your content to recapitulate it and sell it to others. Obviously it's more complicated than this, but this is how someone who wishes to use this file might see it.

> Either way, this is an exercise in futility.

Not necessarily disqualifying. Laws against theft are also futile, in the sense that honest people don't need them and dishonest people don't follow them, and history since at least Hammurabi has been replete with examples of such laws not stopping theft. And yet. Seems worth the calories it costs to say "for the record, I do not give my consent for what you're doing".

cabirum 2 years ago | | |

Search engines are not the beacons of holiness - they sell ads, they sell data on who searched what, they manipulate results.

Search engines and AI things are typically owned by the same company. AIs are fed with the data collected by a search engine. The only difference is whether AI gets the data in realtime or waits for the search engine to collect another data dump.

Fighting windmills as I see it.

belter 2 years ago |

Or redirect them to poisoned material?

vouaobrasil 2 years ago | |

That is a good idea. Maybe redirect them to massive datasets to cause the company mass embarrassment. There are already some image-modifying programs that generate poison images, and the bots could be redirected to such images...

internetter 2 years ago |

This is missing a couple. One that comes to mind is `FriendlyCrawler`, which is most definitely not friendly and very likely for AI.

glynnormington 2 years ago | |

Feel free to submit a PR. :-)

andybak 2 years ago |

As someone who uses and benefits from the results of AI crawlers, I would only want to block crawls under very specific circumstances.

I would back a general move to block crawlers from non-open models (whatever that means and if such a thing was practical) as it might be a strong lever to encourage good behaviour.

rocky_raccoon 2 years ago |

Not that I'm arguing for or against preventing access from AI crawlers, but wouldn't it make more sense to block them at a higher level, e.g. the webserver, and not even give them the choice to obey/disobey robots.txt?

rideontime 2 years ago | |

How would you propose doing so?

rocky_raccoon 2 years ago | | |

Off the top of my head:

- Cloudflare

- Webserver-level user-agent blocking (Apache, nginx)

- Application-level user-agent blocking (`if request.user_agent == 'OpenAI'`)

None of them are ideal since you can simply change your user agent, but all of them seem like better options than robots.txt to me.
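As a rough illustration of the application-level option, here is a minimal WSGI middleware sketch. It uses a substring match rather than the exact comparison above, since real User-Agent headers usually carry extra text around the bot name. The token list, `block_user_agents`, and `hello_app` are assumptions made for the example, and like the other options it does nothing against a crawler that changes its user agent.

```python
# Assumed token list for illustration; GPTBot is the bot named in this thread.
BLOCKED_TOKENS = ("GPTBot",)


def block_user_agents(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app so requests whose User-Agent contains a blocked token get a 403."""
    lowered = tuple(token.lower() for token in blocked_tokens)

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in lowered):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Not a blocked agent: hand the request to the wrapped application.
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """Placeholder application standing in for the real site."""
    body = b"Hello\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


application = block_user_agents(hello_app)
```

Unlike robots.txt, this refuses the request outright, so the crawler never gets the choice to obey or disobey.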

adrianN 2 years ago | | |

We could repurpose the evil bit.

gtirloni 2 years ago | | |

Web servers can check the user-agent and block the request.

E.g. nginx $http_user_agent

CalRobert 2 years ago |

Given how intertwined AI and search engines are, it's hard to see how this helps aside from _maybe_ making things easier for Google, Microsoft, etc., unless you also don't want to be indexed by search engines.

bakugo 2 years ago |

This makes complete sense because, as we all know, AI companies are very concerned with respecting the rights of the people they steal data from, and totally won't just ignore this.

frizlab 2 years ago | |

At least you show intent and can then potentially prove they are not respecting your wishes. It’s better than doing nothing.

natch 2 years ago |

We need AIs to know more, not less. If many people block AIs from reading their sites, AIs will just be stuffed with biased information from people pushing agendas.

nerdjon 2 years ago | |

So the value of them will plummet? That sounds like a win for society.

natch 2 years ago | | |

Why would the value of AIs plummet if they know more?

Or did you mean sites? Information wants to be free.

If AI is trained only on data provided by those with agendas, you won’t want to live in that world.