Facebook's robots.txt (facebook.com)
Uh oh... Something didn't work. > http://disqus.com/human.txt
I tried to curl it and got the exact same content, with no 302 redirect towards a "<script>window.close</script>" page. Did you get anything?
Not that you should do that. robots.txt is a nicety, though: the client doesn't have to respect it, and the server doesn't have to allow your HTTP requests.
robots.txt is basically a list of rules that lay out "This is how we'd like you to crawl us. We might stop serving you if you don't comply", rather than a hard-and-fast set of directives that specify how a webcrawler will be guaranteed to behave.
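To make the "voluntary" part concrete, here is a minimal sketch of how a well-behaved crawler might consult the rules before fetching, using Python's standard `urllib.robotparser`. The rules in `EXAMPLE_ROBOTS_TXT` and the `example.com` URLs are made up for illustration and are not Facebook's actual file. Nothing here stops a client from skipping the check entirely.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (not Facebook's real robots.txt).
EXAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# A polite crawler asks before every fetch; an impolite one simply doesn't.
print(parser.can_fetch("Googlebot", "https://example.com/public/page"))     # True
print(parser.can_fetch("Googlebot", "https://example.com/private/page"))    # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/public/page"))  # False
```

The check runs entirely on the client side, which is why the server still needs its own enforcement if it wants the rules to be more than a request.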
User-agent is too easily spoofed, but we could check whether the robots really are Google (whitelisted) and not some other crawler that just wants to scrape your content.
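One way to do that check, as far as I know, is a reverse DNS lookup on the requesting IP followed by a forward lookup to confirm it, which is the approach Google documents for Googlebot. Below is a rough sketch under some assumptions: the function name `is_verified_crawler` and the injectable `reverse`/`forward` parameters are my own (they are there so the logic can be exercised without network access), and the suffix list covers only the `googlebot.com` and `google.com` domains.

```python
import socket
from typing import Callable, List, Tuple

# Assumption: hostname suffixes we are willing to trust for this crawler.
ALLOWED_SUFFIXES: Tuple[str, ...] = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Return the primary hostname for an IP (PTR lookup)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> List[str]:
    """Return the IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
    allowed_suffixes: Tuple[str, ...] = ALLOWED_SUFFIXES,
) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm forward."""
    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(allowed_suffixes):
            return False
        # The PTR record alone can be faked by whoever controls the IP's
        # reverse zone, so the hostname must resolve back to the same IP.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

In a real deployment you would call `is_verified_crawler(request_ip)` only for requests whose User-agent claims to be the crawler, and probably cache the result, since two DNS lookups per request is expensive.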
In the realm of mail servers we have something called SPF: http://en.wikipedia.org/wiki/Sender_Policy_Framework
Just thinking outside the box here, but other than checking IP ranges: maybe the crawler could send a hash as a header in its GET request to verify that it is who it says it is.
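A plain hash on its own could be copied by anyone who sees it, so a sketch of this idea probably needs a keyed hash (HMAC) over the request plus a timestamp. Everything below is an assumption for illustration: the crawler and the site would have to agree on a shared secret in advance, and the header names `X-Crawler-Timestamp` and `X-Crawler-Signature` are hypothetical, not part of any existing standard.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign_request(secret: bytes, method: str, path: str, timestamp: int) -> str:
    """Crawler side: compute the value for a hypothetical X-Crawler-Signature header."""
    message = f"{method}\n{path}\n{timestamp}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(
    secret: bytes,
    method: str,
    path: str,
    timestamp: int,
    signature: str,
    now: Optional[int] = None,
    max_skew: int = 300,
) -> bool:
    """Server side: recompute the signature and reject stale timestamps."""
    current = int(time.time()) if now is None else now
    if abs(current - timestamp) > max_skew:
        return False  # too old or too far in the future: likely a replay
    expected = sign_request(secret, method, path, timestamp)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, signature)


# Usage: the crawler would send the timestamp (X-Crawler-Timestamp) and the
# signature (X-Crawler-Signature) alongside its GET request.
secret = b"shared-secret-agreed-out-of-band"  # hypothetical pre-shared key
ts = int(time.time())
sig = sign_request(secret, "GET", "/robots.txt", ts)
print(verify_request(secret, "GET", "/robots.txt", ts, sig))
```

The obvious drawback is key distribution: every site would need a secret per crawler, which is why the SPF-style approach of publishing verifiable data in DNS may scale better.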