Robots.txt Disallow: 20 Years of Mistakes To Avoid (beussery.com)

106 points by hornokplease 12 years ago | 60 comments

Asparagirl 12 years ago |

This article forgot the very worst use of robots.txt:

  User-agent: ia_archiver
  Disallow: /

Those two lines block everything hosted on the site from the Internet Archive (archive.org) Wayback Machine, so the public cannot look at any previous versions of the website's content. It wipes out a public view of the past.
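To see what those two lines do to a compliant crawler, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The page path and the second bot name are made up for illustration, and this only shows how one parser reads the rules, not how the Internet Archive's own crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted above.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page path, used only for illustration.
page = "/2014/some-article.html"

# The named agent is shut out of every path on the site.
archiver_allowed = parser.can_fetch("ia_archiver", page)

# Any agent not named in the file (hypothetical name here) is unaffected.
other_allowed = parser.can_fetch("SomeOtherBot", page)

print("ia_archiver allowed:", archiver_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

The rule only binds crawlers that choose to check it, which is the point the replies below argue about.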

Yeah, I'm looking at you, Washington Post: http://www.washingtonpost.com/robots.txt

Banning access to history like that is shameful.

bufordsharkley 12 years ago | |

The thing that really frustrates me about the Internet Archive's treatment of robots.txt: if a domain expires and the domain provider changes the robots.txt to something restrictive, the Wayback Machine will completely clear the history of the site. Even though it's very clearly not the same agent at play-- this is not the creator of the site's content. I've seen it happen, and it breaks my heart every time.

pestaa 12 years ago | | |

Why wouldn't it consider the archived state of robots.txt?

blueskin_ 12 years ago | | |

One of the reasons I like archive.today. Obviously, they lack the depth of history, but they don't censor so easily.

adventured 12 years ago | |

Quora also disallows the ia_archiver agent.

http://www.quora.com/robots.txt

Here is their explanation (in the robots.txt file)

"We opt out of the wayback machine because inclusion would allow people to discover the identity of authors who had written sensitive answers publicly and later had made them anonymous, and because it would prevent authors from being able to remove their content from the internet if they change their mind about publishing it. As far as we can tell, there is no way for sites to selectively programmatically remove content from the archive and so this is the only way for us to protect writers. If they open up an API where we can remove content from the archive when authors remove it from Quora, but leave the rest of the content archived, we would be happy to opt back in. See the page here: https://archive.org/about/exclude.php"

"Meanwhile, if you are looking for an older version of any content on Quora, we have full edit history tracked and accessible in product (with the exception of content that has been removed by the author). You can generally access this by clicking on timestamps, or by appending "/log" to the URL of any content page."

krapp 12 years ago | |

The Internet Archive choosing to honor robots.txt is what's 'banning' the access. Both the request not to be crawled and the decision not to crawl are voluntary, but if the Internet Archive decided it wanted to slurp up the Washington Post tomorrow, there's not much they could do to stop it.

Asparagirl 12 years ago | | |

Yes, I know, I'm a member of Archive Team, and I use "wget -e robots=off --mirror …" quite a bit, and then I upload those WARCs to the IA. But major content providers like the Washington Post that explicitly choose to block their entire website and its history should be named and shamed.

Authors don't get the right to go around removing their novels from public libraries just because they would rather the books be available only for pay in bookstores.

DanBC 12 years ago | | |

They have explicitly denied permission to have their content slurped.

Why do you think it is legal to then go ahead and slurp it?

pekk 12 years ago | |

No, you do NOT have the right to pound my site with requests and serve data that I decided to pull down.

icebraining 12 years ago | | |

Nobody said they do; nobody said the Internet Archive shouldn't respect robots.txt.

We do, however, have the right to criticize people who ban IA from their site.

moe 12 years ago | | |

I may not have a right to "pound" your site, but I certainly have a right to keep whatever I find on your public webserver, regardless of whether you decide to pull it down later.

teacup50 12 years ago | | |

Do you also want to steal into libraries in the night and set fire to their microfiche collection?

angersock 12 years ago | | |

"serve data that I decided to pull down."

If it's on their bandwidth and power, why not?

TheLoneWolfling 12 years ago |

What frustrates me is the number of websites that impose additional restrictions on anything they don't recognize, or worse, websites that impose additional restrictions on (or worse yet, just outright ban) anything that isn't Googlebot.

And people wonder why alternative search engines have such a hard time taking off.

dredmorbius 12 years ago | |

I can give you a really simple operational reason for that: complexity.

Google is somewhere between 50% and 90% of most sites' search referrals (source: /dev/ass). Add in a handful of other search engines (Bing, DDG, Yahoo, Ask) and you've pretty much got all of it.

They're maybe 10-20% of your crawl traffic though. And possibly a lot less than that.

There are a TON of bots out there. If you're lucky, they just fill your logs and hammer your bandwidth.

If you're not so lucky, they break your site search, overload your servers, and if you're particularly unlucky, they wake you up with 2:30 am pages for two weeks straight.

At which point the simplest way to solve the technical problem, that is, you getting a full night's sleep, is to ban every last fucking bot but Google. Or maybe a handful of the majors.

Now, of course, you're a data-driven operation and you're relying on Google Analytics to tell you who's sending traffic your way. But if you block a search crawler, it's going to stop sending you traffic, so you won't know it's important.

It's a rather similar set of logic that drives people to set email bans on entire ccTLDs or ASN blocks for foreign countries. And if you're a smallish site, it's probably a decent heuristic. And no, it's not just fucking n00bs who do this. Lauren Weinstein, who pretty much personally birthed ARPANET at UCLA, was bitching on G+ just a week or so back that the new set of unlimited TLDs ICANN was selling were rapidly going into his mailserver blocklists. Because, of course, the early adopters of such TLDs tend to be spammers, or at least, the early adopters he's likely to hear from.

https://plus.google.com/114753028665775786510/posts/SsgPNHLG...

dredge 12 years ago |

The article contains some good observations, but I'm struggling to understand this one:

"Some sites try to communicate with Google through comments in robots.txt"

In the examples given, none appear to be trying to "communicate with Google through comments" - how is including...

  # What's all this then? 
  #   \
  # 
  #    -----
  #   | . . |
  #    -----
  #  \--|-|--/
  #     | |
  #  |-------|

...a "mistake" to avoid? There's no harm in it at all.

Istof 12 years ago | |

"Some sites try to communicate with Google through comments in robots.txt"

I thought that was the whole point of robots.txt

lmm 12 years ago | | |

No, the point is to communicate with Google through non-comments in robots.txt.

SoftwareMaven 12 years ago | |

I don't think those examples were of people trying to communicate with a crawler. I think they were examples of comments that the owners knew would be thrown away by crawlers.

freddielarge 12 years ago |

fun fact: robots.txt can also be used by attackers to find admin interfaces or other sensitive tidbits that you don't want search engines to crawl

lots of target-detection crawlers will look at robots.txt as the first thing they do, to see if there are any fun pages you don't want the other crawlers to see
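A quick way to audit what your own robots.txt gives away is to list every path it disallows. The sketch below is a hand-rolled line parser, not a full implementation of the protocol, and the sample file with its `/admin/` and `/backups/` paths is hypothetical.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow value, in file order."""
    paths = []
    for raw in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        # An empty Disallow means "allow everything", so it reveals nothing.
        if field.strip().lower() == "disallow" and value:
            paths.append(value)
    return paths


# Hypothetical robots.txt for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /admin/        # hypothetical admin interface
Disallow: /backups/
Disallow:
"""

found = disallowed_paths(SAMPLE)
print(found)
```

Anything this prints is readable by anyone who requests the file, so it should not be the only thing protecting those paths.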

snowwrestler 12 years ago | |

If you want to hide admin pages, add the robots meta tag to each one and set noindex, nofollow. Then you don't need to list them all in one place in robots.txt.

That said, obscurity is not really security. Your admin pages should be behind a password, which, if coded properly, will exclude spiders, bots, and bad guys.
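For reference, the robots meta tag lives in the page's own HTML rather than in a central file. The following sketch uses Python's standard `html.parser` to check that a page carries it; the admin page markup and the `RobotsMetaFinder` class are invented for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, nofollow".
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


# Hypothetical admin page, for illustration only.
ADMIN_PAGE = """<html><head>
<meta name="robots" content="noindex, nofollow">
<title>Admin</title></head><body>...</body></html>"""

finder = RobotsMetaFinder()
finder.feed(ADMIN_PAGE)
print(sorted(finder.directives))
```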

spaulo12 12 years ago |

In the past I've created an empty robots.txt just to keep the 404 errors out of my logs...

sp332 12 years ago |

Why does Google ignore the crawl delay?

sbierwagen 12 years ago | |

Google has millions of spiders, in datacenters all over the world. Maybe respecting crawl delay added more shared-state overhead than they wanted.

jhwhite 12 years ago | |

I haven't read the entire article, but we were discussing this at work a few weeks ago. You can set the crawl delay in Google Webmaster Tools, but they only adhere to that setting for 90 days, then they go back to their default.
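For crawlers that do honor it, the crawl delay is just a per-group number in robots.txt. Here is a minimal sketch of reading it with Python's `urllib.robotparser`; the bot name, the 10 second value, and the `earliest_fetch_offsets` helper are assumptions for illustration and say nothing about how Google schedules its crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a crawl delay for all agents.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns None when no delay applies, so fall back to 0.
delay = parser.crawl_delay("ExampleBot") or 0


def earliest_fetch_offsets(n_requests, delay_seconds):
    """Earliest time, in seconds from now, at which each request may be sent."""
    return [i * delay_seconds for i in range(n_requests)]


offsets = earliest_fetch_offsets(4, delay)
print(delay, offsets)
```

A crawler spread over many machines would have to share that schedule between all of them to honor the delay, which is the overhead sbierwagen is guessing at.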

pipihu 12 years ago |

The main use for robots.txt is to prevent crawling of infinite URL spaces: http://googlewebmastercentral.blogspot.com.br/2008/08/to-inf...

Alongside tagging links to such resources with nofollow.

ashmud 12 years ago | |

Back in the day, I would use httrack for offline web browsing, and these were a constant irritation.

sbierwagen 12 years ago |

My server returns 410 GONE to robots.txt requests.

The robots exclusion protocol is a ridiculous anachronism. I don't use it and neither should you.
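For anyone curious what that looks like, here is a minimal sketch with Python's built-in `http.server`. It is not the actual server configuration described above; the handler, the helper name `status_for`, and the placeholder response body are all invented for illustration.

```python
import http.client
import threading
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/robots.txt":
            # Tell crawlers the file is permanently gone.
            self.send_error(HTTPStatus.GONE)
            return
        body = b"hello\n"  # placeholder content for every other path
        self.send_response(HTTPStatus.OK)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def status_for(path):
    """Start a throwaway local server, request `path`, return the status code."""
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", path)
        status = conn.getresponse().status
        conn.close()
        return status
    finally:
        server.shutdown()
        server.server_close()


print(status_for("/robots.txt"), status_for("/"))
```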

ars 12 years ago | |

And what do you do about sites with an infinite number of pages?

sbierwagen 12 years ago | | |

By not writing bad software. State shouldn't be stored in URLs, it should be stored in cookies.

Spiders have to be robust against sites with unlimited numbers of internal links anyway, or else an attacker could trap a web spider with a malicious site, or a 13 year old writing a buggy PHP app could take down Google's entire spidering system.
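One common way a spider stays robust here is a per-site budget on pages and link depth. The sketch below is a toy breadth-first crawler run against a made-up link generator that never runs out of pages; `fake_links`, the limits, and the start URL are assumptions for illustration, not a description of any real search engine.

```python
from collections import deque


def fake_links(url):
    """Hypothetical buggy site: every page links to two brand-new pages."""
    return [url + "a/", url + "b/"]


def crawl(start, get_links, max_pages=50, max_depth=10):
    """Breadth-first crawl that stops at a page budget or a depth limit."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # visit the page, but do not follow its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


pages = crawl("http://example.com/", fake_links)
shallow = crawl("http://example.com/", fake_links, max_pages=1000, max_depth=3)
print(len(pages), len(shallow))
```

Either limit alone ends the crawl, so an endless URL space costs the spider a bounded amount of work rather than trapping it.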

franze 12 years ago |

yeah, robots.txt is a horrible standard. trust me, i wrote https://www.npmjs.org/package/robotstxt just so that i can really understand what is going on. it's based on https://developers.google.com/webmasters/control-crawl-index...

the article is pretty much correct (although strangely worded at some times), the stuff about "communicating via robots.txt comments to google" is of course not true. the examples he gives are developer jokes, nothing more.

still, you should not use comments in the robots.txt, why?

you can group user agents i.e.:

    User-agent: Googlebot
    User-agent: bingbot
    User-Agent: Yandex
    Disallow: /

Congrats, you have just disallowed googlebot, bingbot and yandex from crawling (not indexing, just crawling)

ok, now:

    User-agent: Googlebot
    #User-agent: bingbot
    User-Agent: Yandex
    Disallow: /

so well, you have definitely blocked yandex, you do not care for bingbot (commented out), but what about googlebot? are googlebot and yandex part of one user-agent group? or is googlebot its own group and yandex its own group? if the commented line is interpreted as a blank line, then googlebot and yandex are different groups; if it's interpreted as nonexistent, they belong together.

the way i read the spec https://developers.google.com/webmasters/control-crawl-index..., this behaviour is undefined. (please correct me if i'm wrong)

simple solution: don't use comments in the robots.txt file.
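The ambiguity can be probed against one concrete parser. The sketch below feeds three variants to Python's standard-library `urllib.robotparser` and asks whether Googlebot is blocked; it shows how that one implementation reads the file, which says nothing about what Google, Bing, or Yandex actually do. The `/page` path and the helper name are made up.

```python
from urllib.robotparser import RobotFileParser


def googlebot_blocked(lines):
    """True if this parser thinks Googlebot may not fetch /page."""
    parser = RobotFileParser()
    parser.parse(lines)
    return not parser.can_fetch("Googlebot", "/page")


grouped = [
    "User-agent: Googlebot",
    "User-agent: bingbot",
    "User-Agent: Yandex",
    "Disallow: /",
]

commented = [
    "User-agent: Googlebot",
    "#User-agent: bingbot",
    "User-Agent: Yandex",
    "Disallow: /",
]

# Same file, but with a truly blank line where the comment was.
blank = [
    "User-agent: Googlebot",
    "",
    "User-Agent: Yandex",
    "Disallow: /",
]

results = {
    "grouped": googlebot_blocked(grouped),
    "commented": googlebot_blocked(commented),
    "blank": googlebot_blocked(blank),
}
print(results)
```

In CPython's implementation the commented line is dropped as if it were never there, so Googlebot stays in Yandex's group, while a truly blank line in the same spot ends the group and leaves Googlebot unrestricted. Another parser is free to pick the other reading, which is the whole problem.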

also, please somebody fork and take over https://www.npmjs.org/package/robotstxt it has this undefined behaviour and it also does not follow HTTP 301 redirects (which was unspecified when i coded it) and also it tries to do too much (fetching and analysing, it should only do one thing).

by the way, my recommendation is to have a robots.txt file like this

    User-agent: *
    Disallow:

    Sitemap: http://www.example.com/your-sitemap-index.xml

and return HTTP 200
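A quick sanity check of a file like this, sketched with Python's standard `urllib.robotparser` and using the standard spelling `Disallow`: it confirms that nothing is blocked and that the sitemap line is picked up. The bot name and page path are made up, and `site_maps()` needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.example.com/your-sitemap-index.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# An empty Disallow value means every path may be crawled.
allowed = parser.can_fetch("AnyBot", "/any/page")

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()

print(allowed, sitemaps)
```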

why: if you do not have a file there, then at some point in the future you will suddenly return HTTP 500 or HTTP 200 with some response that can be misleading. also it's quite common that the staging robots.txt file spills over into the real world, this happens as soon as you forget that you have to care about your real robots.txt

also read the spec https://developers.google.com/webmasters/control-crawl-index...

blueskin_ 12 years ago |

There are enough malicious bots that do follow robots.txt to make it still an important option for most sites.

Istof 12 years ago |

500kb limit? you call that short and sweet?