Robots.txt Disallow: 20 Years of Mistakes To Avoid (beussery.com)
User-agent: ia_archiver
Disallow: /
Those two lines mean that all content hosted on the entire site will be blocked from the Internet Archive (archive.org) Wayback Machine, and the public will be unable to look at any previous versions of the website's content. It wipes out a public view of the past. Yeah, I'm looking at you, Washington Post: http://www.washingtonpost.com/robots.txt
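For anyone who wants to see how a rule-following crawler reads those two lines, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `is_allowed`, the `SomeOtherBot` agent and the example.com URL are made up for illustration; this only shows how one parser interprets the rules, not how any particular crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted above, as they would appear in a robots.txt file.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this parser says user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URL and second bot name, used only for the demonstration.
archiver_allowed = is_allowed(ROBOTS_TXT, "ia_archiver", "http://www.example.com/any/page")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "http://www.example.com/any/page")

print(archiver_allowed)  # False: every path is disallowed for ia_archiver
print(other_allowed)     # True: no group applies to other agents
```

The rule only names `ia_archiver`, so under this parser every other agent is still free to crawl the whole site.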
Banning access to history like that is shameful.
http://www.quora.com/robots.txt
Here is their explanation (in the robots.txt file)
"We opt out of the wayback machine because inclusion would allow people to discover the identity of authors who had written sensitive answers publicly and later had made them anonymous, and because it would prevent authors from being able to remove their content from the internet if they change their mind about publishing it. As far as we can tell, there is no way for sites to selectively programmatically remove content from the archive and so this is the only way for us to protect writers. If they open up an API where we can remove content from the archive when authors remove it from Quora, but leave the rest of the content archived, we would be happy to opt back in. See the page here: https://archive.org/about/exclude.php"
"Meanwhile, if you are looking for an older version of any content on Quora, we have full edit history tracked and accessible in product (with the exception of content that has been removed by the author). You can generally access this by clicking on timestamps, or by appending "/log" to the URL of any content page."
Authors don't get the right to go around removing their novels from public libraries just because they would rather the books be available only for pay in bookstores.
Why do you think it is legal to then go ahead and slurp it?
We do, however, have the right to criticize people who ban IA from their site.
If it's on their bandwidth and power, why not?
And people wonder why alternative search engines have such a hard time taking off.
Google is somewhere between 50-90% of most sites' search referrals (source: /dev/ass). Add in a handful of other search engines (Bing, DDG, Yahoo, Ask) and you've pretty much got all of it.
They're maybe 10-20% of your crawl traffic though. And possibly a lot less than that.
There are a TON of bots out there. If you're lucky, they just fill your logs and hammer your bandwidth.
If you're not so lucky, they break your site search, overload your servers, and if you're particularly unlucky, they wake you up with 2:30 am pages for two weeks straight.
At which point the simplest way to solve the technical problem, that is, you getting a full night's sleep, is to ban every last fucking bot but Google. Or maybe a handful of the majors.
Now, of course, you're a data-driven operation and you're relying on Google Analytics to tell you who's sending traffic your way. But if you block a search crawler, it's going to stop sending you traffic, so you won't know it's important.
It's a rather similar set of logic that drives people to set email bans on entire ccTLDs or ASN blocks for foreign countries. And if you're a smallish site, it's probably a decent heuristic. And no, it's not just fucking n00bs who do this. Lauren Weinstein, who pretty much personally birthed ARPANET at UCLA, was bitching on G+ just a week or so back that the new set of unlimited TLDs ICANN were selling were rapidly going into his mailserver blocklists. Because, of course, the early adopters of such TLDs tend to be spammers, or at least, the early adopters he's likely to hear from.
https://plus.google.com/114753028665775786510/posts/SsgPNHLG...
"Some sites try to communicate with Google through comments in robots.txt"
In the examples given, none appear to be trying to "communicate with Google through comments" - how is including...
# What's all this then?
# \
#
# -----
# | . . |
# -----
# \--|-|--/
# | |
# |-------|
...a "mistake" to avoid? There's no harm in it at all.
I thought that was the whole point of robots.txt
lots of target-detection crawlers will look at robots.txt as the first thing they do to see if there's any fun pages you don't want the other crawlers to see
That said, obscurity is not really security. Your admin pages should be behind a password, which, if coded properly, will exclude spiders, bots, and bad guys.
Alongside tagging links to such resources with nofollow.
The robots exclusion protocol is a ridiculous anachronism. I don't use it and neither should you.
Spiders have to be robust against sites with unlimited numbers of internal links anyway, or else an attacker could trap a web spider with a malicious site, or a 13 year old writing a buggy PHP app could take down Google's entire spidering system.
the article is pretty much correct (although strangely worded at times), the stuff about "communicating via robots.txt comments to google" is of course not true. the examples he gives are developer jokes, nothing more.
still, you should not use comments in the robots.txt, why?
you can group user agents i.e.:
User-agent: Googlebot
User-agent: bingbot
User-Agent: Yandex
Disallow: /
Congrats, you have just disallowed googlebot, bingbot and yandex from crawling (not indexing, just crawling)
ok, now:
User-agent: Googlebot
#User-agent: bingbot
User-Agent: Yandex
Disallow: /
so well, you have definitely blocked yandex, you do not care for bingbot (commented out), but what about googlebot? are googlebot and yandex part of one user-agent group? or is googlebot its own group and yandex its own group? if the commented line is interpreted as a blank line, then googlebot and yandex are different groups; if it's interpreted as non-existent, they belong together. the way i read the spec https://developers.google.com/webmasters/control-crawl-index..., this behaviour is undefined. (please correct me if i'm wrong)
simple solution: don't use comments in the robots.txt file.
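to make the two readings concrete, here is a small sketch with Python's standard-library `urllib.robotparser`. it only shows what this one parser does; other parsers (including whatever the search engines run) may decide differently. the helper `can_crawl` and the constant names are made up for the example, and the second file replaces the commented line with a real blank line to simulate the "interpreted as blank line" reading.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if this parser lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# The file from the comment above: one user-agent line is commented out.
WITH_COMMENT = """\
User-agent: Googlebot
#User-agent: bingbot
User-Agent: Yandex
Disallow: /
"""

# Same file, but with an actual blank line where the comment was.
WITH_BLANK_LINE = """\
User-agent: Googlebot

User-Agent: Yandex
Disallow: /
"""

for label, text in (("comment", WITH_COMMENT), ("blank line", WITH_BLANK_LINE)):
    for bot in ("Googlebot", "bingbot", "Yandex"):
        print(label, bot, can_crawl(text, bot))
```

with this parser the commented line is dropped as if it did not exist, so googlebot and yandex stay in one group and both are blocked. with a real blank line the googlebot line loses its group and googlebot is allowed. same looking file, two different outcomes, which is exactly the problem.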
also, please somebody fork and take over https://www.npmjs.org/package/robotstxt it has this undefined behaviour and it also does not follow HTTP 301 requests (which was unspecified when i coded it) and also it tries to do too much (fetching and analysing, it should only do one thing).
by the way, my recommendation is to have a robots.txt file like this
User-agent: *
Disallow:
Sitemap: http://www.example.com/your-sitemap-index.xml
and return HTTP 200
why: if you do not have a file there, then at some point in the future suddenly you will return HTTP 500 or HTTP 200 with some response, that can be misleading. also it's quite common that the staging robots.txt file spills over into the real world, this happens as soon as you forget that you have to care about your real robots.txt
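a quick way to sanity-check a file like that before it ships is to run it through a parser. this is a sketch using Python's standard-library `urllib.robotparser` (`site_maps()` needs Python 3.8 or newer). the bot name `AnyBot`, the paths and the `load` helper are invented for the example, and the second file shows what this particular parser does when a directive name is misspelled.

```python
from urllib.robotparser import RobotFileParser


def load(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The recommended "allow everything, advertise the sitemap" file.
ALLOW_ALL = """\
User-agent: *
Disallow:
Sitemap: http://www.example.com/your-sitemap-index.xml
"""

# A file where the directive name is misspelled (hypothetical /private path).
MISSPELLED = """\
User-agent: *
Dissalow: /private
"""

allow_all = load(ALLOW_ALL)
print(allow_all.can_fetch("AnyBot", "/some/page"))  # True: empty Disallow blocks nothing
print(allow_all.site_maps())  # ['http://www.example.com/your-sitemap-index.xml']

misspelled = load(MISSPELLED)
print(misspelled.can_fetch("AnyBot", "/private"))  # True: the unknown directive is ignored
```

the last line is the one to watch: this parser drops a directive it does not recognise without any warning, so a typo can leave a path crawlable that you meant to block.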
also read the spec https://developers.google.com/webmasters/control-crawl-index...
GAH!! So it's you who writes those horrible sites?
I want to be able to middle click on two different URLs and browse two pages with completely different state at the same time.
I HATE sites that store state in cookies, the two different tabs start getting completely mixed up about where I am in the site.
The only thing that should be in a cookie is stuff like a shopping cart. But that's only because the action "add to cart" is like a transaction and should be remembered.
Viewing a page and changing the sort is ephemeral and should have no effect on anything else.
> Spiders have to be robust
Who cares about the spider? What about your site that got hit with an unending stream of completely useless page views?
Your position about robots.txt is simply wrong and you need to change your mind.
Politicians keep attempting to write evermore draconian qualifications and punishments into law for what qualifies as a "breach of terms of service". I would expect this to encompass robots.txt at some point if it does not already.
Again, I'm not particularly happy about this trend, but I'll try to keep out of its path of destruction.
If you're in the business of providing public content that's well-known, to the public, then allowing it to be archived makes a lot of sense.
If you're providing user-generated content I'd argue that the case for allowing archival is extended even more so. Sites that violate this, and Quora comes specifically to mind, are violating what many, myself included, consider to be part of the social contract of the Web.
On the other hand, if you're an individual, and you are posting your own content and ramblings, and circumstances change for whatever reason: you've got a job, you've lost a job, you're married, you're divorced, you're getting divorced, your child is at war in a foreign country, a foreign country is at war with yours, or you're just sick of the crap you wrote when you were young and arrogant and now, old and arrogant, you want it gone: I'm pretty willing to grant you that right.
If you've committed some terrible crime against humanity, or just a human, and have been fairly tried and convicted of it, I'd probably not give you the right to remove large bits of that information.
And yes, there are vast fields, deserts, tundras, plains, steppes, ice-fields, and oceans of grey about all of this.
Barbra Streisand got Streisanded because she is Streisand.
Ahmed's Falafel Hut likely wouldn't suffer the same fate. His Q-score is somewhat lower, and there's only so much real estate in the public consciousness.
Also, if people become aware someone is naive they automatically are dicks to them? Regardless of whether that information is actually of interest to anyone, just because someone wants to take something down, they should not ever be able to?
Some people act and think like that, yeah. But to accept this as the baseline of human behaviour is, well, not for me. This entitledness to watch the lives of others from the dark may have been bred by reality TV or whatever; but it's more a personality flaw and an addiction, a useless misfiring of synapses become culture, than a cornerstone of an information age.
If they have fears about losing revenue - and although I find them silly - there are other ways of going about it, such as only allowing access to pages some weeks or months after they've been published.
Civility is a good keyword, and while this may be a bit of a stretch, imagine sitting in public cafés and writing down what people say, and then criticizing people for lowering their voice and turning their back so you can't read their lips, even though you genuinely mean well, and just want to preserve daily public life for future historians. In general, this is what this attitude of "the internet" feeling entitled to whatever was ever posted anywhere feels to me. Maybe I just don't get it, but I really don't get it.
I think the question whether a private conversation should be recorded just because it's in public, just because you can, is kind of a no-brainer, but here are ones I don't have an answer to: Should an artist be allowed to make a performance and ask it to not be recorded? Should someone be able to hold a political speech and ask the same? For me the answers are kind of yes, and no-ish... but what about political art? Are we allowed to try to influence people, and then try to erase the traces? Now that is tricky, and I may have ended up ranting myself into agreeing more with the IA "side" of the argument than I expected to. Because either something is personal, trivial in one way or another, or commercial and/or political. Personal things I think should be respected, but commercial and political things shouldn't be, they do belong to historians. Well, fuck.
[this is why I "blog" a bit, actually -- because posting stuff online makes me think harder about it than I would otherwise, I don't even need an actual audience for that, just the possibility of one -- but that's also why I don't feel great about all of that floating around forever, it's all rather temporary in nature, a process.. and the person who wrote stuff a year ago does not exist anymore, so why should the name of this current person be attached to it?]
Perhaps individual private websites, such as pekk's, should have the right to say "No, you do NOT have the right to pound my site with requests and serve data that I decided to pull down."
However, in theory, the Washington Post's articles online are also (eventually) placed on microfiche. Saying there's no right to serve data that WP decided to pull down would in some sense require WP to "steal into libraries in the night and set fire to their microfiche collection".
I'd say my general rule is closer to: if you allow search engines, you should allow IA.
W.r.t. your last question, historically the solution to that problem was simple and elegant: people used pen names to write what they didn't want to bind permanently to them. This way is also safer from unauthorized archiving - not everyone is as respectful of the author's wishes as IA.
So sometimes an IA-friendly domain expires (e.g. accidentally or because its owner died), a squatter buys the domain, and the squatter points the domain at a junk-site landing server with a deny all robots.txt. The result is truly disastrous: IA removes access to the historic, IA-friendly site. Site acquirers who do this deliberately are pure evil.
The Internet Archive does wonderful work, but just because somebody doesn't want you folks crawling their content doesn't make them worthy of "naming and shaming"
A physical library is either getting their newspapers by asking/paying the newspaper company to deliver them, asking citizens to donate them, or collecting them from already delivered newspapers. If the IA was just piggybacking on user activity (by caching and storing things from a user's browser cache after they visit a page) then I'd have far less of a concern with them. If we're so attached to physical metaphors, this would be equivalent to the librarian running around outside the newspaper's printing room and snatching newspapers from the bundles as the company's employees loaded them onto trucks.
When IA stops wiping out historical content because of a present-day change of domain ownership, then I will have more support (and USE) for them.
IA is on iffy territory w.r.t. copyright as it is; if they stop respecting robots.txt, they could get into a world of hurt.
They're also non-commercial, broad in scope, arguably serve a valuable scholarly function and have other characteristics that have kept them mostly out of legal hot water. But it's unclear to what degree they're legally different from a site that decided to create an archive of all comics, commercial and otherwise, and slap advertising up.