Amazonbot is finally respecting robots.txt (xeiaso.net)
At least, it claimed to be AmazonBot…
"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"
"x-forwarded-for":"44.210.204.255" "x-real-ip":"44.210.204.255"
This is a bit outside my area of expertise, so I don't know how reliable these x-forwarded-for and x-real-ip are.
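On how far those headers can be trusted: both are set by whoever sent or proxied the request, so a client can forge them unless your own reverse proxy overwrites or appends to them. A minimal sketch, assuming you know which addresses your proxies use (the `TRUSTED_PROXIES` ranges and the `client_ip` helper are made up for illustration, not taken from any particular server): read `X-Forwarded-For` from right to left and take the first hop that is not one of your own proxies.

```python
import ipaddress
from typing import Optional

# Hypothetical: networks where your own reverse proxies / load balancers live.
TRUSTED_PROXIES = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("127.0.0.0/8"),
]


def is_trusted(addr: str) -> bool:
    """True if addr parses as an IP and belongs to one of our proxy networks."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return False
    return any(ip in net for net in TRUSTED_PROXIES)


def client_ip(peer_addr: str, x_forwarded_for: Optional[str]) -> str:
    """Best-effort client address given the TCP peer and the XFF header."""
    # If the direct peer is not our proxy, the header is entirely client-supplied.
    if not x_forwarded_for or not is_trusted(peer_addr):
        return peer_addr
    hops = [h.strip() for h in x_forwarded_for.split(",") if h.strip()]
    # Rightmost entries were appended by our proxies; skip those.
    for hop in reversed(hops):
        try:
            ipaddress.ip_address(hop)
        except ValueError:
            return peer_addr  # garbage in the chain: fall back to the peer
        if not is_trusted(hop):
            return hop
    return peer_addr


# A spoofed leading entry ("1.2.3.4") is ignored:
print(client_ip("10.0.0.5", "1.2.3.4, 44.210.204.255, 10.0.0.7"))  # 44.210.204.255
```

So the headers are only as reliable as the proxy chain in front of you. If the request reached your server directly, ignore them and use the socket address.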
BOT","cluster_name":"EU","cluster_region":"EU","connection_type":"corporate","country":"US","device_type":"ROBOT","duration_ms":0.391,"duration_us":391,"filter":"","ip":"52.1.106.130","isp":"Amazon.com, Inc.","level":"info","msg":"Request evaluated","org":"Amazon.com, Inc.","os":"","ref":"","region":"Virginia","result":false,"time":"2026-05-15T13:33:20Z","ua":"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36","why":"bot"}
3.227.180.70
23.21.175.228
23.23.137.202
from all these IPs.
It's good that you mentioned this; smear campaigns are definitely not a new thing, and I suspect a lot of this DDoS'ing that's going on is a plot to accelerate towards Big Tech's authoritarian dystopia. Basically extortion.
I've also seen Google bots with AWS IP ranges. You gotta look at their ASN/ISP/ORG
They will in the future, but not today.
Did end up just adding them to our WAF blocklist, which is weirdly ironic - hosting on their infra & using their services to block their AI scraper...
Google only respected it because blocking Google from crawling your site used to hurt you more than it hurt Google.
this bit made me laugh. was the email drafted in Outlook? was it sent to some sort of forwarding mailbox, or did they just BCC every customer in?
I found a mention on some user agent trackers but no official documentation. Does anyone know if it's documented? Asking because I am seeing decent traffic (30GB/week) from this.
> Crawling behavior [...] Crawler identification: Identifies itself with user-agent string "aws-quick-on-behalf-of-<UUID>" in request headers.
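If you want to pick these requests out of your logs, a rough sketch follows. It assumes `<UUID>` means the usual 8-4-4-4-12 hex form, which the quoted text does not actually spell out, and `is_quick_ua` is just a name I made up.

```python
import re

# Assumption: <UUID> is the canonical hyphenated hex form; loosen the pattern if not.
QUICK_UA = re.compile(
    r"aws-quick-on-behalf-of-"
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def is_quick_ua(user_agent: str) -> bool:
    """True if the user-agent string contains the documented prefix plus a UUID."""
    return QUICK_UA.search(user_agent) is not None


print(is_quick_ua("aws-quick-on-behalf-of-123e4567-e89b-12d3-a456-426614174000"))  # True
print(is_quick_ua("Mozilla/5.0 (compatible; Amazonbot/0.1)"))  # False
```

Like any user agent, this only tells you what the client claims to be.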
Maybe people found a way of using it as a loophole for something or Amazon Quick is just picking up in usage, and your website is popular amongst whoever uses that sort of stuff.
It has AI agents included so I guess this can just come from it searching the web based on user requests.
> Amazonbot is used to improve our products and services. This helps us provide more accurate information to customers and may be used to train Amazon AI models.
They've been getting some heat on it lately, but I find it hard to believe they're going to give up entirely? And if so, what's to stop someone from just flouting their rules on pricing, and then doing the robots.txt thing to prevent issues?
By that, I mean the types of crawls that can hog up significant usage.
The traffic isn't a problem. I've got Cloudflare in front and the machine itself is relatively overpowered, and downtime isn't critical. But I'd just like the thing to be able to spider me properly. Someone did point out to me that maybe I wasn't receiving actual Amazonbot but some other spider: https://news.ycombinator.com/item?id=46352723
Cloudflare had a nice technique to address the bot problem (if you use their name servers). It'll respect and use the robots.txt while sending the remaining bots to a deep black hole.
That said, one of the biggest websites in the world not respecting it is definitely a noteworthy story. Hopefully another one of the biggest websites in the world (formerly known as Twitter) eventually respects it as well instead of not even disclosing itself via a user agent and pretending to be Safari running on iOS.
You're talking about one (yes, the biggest), but the millions of other bots that don't follow it must be a bigger story.
The people trying to use it to block or limit bots are uninformed and/or misinformed.
It would be rather nifty if Amazon and other companies would confine AI crawlers to specific CIDR blocks or a dedicated ASN, but I would not hold my breath on that one. AI crawlers will likely muddy the waters for everyone else.
If a CDN does not have an option to block cloud and Tor CIDR blocks then that should be a feature request.
44.210.204.255 is included in 44.192.0.0/10 which is listed in the AWS CIDR ranges. Use one of the online subnet calculators to find IP ranges of CIDR blocks. This is likely a Tor exit node.
Blocking the CIDR blocks I listed in the thread would have included this node as well. Here [3] are a few shell functions for getting some of the cloud CIDR blocks. I must have been inebriated when I wrote those. This site may not be reachable during blood moons or when the nanosecond is divisible by zero.
Here [4a][4b] are a couple of decent subnet calculators. There are also command-line tools for playing with CIDR blocks and checking whether an IP address falls inside one, but these vary by Linux distribution, so perhaps look for a generic Python script.
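For the generic script, the standard-library `ipaddress` module is enough. A minimal sketch (the helper names `ip_in_cidrs` and `cidr_range` are mine, and the second block in the example is just an arbitrary non-matching range):

```python
import ipaddress
from typing import List, Tuple


def ip_in_cidrs(ip: str, cidrs: List[str]) -> List[str]:
    """Return the CIDR blocks from `cidrs` that contain `ip`."""
    addr = ipaddress.ip_address(ip)
    return [c for c in cidrs if addr in ipaddress.ip_network(c, strict=False)]


def cidr_range(cidr: str) -> Tuple[str, str]:
    """First and last address of a CIDR block, like a subnet calculator shows."""
    net = ipaddress.ip_network(cidr, strict=False)
    return str(net[0]), str(net[-1])


print(cidr_range("44.192.0.0/10"))
# ('44.192.0.0', '44.255.255.255')
print(ip_in_cidrs("44.210.204.255", ["44.192.0.0/10", "10.0.0.0/8"]))
# ['44.192.0.0/10']
```

`strict=False` lets you pass a block with host bits set (for example `44.210.204.255/10`) without raising an error.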
To get a list of Tor exit nodes to blackhole route, look at [5]. This updates often. Just clone the entire repo. Unless your site is related to government dissent or anonymous porn then most traffic from Tor exit nodes will likely just be bots and thus riff-raff.
Seconds after I linked realhackers, bots showed up and got a zero-byte response. Poor lil HN servers must get a lot of trash nonstop. I hope I get some delicious bots today.
[1] - https://bgp.tools/
[2] - https://bgp.tools/as/14618
[3] - https://ai.realhackers.org/_get_cloud_cidr.txt
[4a] - https://mxtoolbox.com/subnetcalculator.aspx
[4b] - https://www.vultr.com/resources/subnet-calculator/
[5] - https://github.com/firehol/blocklist-ipsets/blob/master/clea...
- play the game of whack-a-mole
- use difficult implementations of user validation checks that potentially cause pain for real humans
- block all Amazon CIDR blocks, which they know most corporations will not do
This forces the majority to just tolerate whatever comes out of their networks.