The problem: tarpits don't check intent. They detect automated request patterns. If your price tracker follows links systematically, skips JS execution, or hits pages at regular intervals — it looks identical to GPTBot. The trap fires anyway.
The collateral damage is real. One Rutgers/Wharton study found sites with aggressive crawler blocking saw a 23% drop in total traffic, including human visitors.
The escalation ladder is now at step 4:
1. robots.txt (gentleman's agreement)
2. User-agent filtering
3. Behavioural detection
4. Active tarpits: waste your compute, poison your data
If you're running any data pipeline at scale, you need to validate responses now. Tarpits serve plausible-looking Markov garbage. If you're not checking, it's already in your database.
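As a rough illustration of what "validate responses" could mean for a price tracker, here is a minimal sketch that rejects records whose fields don't match the shape you expect before they reach the database. The function name `validate_product`, the field names, the currency list, and the thresholds are all assumptions made up for this example, not anything from the writeup. A schema check like this won't catch every tarpit, but Markov-generated filler rarely produces a well-formed price.

```python
import re

# Assumed price format for this sketch: digits with an optional 2-digit fraction.
PRICE_RE = re.compile(r"^\d{1,6}(\.\d{2})?$")
# Hypothetical allow-list; adjust to whatever your pipeline actually handles.
KNOWN_CURRENCIES = {"USD", "EUR", "GBP"}


def validate_product(record: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []

    title = str(record.get("title", "")).strip()
    if not (3 <= len(title) <= 200):
        problems.append("title length out of range")

    price = str(record.get("price", "")).strip()
    if not PRICE_RE.match(price):
        problems.append("price not in expected format")

    if record.get("currency") not in KNOWN_CURRENCIES:
        problems.append("unknown currency")

    return problems


if __name__ == "__main__":
    good = {"title": "Blue Widget", "price": "19.99", "currency": "USD"}
    junk = {"title": "of the and quickly", "price": "quickly the", "currency": "USD"}
    print(validate_product(good))
    print(validate_product(junk))
```

Records that come back with a non-empty problem list can be quarantined for review instead of written straight to the main table.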
Full writeup: https://foura.ai/blog/web-scraping-tarpits-collateral-damage
One solution option left out: contact the hostmaster.
Curious to know: have you had success with that approach at scale, or is it more for one-off access agreements?
> The hostmaster path doesn’t scale
This IS the issue: destroying servers because it's inconvenient to coordinate with the administrators. Any victory on the scraper's end is temporary when it disrespects the people paying for the resources, especially since a lot of those resources are made available by developers who then become emotionally motivated to curtail the scrapers' efforts.
> tarpits often get deployed at the CDN/WAF layer (Cloudflare, Vercel)
Cloudflare and others usually have exception options.
> Curious to know: have you had success with that approach at scale, or is it more for one-off access agreements?
I'm tiny and only run little personal stuff. I just block vast IP address blocks. For example, blocking DO nearly eliminated all of the worst slop being sent to my servers. Similarly, I stopped serving on IPv6. I've read what other administrators are doing, and apparently there is a relatively easy method to implement on Apache that blocks a lot of scrapers; DokuWiki was having scraper problems that were fixed this way.
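For anyone wanting to see what range blocking amounts to, here is a minimal sketch of the membership test using Python's standard `ipaddress` module. The CIDR ranges are reserved documentation ranges used as placeholders, not real provider ranges, and `is_blocked` is a hypothetical helper. In practice this kind of blocking usually lives in the firewall or web server config rather than in application code.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), standing in for
# whatever provider blocks an administrator decides to refuse.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("203.0.113.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked(addr: str) -> bool:
    """Return True if addr falls inside any blocked network."""
    ip = ipaddress.ip_address(addr)
    # Compare only against networks of the same IP version.
    return any(ip.version == net.version and ip in net for net in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for candidate in ("203.0.113.45", "192.0.2.10", "2001:db8::1"):
        print(candidate, is_blocked(candidate))
```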
You're right that scraping has a bad reputation (still, even though it's one of the top topics on Google), and some of it is well-deserved.
The moral framing is fair in the training-crawler context, but the article's point is about collateral damage to legitimate use cases. Price comparison, research, public data pipelines... these aren't the bad actors, they just look like them.
That's the gap worth closing, in my opinion.