Web scraping tarpits are catching legitimate data teams, not just AI crawlers

4 points by angelhadjiev 77 days ago | 5 comments

Sites are deploying infinite fake-page mazes (Nepenthes, Iocaine, etc.) to trap and poison AI training crawlers that ignore robots.txt. The motivation is understandable: Cloudflare reported that 75% of AI web traffic in mid-2025 was training-related, and nearly 60% of reputable sites now block AI bots.

The problem: tarpits don't check intent. They detect automated request patterns. If your price tracker follows links systematically, skips JS execution, or hits pages at regular intervals — it looks identical to GPTBot. The trap fires anyway.

The collateral damage is real. One Rutgers/Wharton study found sites with aggressive crawler blocking saw a 23% drop in total traffic, including human visitors.

The escalation ladder is now at step 4:

1. robots.txt (gentleman's agreement)
2. User-agent filtering
3. Behavioural detection
4. Active tarpits: waste your compute, poison your data
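
For step 1, a minimal sketch of how a well-behaved scraper might honour robots.txt before fetching, using Python's standard urllib.robotparser. The robots.txt body, the bot name example-price-bot, and the shop.example URLs are made up for illustration; a real pipeline would fetch each site's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: example-price-bot
Disallow: /checkout/
"""


def build_parser(robots_body):
    """Parse a robots.txt body that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
```

Calling may_fetch(parser, "example-price-bot", url) before every request keeps the scraper on the right side of the gentleman's agreement, although it does nothing against behavioural detection further up the ladder.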

If you're running any data pipeline at scale, you need to validate responses now. Tarpits serve plausible-looking Markov garbage. If you're not checking, it's already in your database.
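
A rough sketch of what that validation could look like for a price tracker, in Python. It assumes two things that are not from the writeup: that a genuine page should contain at least one currency amount, and that no legitimate host needs more than a fixed number of pages or link hops. The names validate_price_page and HostBudget and the default limits are hypothetical. A generated page can contain a price by chance, so this is a first filter, not proof that a page is real.

```python
import re
from urllib.parse import urlparse

# Assumption: a real price page contains at least one currency amount.
PRICE_RE = re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?")


def validate_price_page(text):
    """Return a list of problems; an empty list means the page passed."""
    problems = []
    if not PRICE_RE.search(text):
        problems.append("no price found")
    return problems


class HostBudget:
    """Caps pages and link depth per host so a maze cannot run forever."""

    def __init__(self, max_pages=500, max_depth=8):
        # Illustrative defaults; tune them to the sites you actually crawl.
        self.max_pages = max_pages
        self.max_depth = max_depth
        self.pages_seen = {}

    def allow(self, url, depth):
        """Return True and count the page if the host is still within budget."""
        host = urlparse(url).netloc
        count = self.pages_seen.get(host, 0)
        if depth > self.max_depth or count >= self.max_pages:
            return False
        self.pages_seen[host] = count + 1
        return True
```

The budget limits how much compute a maze can burn, and the content check keeps pages without the expected fields out of the database. Neither one tells you why a page failed, so failures are worth logging and reviewing by hand.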

Full writeup: https://foura.ai/blog/web-scraping-tarpits-collateral-damage

paulnpace 77 days ago

Good to know that it's working.

One solution option left out: contact the hostmaster.

angelhadjiev 76 days ago

Fair point. Direct outreach works when you can identify who to contact and they’re responsive. In practice though, most data teams are scraping hundreds of domains, not one. The hostmaster path doesn’t scale, and tarpits often get deployed at the CDN/WAF layer (Cloudflare, Vercel) where there’s no meaningful human on the other end anyway.

Curious to know have you had success with that approach at scale, or more for one-off access agreements?

paulnpace 76 days ago

It looks like your handle is trolled, because your comments don't appear flagworthy, to me.

> The hostmaster path doesn’t scale

This IS the issue - destroying servers because it's inconvenient to coordinate with the administrators. Victory on the scraper end is temporary when disrespecting the people paying for the resources, especially since a lot of those resources have been made available by developers who become emotionally motivated to curtail the efforts of the scrapers.

> tarpits often get deployed at the CDN/WAF layer (Cloudflare, Vercel)

Cloudflare and others usually have exception options.

> Curious to know have you had success with that approach at scale, or more for one-off access agreements?

I'm tiny and only run little personal stuff. I just block vast IP address blocks. For example, blocking DO (DigitalOcean) nearly eliminated all of the worst slop being sent to my servers. Similarly, I stopped serving on IPv6. From what I've read about other administrators' approaches, there is apparently something relatively easy to implement on Apache that blocks a lot of scrapers; DokuWiki was having scraper problems that were fixed by this method.
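
As a rough illustration of the range-blocking idea (not the actual setup described above), here is a Python sketch using the standard ipaddress module. The networks listed are documentation-only placeholder ranges, not any real provider's address space, and is_blocked is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges reserved for documentation (RFC 5737, RFC 3849).
# A real deny list would hold the provider ranges you chose to block.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip, networks=BLOCKED_NETWORKS):
    """Return True if the client address falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    # Compare only against networks of the same IP version (v4 vs v6).
    return any(addr.version == net.version and addr in net for net in networks)
```

In practice this kind of rule usually lives in the firewall or web server configuration instead of application code, but the membership test is the same.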

angelhadjiev 76 days ago

;] Not a bot - just sleep-deprived. Spent last night chasing a tarpit at 2am.

You're right that scraping has a bad reputation (still, although it's one of the top topics on google words), and some of it is well-deserved.

The moral framing is fair in the training-crawler context, but the article's point is about collateral damage to legitimate use cases. Price comparison, research, public data pipelines... these aren't the bad actors, they just look like them.

That's the gap worth closing in my opinion.