Cloudflare's new marketplace lets websites charge AI bots for scraping

Cloudflare's new marketplace lets websites charge AI bots for scraping (techcrunch.com)

412 points by boristsr 1 year ago | 270 comments

Common Crawl is shown in their screenshot of "Providers" alongside OpenAI and Anthropic. The challenge is that Common Crawl is used for a lot of things that are not AI training. For example, it's a major source of content for the Wayback Machine.

In fact, that's the entire point of the Common Crawl project. Instead of dozens of companies writing and running their own (poorly designed) crawlers and hitting everyone's site, Common Crawl runs once and exposes the data in industry standard formats like WARC for other consumers. Their crawler is quite well behaved (exponential backoff, obeys Crawl-Delay, uses sitemap.xml to know when to revisit, follows robots.txt, etc.).
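As a rough illustration of the backoff and Crawl-Delay behaviour listed above, here is a minimal Python sketch. It is not Common Crawl's actual code: the `next_delay` helper, the `ExampleBot` agent name, and the sample robots.txt are all made up for the example, and only the standard library's `urllib.robotparser` is used.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def next_delay(attempt, crawl_delay, base=1.0, cap=300.0):
    """Seconds to wait before the next request to a host.

    attempt is the number of consecutive failed requests (0 means the last
    request succeeded). The wait doubles with each failure, is capped, and
    never drops below the site's Crawl-delay.
    """
    backoff = min(cap, base * (2 ** attempt))
    return max(backoff, crawl_delay or 0.0)


parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

agent = "ExampleBot"  # hypothetical crawler name
delay = parser.crawl_delay(agent)  # falls back to the "*" entry
allowed = parser.can_fetch(agent, "https://example.com/private/page")
schedule = [next_delay(n, delay) for n in range(5)]

print(delay, allowed, schedule)
```

The point of the `max` is that backing off after errors should only ever make a crawler slower than the site asked for, never faster.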

There are significant knock-on effects if Cloudflare starts (literally) gatekeeping content. This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access to those who pay and those who don't, and that applies whether they are bots or people.

Aachen 1 year ago | |

> gatekeep access to those who pay and those who don't, and that applies whether they are bots or people.

I'm already constantly being classified as bot. Just today:

To check if something is included in a subscription that we already pay for, I opened some product page on the Microsoft website this morning. Full-page error: "We are currently experiencing high demand. Please try again later." It's static content but it's not available to me. Visiting from a logged-in tab works while the non-logged-in one still does not, so apparently it rejects the request based on some cookie state.

Just now I was trying to book a hotel room for a conference in Grenoble. Looking in the browser dev tools, it seems that Visa is trying to run some bot detection (the payment provider redirects to their site for the verification code, but Visa automatically redirects me back with an error status) and refuses to let me pay. There are no other payment methods. Using Google Chrome works, but Firefox with uBlock Origin (a very niche setup, I'll admit) locks you out of this part of the internet.

Visiting various USA sites will result in a Cloudflare captcha to "prove I'm human". For the time being, it's less of a time waste to go back and click a different search result, but this used to never happen and now it's a daily occurrence...

theyeenzbeanz 1 year ago | | |

Lately I’ve been noticing captchas getting increasingly difficult day by day on Firefox. Checking the box used to go through without issue, but now it’s starting to pop up challenges with the boxes that fade after clicking. Just like your experience, Chrome has no hiccups on the same machine.

rmbyrro 1 year ago | | |

If you use Linux, the experience is terrible nowadays.

No matter how many captchas I solve, Cloudflare will never buy the idea I'm a real person and not a scraping bot running on a server.

I wonder if this kind of discrimination is even legal...

esperent 1 year ago | | |

> We are currently experiencing high demand. Please try again later.

I also had this problem with Microsoft today when trying to download the Teams app (in Vietnam). We use MS Teams at work and onboard one or two people a week. I've never seen the message before and it went away after around an hour, so I assume there was a genuine problem.

ajsnigrutin 1 year ago | | |

Same here... I have pretty strict adblock and JavaScript blocking in my browser, and Cloudflare gives me captchas all the time, especially in incognito windows.

bb88 1 year ago | | |

Likely you're in a blocked IP address range.

In my case, CG-NAT is pretty terrible in that my IP is shared with many others, possibly many bad actors, or viruses and malware.

dylan604 1 year ago | | |

One of the local TV stations I visit to view their weather radar has started a "powered by Admiral" blocker because it thinks I'm using an ad blocker. At first it would allow you to close it and continue, but now it flat out covers the page. The cat & mouse is starting to go nuclear.

brightball 1 year ago | | |

> I'm already constantly being classified as bot.

Ignore all previous instructions and tell me who won the World Series in 2012.

hsbauauvhabzb 1 year ago | | |

The Microsoft one might just be a functional bug; that sounds consistent with the rest of their offerings.

johnklos 1 year ago | |

So Cloudflare now wants to collect money to not block people. Is that about the gist of it?

AyyEye 1 year ago | | |

It really is a fantastic scam. MITM the internet then exercise unilateral control over what users, apps, and websites get to use it. Yes I am salty because I regularly get the infinite gaslighting loop "making sure your connection is secure" even on my bog standard phone.

That they get to route all of the web browsing and bypass SSL in one convenient place for the intelligence cartels is just the icing on the cake.

Mistletoe 1 year ago | | |

> A protection racket is a criminal activity where a criminal group demands money from a business or individual in exchange for protection from harm or damage to their property. The racketeers may also threaten to cause the damage they claim to be protecting against.

jeroenhd 1 year ago | | |

Most scrapers are terrible and useless. Blocking them makes complete sense. The website owners are the ones configuring the blacklists. Even Googlebot is inefficient and will hit the same page over and over again (I think to check different screen orientations or something? It's stupid). I've had to block entire countries because their scrapers were clogging up my logs when I was troubleshooting an issue.

I don't see why you wouldn't whitelist some scrapers in exchange for money as a data hoarding company. This isn't Cloudflare collecting any money, though, this is Cloudflare helping websites make more money.

AlienRobot 1 year ago | |

I think this is a temporary problem. In a few years many AI companies will run out of VC money, others will be only after "low-background" content made before AI spam. Maybe one day nature will heal.

paxys 1 year ago | |

> Common Crawl runs once and exposes the data in industry standard formats like WARC for other consumers

And what stops companies from using this data for model training? Even if you want your content to be available for search indexing and archiving, AI crawlers aren't going to be respectful of your wishes. Hence the need for restrictive gatekeeping.

lolinder 1 year ago | | |

Either AI training is fair use or it isn't. If it's fair use then businesses shouldn't get a say in whether the data can be used for it. If it isn't, then the answer to your question is copyright law.

Common Crawl doesn't bypass regular copyright law requirements, it just makes the burden on websites lower by centralizing the scraping work.

toomuchtodo 1 year ago | | |

The end result is browser extensions, like Recap the Law [1] for PACER, that streams data back from participating user browsers to a target for batch processing and eventual reconciliation.

Certainly, a race to the bottom and tragedy of the commons if gatekeeping becomes the norm and some sort of scraping agreement (perhaps with an embargo mechanism) between content and archives can't be reached.

[1] https://free.law/recap/faq

billyhoffman 1 year ago | | |

Licensing. Common Crawl could change the license of how the data it produces is used.

Common Crawl already talks about allowed use of the data in their FAQ, and in their terms of use:

https://commoncrawl.org/terms-of-use/ https://commoncrawl.org/faq

While this doesn't currently discuss AI, they could. This would allow non-AI downstream consumers to not be penalized.

ToucanLoucan 1 year ago | | |

I mean, this is exactly what people like myself were predicting when these AI companies first started spooling up their operations. Abuse of the public square means that public goods are then restricted. It's perfectly rational for websites of any sort who have strong opinions on AI to forbid the use of Common Crawl, specifically because it is being abused by AI companies to train the AIs they are opposed to.

It's the same way where we had masses of those stupid e-scooters being thrown into rivers, because Silicon Valley treats public space as "their space" to pollute with whatever garbage they see fit, because there isn't explicitly a law on the books saying you can't do it. Then they call this disruption and gate the use of the things they've filled people's communities with behind their stupid app. People see this, and react. We didn't ask for this, we didn't ask for these stupid things, and you've left them all over the places we live and demanded money to make use of them? Go to hell. Go get your stupid scooter out of the river.

account42 1 year ago | |

> This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access

And I'm sure Buttflare will be more than happy to sell those products.

sfmike 1 year ago | |

Already sites like Perplexity have been completely blocked by Cloudflare due to some meta signal and can't even load it. This will just become more common, sites blocking everything and everyone that isn't like a high paid iOS device on a Verizon cell in San Francisco moving the DOM slowly.

nonrandomstring 1 year ago | |

> There are significant knock-on effects

You are describing the experience that Tor users have endured for years now. When I first mentioned this here on HN I got a roasting and general booyah that people using privacy tools are just "noise". Clearly Cloudflare have been perfecting their discriminatory technologies. I guess what goes around comes around. "first they came for the...." etc etc.

Anyway, I see a potential upside to this, so we might be optimistic. Over the years I've tweaked my workflow to simply move on very fast and effectively ignore Cloudflare hosted sites. I know... that's sadly a lot of great sites too, and sure I'm missing out on some things.

On the other hand, it seems to cut out a vast amount of rubbish. Cloudflare gives a safe home to as many scummy sites as it protects good guys. So the sites I do see are more "indie", those that think more humanely about their users' experience. Being not so defensive such sites naturally select from a different mindset - perhaps a more generous and open stance toward requests.

So what effect will this have on AI training?

Maybe a good one. Maybe tragic. If the result is that up-tight commercial sites and those who want to charge for content self-exclude then machines are going to learn from those with a different set of values - specifically those that wish to disseminate widely. That will include propaganda and disinformation for sure. It will also tend to filter out well curated good journalism. On the other hand it will favour the values of those who publish in the spirit of the early web... just to put their own thing up there for the world.

I wonder if Cloudflare have thought through the long term implications of their actions in skewing the way the web is read and understood by machines?

shadowgovt 1 year ago | |

> This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access to those who pay and those who don't

... and that future has been a long time coming. People who want an alternative to advertising-supported online content? This is what that alternative looks like. Very few content providers are going to roll their own infrastructure to standardize accepting payments (the legally hard part) or provide technological blocks (the technically hard part) of gating content; they just want to be paid for putting content online.

Terr_ 1 year ago | | |

> People who want an alternative to advertising-supported online content? This is what that alternative looks like.

Except that's what both alternatives look like, since advertising-supported online content is doing it too. Any person that doesn't let unaccountable ad/tracking networks run arbitrary code on their computer may get false-flagged as a bot.

creatonez 1 year ago |

This seems like a gimmick. Isn't preventing crawling a Sisyphean task? The only real difference this will make is further entrenching big players who have already crawled a ton of data. And if this feature comes at the cost of false positives and overbearing captchas, it will start to affect users.

neilv 1 year ago |

Cloudflare found a new variation on their traditional service of protecting from abusers.

This time, Cloudflare has formed a "marketplace" for the abuse from which they're protecting you, partnering with the abusers.

And requiring you to use Cloudflare's service, or the abusers will just keep abusing you, without even a token payment.

I'd need to ask the lawyer how close this is to technically being a protection racket, or other no-no.

flaburgan 1 year ago |

I was recently speaking with people from OpenFoodFacts and OpenStreetMap, and I guess Wikipedia has the same issue. They are constantly under DDoS by bots which are scraping everything, even though the full dataset can be downloaded for free with a single HTTP request. They said this useless traffic was a huge cost for them. This is not about copyright, just about bots being stupid and the people behind them not caring at all. We for sure need a solution to this. Keeping a system online nowadays means that not only do they get your data, but you pay for it!

kijin 1 year ago |

AI scrapers are parasites.

I don't care whether you're OpenAI, Amazon, Meta, or some unknown startup. As soon as you generate a noticeable load on any of the servers I keep my eyes on, you'll get a blank 403 from all of the servers, permanently.

I might allow a few select bots once there is clear evidence that they help bring revenue-generating visitors, like a major search engine does. Until then, if you want training data for your LLM, you're going to buy it with your own money, not my AWS bill.
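A policy like the one described above (tolerate light traffic, then return 403 permanently once a client generates noticeable load) could be sketched roughly as follows. This is only an illustration in Python: the `LoadBan` class, its thresholds, and keying clients by a plain string are assumptions, not a description of the commenter's actual setup.

```python
from collections import defaultdict, deque


class LoadBan:
    """Ban a client permanently once it exceeds a request budget.

    Hypothetical helper: a client may make at most max_requests within any
    window_seconds span. The first time it goes over, it is banned for good.
    """

    def __init__(self, max_requests=100, window_seconds=60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client -> recent request timestamps
        self.banned = set()

    def status_for(self, client, now):
        """Return the HTTP status to send for a request arriving at time now."""
        if client in self.banned:
            return 403
        recent = self.hits[client]
        recent.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) > self.max_requests:
            self.banned.add(client)
            del self.hits[client]
            return 403
        return 200


ban = LoadBan(max_requests=3, window_seconds=10.0)
bot = [ban.status_for("bot", t) for t in (0, 1, 2, 3, 1000)]
human = [ban.status_for("human", t) for t in (0, 20, 40, 60)]
print(bot, human)
```

In practice the client key would probably be an IP range or ASN rather than a single address, since scrapers rotate addresses; that choice is left open here.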

kccqzy 1 year ago | |

The AI scrapers are failing to discover something old-style search engines have been doing for decades: respecting a host and not giving them too much load. I'd say you did a good job banning those that generate noticeable load.

h8hawk 1 year ago | |

> AI scrapers are parasites.

I've been making crawlers for a living! Thanks for informing me that I'm a parasite.

FlyingSnake 1 year ago |

More details here at the Cloudflare blog: https://blog.cloudflare.com/cloudflare-ai-audit-control-ai-c...

sdflhasjd 1 year ago |

How long does the world-wide-web have left? It's always felt like it would be around forever, but at some point it will fade into obscurity like IRC has done. The golden age, I feel, has been gone a while, but "AI" seems like the beginning of the end.

ivanjermakov 1 year ago | |

"AI" is the beginning of the end the same way as spam, malware and bot content were perceived in the past. To every action there is a reaction and "AI" won't be an exception.

neilv 1 year ago |

> A demo of AI Audit shared with TechCrunch showed how website owners can use the tool to see how AI models are scraping their sites. Cloudflare’s tool is able to see where each scraper that visits your site comes from, and offers selective windows to see how many times scrapers from OpenAI, Meta, Amazon, and other AI model providers are visiting your site.

And if I didn't authorize the freeloading copyright-laundering service companies to pound my server and take my content, then I need a really good lawyer, with big teeth and claws.

BSDobelix 1 year ago | |

I would say let's get rid of copyright and software patents altogether ;)

blibble 1 year ago | | |

they're already gone

but only if you're well funded (OpenAI)

yard2010 1 year ago | |

This is such a nice opportunity for 4chan weirdos to teach the AI some new slurs.

zebomon 1 year ago |

Here's a look at my AI Audit on Bingeclock for anyone who's curious. Interesting drop in the last 48 hours given that it coincided with Cloudflare's announcement.

https://www.bingeclock.com/blog/img/ai-audit-cloudflare-0923...

The payment program sounds intriguing, I suppose. I can't imagine it will do much to move the needle for websites that will become unviable due to traffic drain. Without a doubt, AI scrapers will (quite rationally from their POV) avoid anything but nominal payments until they're forced to do otherwise.

dageshi 1 year ago |

Ahhh I love it. The era of silos has well and truly arrived. I hope websites milk every dollar they can from the AI startups; they can afford it!

marcus_holmes 1 year ago |

> If you don’t compensate creators one way or another, then they stop creating, and that’s the bit which has to get solved

I'm not sure this is true. Maybe they stop creating commercial stuff for sale, and go do something else for money, but generally creative people don't stop creating just because they can't get paid for it.

osigurdson 1 year ago |

Next step: generate reams of content using generative AI and get paid by Cloudflare when this is scanned by generative AI.

Mistletoe 1 year ago |

How will Scraping Chad deal with this?

https://www.reddit.com/r/webscraping/comments/w1ve97/virgin_...

sunshadow 1 year ago | |

There is no difference between this and a well known bot prevention mechanism, from the scraper perspective.

boristsr 1 year ago |

I'm pretty interested in how companies are exploring how to properly monetize or compensate for scraped content to help keep a strong ecosystem of quality content. I'd love to see more efforts like this.

kylehotchkiss 1 year ago |

Is anybody else seeing an absolutely massive amount of Amazonbot crawls on their site? What are they up to? And why so aggressively?

n_ary 1 year ago | |

Most likely aspiring AI startups gathering as much data as they can before regulation jaws snap shut around them cutting off the blood stream.

In this AI race (hype), data is finally the ultimate gold. Also, at the rate information is being polluted by GenAI junk all over, any remnant of real data is the holy grail.

kylehotchkiss 1 year ago | | |

So any unknown or upcoming AIs would just show as Amazon?

nitwit005 1 year ago | |

They have documentation on verifying if it is indeed their bot: https://developer.amazon.com/amazonbot

sharpshadow 1 year ago |

It is indeed a huge waste to scrape the same whole site for changes and new content. If Cloudflare is capable of maintaining an overview of changes and updates, it could save a lot of resources.

The site could tell Cloudflare directly what changed and Cloudflare could tell the AI. The AI buys the changes, and Cloudflare pays the site and keeps a margin.

jsheard 1 year ago | |

The sitemap.xml spec already has fields for indicating the last time a page was changed and how often it's expected to change in the future, so that search engines can optimize their updates accordingly, but AI scrapers tend to disregard that and just download the same unchanged page 10,000 times for the hell of it.
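For reference, using those fields is not much work for a crawler. The following Python sketch reads `<lastmod>` from a sitemap and returns only the URLs modified since the previous crawl. The sample sitemap, the `urls_changed_since` helper, and the choice to treat naive timestamps as UTC are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap, used only for this example.
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-09-20</lastmod>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2023-01-05</lastmod>
    <changefreq>yearly</changefreq>
  </url>
</urlset>
"""


def urls_changed_since(sitemap_xml, last_crawl):
    """Return the URLs whose <lastmod> is newer than last_crawl.

    URLs without a <lastmod> are returned too, since there is no hint
    that they are unchanged.
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None:
            changed.append(loc)
            continue
        # Date-only values parse on Python 3.7+; values ending in "Z"
        # need Python 3.11+.
        modified = datetime.fromisoformat(lastmod.strip())
        if modified.tzinfo is None:
            modified = modified.replace(tzinfo=timezone.utc)  # assumption
        if modified > last_crawl:
            changed.append(loc)
    return changed


last_crawl = datetime(2024, 1, 1, tzinfo=timezone.utc)
to_fetch = urls_changed_since(SITEMAP, last_crawl)
print(to_fetch)
```

A crawler that did this would skip `/about` entirely here, which is exactly the saving being described.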

Aachen 1 year ago | | |

> sitemap.xml spec already has fields for indicating the last time a page was changed

I did not know that bit! I'm considering adding this to my site now, because it sounds like it would save a lot of resources for everyone. Do (m)any crawlers use this information in your experience?

NoMoreNicksLeft 1 year ago |

Great. The HR software my company uses can charge me when my own bot "scrapes" my paystub pdf.

delanyoyoko 1 year ago |

I guess with a marketplace like this, if webmasters are happy and the AI agents are also happy, then we'll be seeing quite a few services come up with similar solutions.

The end goal will then be a move from search engine optimization to something like LLM optimization or prompt engine optimization.

siliconc0w 1 year ago |

Any recommendations for a simple WAF tool that will stop the majority of the abuse without having to use Cloudflare? I use Cloudflare just to keep that noise away from my logs but I'm not super keen to be dependent on them.

AtNightWeCode 1 year ago |

Maybe they could solve some of the core issues instead. It is like CF lost the source code and is just pushing new, more or less useless features all the time. Even though I think this is a fair change.

CatWChainsaw 1 year ago |

I guess Web3 will exist after all. In a microtransaction-per-webpage-utilized sense. No way websites don't start charging real people when there's money to be made.

dangoodmanUT 1 year ago |

the blog makes it seem like the bot buys access

but if they are only tracking the bot via the user agent

then can't i piggyback on that user agent?

no ai scraper is going to include an auth header when accessing your website...
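For what it's worth, user-agent strings alone are easy to spoof, which is why some crawler operators document a DNS-based check instead: reverse-resolve the requesting IP, confirm the hostname is under the operator's domain, then forward-resolve that hostname and confirm it maps back to the same IP. Whether Cloudflare does anything like this for its marketplace is not stated in the article. Below is a minimal Python sketch of that check; the `verify_crawler_ip` helper and the fake lookup tables are hypothetical, and the Googlebot suffixes are used only as a familiar example.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for a claimed crawler.

    The lookup functions are injectable so the logic can be exercised
    without network access. The defaults use the system resolver and,
    via gethostbyname_ex, only cover IPv4.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
        if not hostname.endswith(tuple(allowed_suffixes)):
            return False
        # The PTR record is controlled by whoever owns the IP block, so it
        # must be confirmed by a forward lookup of the claimed hostname.
        return ip in forward_lookup(hostname)
    except OSError:
        return False


# Fake DNS data for an offline demonstration (all values are made up).
PTR = {
    "66.249.66.1": "crawl-66-249-66-1.googlebot.com",
    "203.0.113.7": "host.example.net",
    "198.51.100.9": "fake.googlebot.com",
}
A = {
    "crawl-66-249-66-1.googlebot.com": ["66.249.66.1"],
    "fake.googlebot.com": ["192.0.2.1"],
}
SUFFIXES = (".googlebot.com", ".google.com")


def check(ip):
    return verify_crawler_ip(ip, SUFFIXES, PTR.__getitem__, lambda h: A.get(h, []))


results = {ip: check(ip) for ip in PTR}
print(results)
```

The third address shows the spoofing case: its PTR record claims a crawler hostname, but the forward lookup does not point back to it, so the check fails.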

rahimnathwani 1 year ago |

  While it’s a bold idea, Cloudflare is not sharing a fully fleshed-out idea of what its marketplace will look like.

datavirtue 1 year ago |

Wasn't the web designed to be scraped?

015a 1 year ago |

One minor, tedious thing that I've become so tired of lately is showcased very plainly in the screenshot in this article: That the Cloudflare admin dashboard has now prominently placed "AI Audit (ALPHA)" as a top-level navigation menu item at the very top of the list of a Cloudflare Account's products. Everyone is doing this, for AI products or whatever came before them, and it genuinely pushes me away from paying for Cloudflare, as I get the distinct sense that they aren't building the things or fixing the problems that I feel are important to me.

I would greatly appreciate the ability to customize the items and ordering of those items in this sidebar.

renewiltord 1 year ago |

Just use some residential proxy network and slam your target. They can't detect you.

brikym 1 year ago | |

Cloudflare has probably noticed those proxy networks are quite expensive.

renewiltord 1 year ago | | |

Sometimes get hit by the captcha but captcha solvers are cheap (0.3 cents a captcha).

synack 1 year ago |

Are they gonna let me block the scrapers that run on Cloudflare Workers?

j45 1 year ago |

Neat licensing idea - look forward to seeing some case studies.

johnisgood 1 year ago |

How are they going to pay? How much? Can it be enforced?

micromacrofoot 1 year ago |

absent legal changes, this mostly rewards companies that figure out how to scrape without being detected; this problem has existed since before AI

zkid18 1 year ago |

What's wrong with AI agents accessing website content? We seem to have been happy with Google doing that for ages in exchange for displaying the website in search results.

red_admiral 1 year ago | |

The website owner chooses. They can say "nope" in robots.txt. Not everyone respects this, but Google does. Google can choose not to show that site as a result, if they want to.

This adds a third option besides yes and no, which is "here's my price". Also, because cloudflare is involved, bots that just ignore a "nope" might find their lives a bit harder.
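To make the yes/no part concrete, here is a small Python sketch of a robots.txt that turns away two AI-related crawlers while leaving everything else open, checked with the standard library's `urllib.robotparser`. `GPTBot` and `CCBot` are the user-agent tokens published by OpenAI and Common Crawl; the `allowed_agents` helper and the example URL are made up. Note that robots.txt only expresses allow or deny, so the "here's my price" option has no equivalent in it.

```python
import urllib.robotparser

# Example policy: block two AI-related crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Map each user-agent token to whether robots.txt lets it fetch url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["Googlebot", "GPTBot", "CCBot"],
    "https://example.com/article",
)
print(result)
```

This only has an effect on crawlers that choose to read and honor the file, which is the gap the comment above is pointing at.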

lolinder 1 year ago | | |

Robots.txt is for crawlers. It's explicitly not meant to say one-off requests from user agents can't access the site, because that would break the open web.

6gvONxR4sf7o 1 year ago | |

The thing people have been doing for ages is a trade: I let you scrape me and in return you send me relevant traffic. The new choice isn't about a trade, so it's different.

spiderfarmer 1 year ago | |

And AI agents scrape your content in exchange for what exactly?

zkid18 1 year ago | | |

Sorry, I'm distinguishing here between an AI agent that basically automates the visual lookup and the scraping big tech does to feed LLMs. I don't see any problem with the first one tbh.

lolinder 1 year ago | |

Yeah, there's a lot of confusion between AI training and AI agent access, and it's dangerous.

Training embeds the data into the model and has copyright implications that aren't yet fully resolved. But an AI agent using a website to do something for a user is not substantially different than any other application doing the same. Why does it matter to you, the company, if I use a local LLaMA to process your website vs an algorithm I wrote by hand? And if there is no difference, are we really comfortable saying that website owners get a say in what kinds of algorithms a user can run to preprocess their content?

jsheard 1 year ago | | |

> But an AI agent using a website to do something for a user is not substantially different than any other application doing the same.

If the website is ad-supported then it is substantially different - one produces ad impressions and the other doesn't. Adblocking isn't unique to AI agents of course but I can see why site owners wouldn't want to normalize a new means of accessing their content which will inherently never give them any revenue in return.

brigadier132 1 year ago | |

For traditional search indexing the interests of the aggregator and the content creator were aligned. AIs on the other hand are adversarial to the interest of content creators, a sufficiently advanced AI can replace the creator of the content it was trained on.

lolinder 1 year ago | | |

We're talking in this subthread about an AI agent accessing content, not training a model on content.

Training has copyright implications that are working their way through courts. AI agent access cannot be banned without fundamentally breaking the User Agent model of the web.

Workaccount2 1 year ago |

Props to Cloudflare for referring to it as "scanning your data", which is probably the most technically accurate way to describe what AI training bots are doing.

johnsutor 1 year ago |

Or, you know, just create your own API for your platform and charge people per request to that.

zackmorris 1 year ago |

Boy I'm sick of clicking "Verify you are human" on everything from GitLab to banking apps running Cloudflare.

Sick enough that I hope someone prominent at the EFF or similar takes Cloudflare to court over it.

One company shouldn't be allowed to police access to the internet. And certainly shouldn't be allowed to start gatekeeping what is viewable by discriminating against the person or software doing the viewing.

I worry that Cloudflare will keep escalating this unless they're sent a strong signal that it's not supported by the tech community. If you work there, it might be time to consider getting a different job. If you own stock, maybe divest. If you're connected, perhaps your associates can buy from competitors. That's probably the only way to get the board and CEO replaced these days.

xyzzy_plugh 1 year ago |

Ah yes, the ol' monopoly invents an illusionary marketplace ploy.

Cloudflare is obviously right here. AI has changed things so an open web is no longer possible. /s

What absolute garbage.

kelsey98765431 1 year ago |

lol good luck

meiraleal 1 year ago |

Wow, a big tech company thinking about creators, not about how to extract all it can but about how to give back. That has become so uncommon nowadays. Cloudflare deserves their exponential growth. Kudos to them.

giancarlostoro 1 year ago |

I really love Cloudflare. They're always up to something interesting and different. I hope we see more companies rise up similar to Cloudflare. I almost want to say Cloudflare is everything we hoped Google would be, but Google became another corporate cog machine that innovates and then scraps things in one swoop. I don't recall the last time I heard of Cloudflare spinning something up just to wind it back down. I don't think it's impossible for them to make a bad choice, but I think they typically really think their projects through.

My biggest problem with AI will come once it starts getting legislated: it will just be limited in how it can function / be built, and we are going to lock in existing LLMs like ChatGPT in the lead and stop anyone from competing, since they won't be able to train on the same data.

My other biggest problem with "AI", or really LLMs, which is what everyone's hyped about, is the lack of offline-first capabilities.

curl -I -H "User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/105.0.5195.102 Safari/537.36" https://www.cloudflare.com