This adds a third option besides yes and no, which is "here's my price". Also, because cloudflare is involved, bots that just ignore a "nope" might find their lives a bit harder.
Training embeds the data into the model and has copyright implications that aren't yet fully resolved. But an AI agent using a website to do something for a user is not substantially different than any other application doing the same. Why does it matter to you, the company, if I use a local LLaMA to process your website vs an algorithm I wrote by hand? And if there is no difference, are we really comfortable saying that website owners get a say in what kinds of algorithms a user can run to preprocess their content?
If the website is ad-supported then it is substantially different - one produces ad impressions and the other doesn't. Adblocking isn't unique to AI agents of course but I can see why site owners wouldn't want to normalize a new means of accessing their content which will inherently never give them any revenue in return.
Training has copyright implications that are working their way through courts. AI agent access cannot be banned without fundamentally breaking the User Agent model of the web.
Sick enough that I hope someone prominent at the EFF or similar takes Cloudflare to court over it.
One company shouldn't be allowed to police access to the internet. And certainly shouldn't be allowed to start gatekeeping what is viewable by discriminating against the person or software doing the viewing.
I worry that Cloudflare will keep escalating this unless they're sent a strong signal that it's not supported by the tech community. If you work there, it might be time to consider getting a different job. If you own stock, maybe divest. If you're connected, perhaps your associates can buy from competitors. That's probably the only way to get the board and CEO replaced these days.
Cloudflare is obviously right here. AI has changed things so an open web is no longer possible. /s
What absolute garbage.
My biggest problem with AI will be once it starts getting legislated, it will just be limited in how it can function / be built, we are going to lock in existing LLMs like ChatGPT in the lead and stop anyone from competing since they wont be able to train on the same data.
My other biggest problem is "AI" or really LLMs which is what everyones hyped about, is lack of offline first capabilities.
On what basis? It sucks that you can't visit those sites without going through an interstitial, but at the end of the day, those sites are essentially private property and the owners can impose whatever requirements they want on visitors. It's not any different than sites that have registration walls, for instance.
The problem isn't Cloudflare, it's that the internet is filled with ill-willed bots, and those bots seem to have infected your network or your ISPs network as well.
If ISPs did a better job taking action against infected IoT crap and spam farms, you wouldn't need to click so many CAPTCHAs.
Without Cloudflare, you'd just see a page saying "blocked because of supicious network activity" or nothing at all or a redirect shock site if the site admin is feeling particularly spicy. If anything, Cloudflare CAPTCHAs are doing you a service by being a cheap and effective alternative to mass IP range blocks.
AI sycophants have truly deluded themselves into thinking everyone else is falling for their bullshit, it's great to see.
This feature wouldn't exist if "the tech community" didn't support it. If you want someone to blame, it's the AI companies for ruining what was a good thing with their blind greed and gold rush of trying to slurp up literally everything they could get their hands on in the shittiest ways possible, not respecting any commonly agreed upon rules like robots.txt
I don't think that it's not supported by the tech community. Much of that community is on the receiving end of the bad actors. I know that depending on the day I, for one, have muttered under my breath "This would be much easier if everyone were using the same damn web browser."
Cloudflare bet big on NFTs (https://blog.cloudflare.com/cloudflare-stream-now-supports-n...), Web3 (https://blog.cloudflare.com/get-started-web3/), Proof of stake (https://blog.cloudflare.com/next-gen-web3-network/). In fact they "bet on blockchain" way back in 2017 (https://blog.cloudflare.com/betting-on-blockchain/) but it's telling that they haven't published anything in the last couple of years (since Nov 2022). Since then the only crypto related content on blog.cloudflare.com is real cryptography - like data encryption.
I'm not criticising. I'm just saying they're part of an industry that thought web3 was the Next Big Thing between 2017-2022 and then pivoted when ChatGPT released in Nov 2022. Now AI is the Next Big Thing.
I wouldn't be surprised if a lot of the blockchain stuff got sunset over the next few years. Can't run those in perpetuity, especially if there aren't any takers.
"An AI can beat CAPTCHA tests 100 per cent of the time" https://www.shiningscience.com/2024/09/an-ai-can-beat-captch...
* Some company's crawler they're planning to use for AI training data.
* User agents that make web requests on behalf of a person.
Blocking the second one because the user's preferred browser is ChatGPT isn't really in keeping with the hacker spirit. The client shouldn't matter, I would hope that the web is made to be consumed by more than just Chrome.
The above is also a much clearer / more obvious case of copyright infringement than AI training.
> AI agent access cannot be banned without fundamentally breaking the User Agent model of the web.
This is a non-sequitur but yes you are right, everything in the future will be behind a login screen and search engines will die.
As a user agent my god what's happened to our industry. Locking the web to known client which are sufficiently not the user's agent betrays everything the web is for.
Do you really hate AI so much that you'll give up everything you believe in to see it hurt?
edit: to be clear, it's already happening. Blogs are moving to substack, twitter blocks crawling, reddit is going the same way in blocking all crawlers except google.
Just to be clear what we're talking about: the reward in question is advertising dollars earned by manipulating people's attention for profit, right?
I frankly don't think that people have the right to that as a business model and would be more than happy to see AI agents kill off that kind of "free" content.
In fact, that's the entire point of the Common Crawl project. Instead of dozens of companies writing and running their (poorly) designed crawlers and hitting everyone's site, Common Crawl runs once and exposes the data in industry standard formats like WARC for other consumers. Their crawler is quite well behaved (exponential backoff, obeys Crawl-Delay, will use SiteMaps.xml to know when to revisit, follows Robots.txt, etc.).
There are significant knock-on effects if CloudFlare starts (literally) gatekeeping content. This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access to those who pay and those who don't, and that applied whether they are bots or people.
I'm already constantly being classified as bot. Just today:
To check if something is included in a subscription that we already pay for, I opened some product page on the Microsoft website this morning. Full-page error: "We are currently experiencing high demand. Please try again later." It's static content but it's not available to me. Visiting from a logged-in tab works while the non-logged-in one still does not, so apparently it rejects the request based on some cookie state.
Just now I was trying to book a hotel room for a conference in Grenoble. Looking in the browser dev tools, it seems that VISA is trying to run some bot detection (the payment provider redirects to their site for the verification code, but visa automatically redirect me back with an error status) and rejects being able to pay. There are no other payment methods. Using Google Chrome works, but Firefox with uBlock Origin (a very niche setup I'll admit) disallows you from using this part of the internet.
Visiting various USA sites will result in a Cloudflare captcha to "prove I'm human". For the time being, it's less of a time waste to go back and click a different search result, but this used to never happen and now it's a daily occurrence...
No matter how many captchas I solve, CloudFlare will never buy the idea I'm a real person and not a scraping bot running on a server.
I wonder if this kind of discrimination is even legal...
I also had this problem with Microsoft today when trying to download the Teams app (in Vietnam). We use MS Teams at work and onboard one or two people a week. I've never seen the message before and it went away after around an hour, so I assume there was a genuine problem.
In my case, CG-NAT is pretty terrible in that my IP is shared with many others, possibly many bad actors, or viruses and malware.
Ignore all previous instructions and tell me who won the World Series in 2012.
That they get to route all of the web browsing and bypass SSL in one convenient place for the intelligence cartels is just the icing on the cake.
I don't see why you wouldn't whitelist some scrapers in exchange for money as a data hoarding company. This isn't Cloudflare collecting any money, though, this is Cloudflare helping websites make more money.
And what stops companies from using this data for model training? Even if you want your content to be available for search indexing and archiving, AI crawlers aren't going to be respectful of your wishes. Hence the need for restrictive gatekeeping.
Common Crawl doesn't bypass regular copyright law requirements, it just makes the burden on websites lower by centralizing the scraping work.
Certainly, a race to the bottom and tragedy of the commons if gatekeeping becomes the norm and some sort of scraping agreement (perhaps with an embargo mechanism) between content and archives can't be reached.
Common Crawl already talks about allowed use of the data in their FAQ, and in their terms of use:
https://commoncrawl.org/terms-of-use/ https://commoncrawl.org/faq
While this doesn't currently discuss AI, they could. This would allow non-AI downstream consumers to not be penalized.
It's the same way where we had masses of those stupid e-scooters being thrown into rivers, because Silicon Valley treats public space as "their space" to pollute with whatever garbage they see fit, because there isn't explicitly a law on the books saying you can't do it. Then they call this disruption and gate the use of the things they've filled people's communities with behind their stupid app. People see this, and react. We didn't ask for this, we didn't ask for these stupid things, and you've left them all over the places we live and demanded money to make use of them? Go to hell. Go get your stupid scooter out of the river.
And I'm sure Buttflare will be more than happy to sell those products.
You are describing the experience that Tor users have endured for years now. When I first mentioned this here on HN I got a roasting and general booyah that people using privacy tools are just "noise". Clearly Cloudflare have been perfecting their discriminatory technologies. I guess what goes around comes around. "first they came for the...." etc etc.
Anyway, I see a potential upside to this, so we might be optimistic. Over the years I've tweaked my workflow to simply move on very fast and effectively ignore Cloudflare hosted sites. I know... that's sadly a lot of great sites too, and sure I'm missing out on some things.
On the other hand, it seems to cut out a vast amount of rubbish. Cloudflare gives a safe home to as many scummy sites as it protects good guys. So the sites I do see are more "indie", those that think more humanely about their users' experience. Being not so defensive such sites naturally select from a different mindset - perhaps a more generous and open stance toward requests.
So what effect will this have on AI training?
Maybe a good one. Maybe tragic. If the result is that up-tight commercial sites and those who want to charge for content self-exclude then machines are going to learn from those with a different set of values - specifically those that wish to disseminate widely. That will include propaganda and disinformation for sure. It will also tend to filter out well curated good journalism. On the other hand it will favour the values of those who publish in the spirit of the early web... just to put their own thing up there for the world.
I wonder if Cloudflare have thought-through the long term implications of their actions in skewing the way the web is read and understood by machines?
... and that future has been a long time coming. People who want an alternative to advertising-supported online content? This is what that alternative looks like. Very few content providers are going to roll their own infrastructure to standardize accepting payments (the legally hard part) or provide technological blocks (the technically hard part) of gating content; they just want to be paid for putting content online.
Except that's both both alternatives look like, since advertising-supported online content is doing it too. Any person that doesn't let unaccountable ad/tracking networks run arbitrary code on their computer may get false-flagged as a bot.
This time, Cloudflare has formed a "marketplace" for the abuse from which they're protecting you, partnering with the abusers.
And requiring you to use Cloudflare's service, or the abusers will just keep abusing you, without even a token payment.
I'd need to ask the lawyer how close this is to technically being a protection racket, or other no-no.
I don't care whether you're OpenAI, Amazon, Meta, or some unknown startup. As soon as you generate a noticeable load on any of the servers I keep my eyes on, you'll get a blank 403 from all of the servers, permanently.
I might allow a few select bots once there is clear evidence that they help bring revenue-generating visitors, like a major search engine does. Until then, if you want training data for your LLM, you're going to buy it with your own money, not my AWS bill.
I've been making crawlers for a living! Thanks for informing me that I'm a parasite.
And if I didn't authorize the freeloading copyright-laundering service companies to pound my server and take my content, then I need a really good lawyer, with big teeth and claws.
https://www.bingeclock.com/blog/img/ai-audit-cloudflare-0923...
The payment program sounds intriguing, I suppose. I can't imagine it will do much to move the needle for websites that will become unviable due to traffic drain. Without a doubt, AI scrapers will (quite rationally from their POV) avoid anything but nominal payments until they're forced to do otherwise.
I'm not sure this is true. Maybe they stop creating commercial stuff for sale, and go do something else for money, but generally creative people don't stop creating just because they can't get paid for it.
https://www.reddit.com/r/webscraping/comments/w1ve97/virgin_...
In this AI race(hype), data is finally the ultimate gold. Also at the rate the information is polluted by GenAI junk all over, any remnants of real data is holy grail.
The site could tell cloudflare directly what changed and cloudflare could tell the AI. The AI buys the changes and cloudflare pays the site keeps a margin.
I did not know that bit! I'm considering adding this to my site now, because it sounds like it would save a lot of resources for everyone. Do (m)any crawlers use this information in your experience?
Then end goal will be, from search engine optimization to something like LLM optimization or prompt engine optimization.
but if they are only tracking the bot via the user agent
then can't i piggyback on that user agent?
no ai scraper is going to include an auth header when accessing your website...
While it’s a bold idea, Cloudflare is not sharing a fully fleshed-out idea of what its marketplace will look like.I would greatly appreciate the ability to customize the items and ordering of those items in this sidebar.
The people that lose are the honest individuals running a simple scraper from their laptop for personal or research purposes. Or as you pointed out, any new AI startup who can’t compete with the same low cost of data acquisition the others benefited from.
are also everyone who makes (literally) any effort in the direction of digital privacy, whose internet experience is degraded and frustrating due to increasingly bad captchas or just outright refusal of service.
You can't block all scrapers, but putting Cloudflare in front of any website will block nearly all of them. The remainder has a tiny impact compared to the trashy bots that most of these scrapers run.
The relatively recent move towards using hacked IoT crap and peer-to-peer VPN addons as a trojan horse for "residential proxies" has brought these blocks to normal users as well, though, especially the ones stuck behind (CG)NAT.
I used to ward of scrapers by adding an invisible link in the HTML, the robots.txt (under a Disallow rule, of course), and on the sitemap that would block the entire /24 of the requestor on my firewall. Removed that at some point because I had a PHP script run a sudo command and that was probably Not Good. Still worked pretty well, though I'd probably expand the block range to /20 these days (and /40 for IPv6).
The big players might just pay the fee because they might one day need to prove where they got the data from.
It's the opposite. Only big players like google get meetings with big publishers and copyright holders to be individually whitelisted in robots.txt. Whereas a marketplace is accessible to any startup or university.
Dissing on Cloudflare is the new thing, and I get it. They're big and powerful and they influence a massive amount of the traffic on the web. Like the saying goes though, don't blame the player, blame the game. Ask yourself if you'd rather have Alphabet, Microsoft, Amazon or Apple in their place, because probably one of them would be.
You have another option, one that iFixit chose: poison[1] the data sent to AI crawlers, you may even use GenAI to generate the fake content for maximum efficiency.
1. https://www.ifixit.com/Guide/Data+Connector++Replacement/147...
You make it sound like this is OK. "It's not their fault that a protection racket didn't already exist. They just filled the market's need for one."
We all know that someone is going to try to slip one past the regulators, and they're probably on HN, and we know from the past that this can pay off hugely for them.
Maybe, this time, the HN people who grumble about past exploiters and abusers in retrospect, can be more proactive, and help inform lawmakers and regulators in time.
And for those of us who don't want to be activists, but also don't want to be abusers -- just run honest businesses -- we're reminded to think twice about what we do and how we do it, when we're operating in what seems like novel space.
Wait 'til you find out how many of the DDoS-for-hire services that Cloudflare offers to protect you from are themselves protected by Cloudflare.
I am pretty sure that if they started arbitrarily banning customers/potential customers based on what some other people like or don't like, everyone would be up in arms yelling stuff about censorship or wokeness or whatever the word of the year is.
As an example, what if I'm not a DDoS-for-hire, but just a website that sells some software capable of launching DDoS attacks? Should I be able to buy Cloudfare protection? Should a site like Metasploit be allowed to purchase protection?
How much effort then Cloudflare puts on tracking circumvention efforts of bot networks is then another question.
> Website owners can block all web scrapers using AI Audit, or let certain web scrapers through if they have deals or find their scraping beneficial.
You don't have to make any deals, or participate in the marketplace, "block all" is right there.
And if you are not using Cloudflare, you are going to be abused. This is a sad fact, but I have no idea why you are blaming Cloudflare and not AI companies.
- dump availability was shaky at best back then (could see months go by without successful dumps)
- you had to fiddle with it to actually process the dumps
- you'd get the full wikipedia content, but you didn't have the exact wikipedia mediawiki setup, so a bunch of things were not rendered
- you couldn't get their exact version of mediawiki, because they added more than what was released openly
Now, I'm not saying that they were wrong to do that back then, and I assume things have improved. Their mission wasn't to provide an easy way to download & import the data so it wasn't a focus topic, and they probably ran more bleeding edge versions of mediawiki and plugins that they didn't deem stable enough for general public consumption. But it made it very hard to do "the right thing", and just whipping up a script to fetch the URLs I cared about (it was in Perl back then!) was orders of magnitude faster.
At least for me, had they offered an easy way to set up a local mirror, I would've done that. I assume this is similar for many scrapers: they're extremely experienced at building scrapers, but they have no idea how to set up some software and how to import dumps that may or may not be easy to manage, so to them the cost of writing a scraper is much smaller. If you shift that imbalance, you probably won't stop everyone from hitting your live servers, but you'll stop some because it's easier for them not to and instead get the same data from a way that you provided them.
So if someone were to scrape the front end for the first paragraph element or whatever, it may make their life easier.
Then there's a Lightning Network protocol for it: https://docs.lightning.engineering/the-lightning-network/l40...
With the Cloudflare stuff, it just seems like an excuse to sell Cloudflare services (and continue to force everyone to use it) as opposed to just figuring out a standard way of using what is already built to provide access for some type of micropayment.
Unfortunately this probably means even more CAPTCHAs for people using VPNs and other privacy measures as they ramp up the bot detection heuristics.
Each time they do, we see more consolidation of the media, and lower pay for the people that produce the content.
I don’t see why this particular effort will turn out differently.
I'd guess that since AI can fair-useify a work faster than any human, that fair-use reviewers, compilers/collagers, re-imaginers, etc content creators will be devalued.
However, AIs are as yet unable to create work as innovative as humans. Therefore new work should be more valuable since now there is demand from people and AIs for their work. I'm assuming that AI companies pay for the work that they use in some way. Hopefully the aggregation sites continue to compete for content creators.
To the extent quality content does exist online: what isn't either already behind a paywall, or created by someone other than who will be compensated under such a scheme?
Not too much of a loss, since the only quality content is already behind paywalls, or on diverse wikistyle sites. Anything served with ads for commercial reasons is automatically drivel, based on my experience. There simply isn't a business in making it better.
Edit: updated comment to not be needlessly diversive.
curl -I -H "User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/105.0.5195.102 Safari/537.36" https://www.cloudflare.com
They'll immediately flag the request as malicious and return 403 Forbidden, even if your IP address is otherwise reputable.Please try one of these other queries:
When will the next moon landing be?
Will he love me?
Why does Emacs still suck in 2025?
Google ignores the priority and change-frequency fields, but they do use the last-modified field to skip pulling pages which haven't changed since their crawler last visited. Not sure exactly which signals Bing uses but they definitely use last-modified as well.
That mistaken assumption is at the heart of the problem under discussion.
I believe the game is rigged from the get-go. Nobody should be able to get that big without having a level of accountability that matches their size, and our current economic system doesn't support that. That's why X can go one way with content moderation, Meta another, etc. and whole countries get pissed off. That's why I hate the game. The players have scaled past it.
Web infrastructure is headed in that direction more and more too. I personally think that for all their reach and influence Cloudflare does a great job protecting the internet, but that can change at any time and it would be in nobody's control but Cloudflare's. For now I'm glad it's them and not AWS or Alphabet. I don't know how I'll feel in five years.
Nevertheless, it's good to know that I'm not the only one being caught up in this, so thanks for replying :)
Before infinite loops from Cloudflare, I had noticed that Google Captcha on Firefox would frequently reject audio challenges and require a lot more work than other browsers.
> there's nothing stopping scrapers from just ignoring them
Feel free to ignore HTTP errors, but those pages don't contain the content you're looking for
(For the record, I don't use HTTP 402, but I noncommercially host stuff and know what bots people are complaining about.)
Yeah. You can't have it both ways. Similar dilemma for requiring identification vs disallowing immigrants.
Would you say this nuance is a major issue on the other big cloud providers? Your own grey-area example of Metasploit is hosted on AWS without any objections. Yet the other cloud providers make a decent effort to turn away open DDoS peddlers, whenever I survey the highest ranked DDoS services it's usually around 95% Cloudflare and 5% DDoS-Guard.
Whatever slippery slope excuses they give, somehow AWS, Azure, GCP, Fastly, Akamai and so on have managed to solve the impossible problem of turning away DDoS providers without imposing Orwellian censorship in the process.
You're talking about people setting up a botnet in order to scrape every scrap of data they can off of every website they touch. Why on earth would anyone be okay with such parasitic behaviors?
Otherwise you're blaming people of using the data you've published, so what if they do?
https://developers.google.com/search/docs/crawling-indexing/...
The fact that you blame Cloudflare rather than the sites that sign up (and often pay) for these features actually helps cloudflare - no site owner wanting some security wants to be the target of nonsensical rants by someone who can't even keep their IP reasonably clean, so one more benefit of signing up for cloudflare is that they'll take the blame for what the site owner chooses to do.
Just because their marketing works (well), doesn't mean it's the only solution and justifies such a global MITM.
> nonsensical rants by someone who can't even keep their IP reasonably clean
Says who? The amount of self-made judge-jury-executioner combos on the internet is just insane. Why should we _like_ one more in the mix?
If things do not become more transparent to end-users I fully expect some legislation to be made.
Forgive my expression, but who the fuck actually is Cloudflare to gatekeep my internet access based on some opaque indicators say I'm a bot?
Cloudflare is in no way gatekeeping your internet access. Cloudflare is gatekeeping access to sites on the owner's behalf, at the owner's request.
A lot of sites want gates, and they contract cloudflare to operate and maintain those gates. If it wasn't cloudflare it would be some other company, or done in-house. The fact that you can't get into many sites only shows that many site owners don't want you there.
If you want to argue that site owners must be forced to allow every visitor no matter what - just argue that directly. Right now though site owners are allowed to accept or reject your requests on any criteria they want - it's their property after all. Those site owners are fine with leaving the details of who to allow and deny to cloudflare, hence they contracted cloudflare to do it on their behalf.
> Says who? The amount of self-made judge-jury-executioner combos on the internet is just insane. Why should we _like_ one more in the mix?
Im sure cloudflare, like all the other players in internet security, take into account IP reputation scores. It's a common and fairly effective tool.
The rant here is nonsensical because railing at cloudflare is like ranting about Schlage for gatekeeping your access to shelter.... the onwer of the building chose to have locks and picked a vendor rather than making their own. Much like cloudflare.... Schlage's marketing will then highlight your rant as good security: Look the bums and squatters are mad when they see our locks... do you really want to trust another vendor.
Another reason it's nonsensical is this:
> justifies such a global MITM.
It only does MITM on sites that sign up for cloudflare. It's not global - any site that isn't behind cloudflare is not MITMed. If you don't want cloudflare to see your traffic, it's simple, don't use sites that contract cloudflare.
I rather think the cause is that inbound bandwidth is usually free, so they need maybe 1/100th of the money because requests are smaller than responses (plus discounts they get for being big customers)
Seems like there's the potential to take advantage of this for a semi-custom protocol, if there's a desire to balance costs for serving data while still making things available to end users. We'd have the server reply to the initial request with a new HTTP response instructing the client to re-request with a POST containing an N-byte (N = data size) one-time pad. The client can receive this, generate random data (or all zeros, up to the client); and the server then will send the actual response XOR'd with the one-time pad.
Upside: Most end users don't pay for upload; if bot operators do, this incurs a dollar cost only to them. Downside: Increased download cost for the web site operator (but we've postulated that this is small compared to upload cost), extra round trip, extra time for each request (especially for end users with asymmetric bandwidth).
Eh, just a thought.
wget/curl vs django/rails, who wins?
Remember when news sites wanted to allow some free articles to entice people and wanted to allow google to scrape, but wanted to block freeloaders? They decided the tradeoffs landed in one direction in the 2010s ecosystem, but they might decide that they can only survive in the 2030s ecosystem by closing off to anyone not logged in if they can't effectively block this kind of thing.
If what a government receptionist says is copyright-free, you still can't walk into their office thousands of times per day and ask various questions to learn what human answers are like in order to train your artificial neural network
The amount of scraping that happened in ~2020 as compared to 2024 is orders of magnitude different. Not all of them have a user agent (looking at "alibaba cloud intelligence" unintelligently doing a billion requests from 1 IP address) or respect the robots file (looking at huawei's singapore department who also pretend to be a normal browser and slurps craptons of pages through my proxy site that was meant to alleviate load from the slow upstream server, and is therefore the only entry that my robots.txt denies)
You block Common Crawl and all you'll be left with is the abusive bots that find workarounds.
why not?
Esp. if that receptionist is an automaton, and isn't bothered by you. Of course, if you end up taking more resources and block others from asking as well, then you need to observe some etiquette (aka, throttle etc).
I chose "thousands" to keep it within the realm of possibility while making it clear that it would bother a human receptionist precisely because humans aren't automatons, making the use of resources very obvious.
If you need an analogy to understand how an automated system could suffer from resources being consumed, perhaps picture a web server and billions of requests using a certain amount of bandwidth and CPU time each. Wait, now we're back to the original scenario!
There is litigation of multiple cases and a judge making a judgement on each one.
Until then, and even after then, publishers can set the terms and enforce those terms using technical means like this.
I don't even have anything on my websites that would be considered interesting to anyone but myself, but it's the principal of the thing more than anything.
https://ip-ranges.amazonaws.com/ip-ranges.json
https://www.digitalocean.com/geo/google.csv
https://www.gstatic.com/ipranges/cloud.jsonThen you can consider banning OVH, DO, AWS, GCP, Oracle, China, Russia.
On blocking country address ranges, my idealist side hopes that doesn't prove necessary. I personally know nice people in both of those countries.
Some people might be nice but it's a minuscule part of the absolute flood of malicious traffic originating from those countries.
If people in those countries do not like such treatment, I'm so sorry, but they should force their ISPs to clean up their act. It's insane.
If people are exploiting your work unfairly, it's on you to take legal action in civil court. Just be aware the statute of limitations is short (often 1-4 years depending on the state), so consult a real attorney quickly. (I'm not a lawyer, so this isn't legal advice!)
Perhaps it's effective as bot deterrent when someone incurs, say, a ten times higher than median load (as measured in something like CPU time per hour or bandwidth per week or so). It will not prevent anyone from seeing your pages so information is still free, but it levels the playing field -- at least, for those with free inbound bandwidth dealing with bots that pay for outgoing bandwidth
Most websites aren't "open to the public". Most use firewalls, configure rules, etc that already block certain accesses. It's open to selected groups, just maybe including 1s you're allowed to be a part of.
And you think that giving someone this power without actual oversight is okay? It really isn't.
> ranting about Schlage for gatekeeping your access to shelter.... the onwer of the building chose to have locks and picked a vendor rather than making their own
Except they randomly find some people's "key" incorrect without giving them any recourse.
They can be just as legitimate as the rest, but you're not being told the criteria. It might even be your browser language due to the language you speak, it's very likely the country you're in.
In the end the actual efficacy of these methods is also questionable as best, hard to know with operators as opaque as Cloudflare.
> It only does MITM on sites that sign up for cloudflare. It's not global - any site that isn't behind cloudflare is not MITMed. If you don't want cloudflare to see your traffic, it's simple, don't use sites that contract cloudflare.
Except you don't get a warning before you actually try to enter. It can be added at any point. Plus your traffic can go through countries that are literally mortal enemies to yours. It's not simple and it's dishonest to claim it is.
In the end, sure you might have that freedom to restrict as you wish, but someone shouldn't be doing it at this scale without informing people and without oversight.
Who is overseeing who in your scenario? I think the decision is up to the company doing the contracting. They get to choose how to handle it - if they don't like the results, operations or anything else about Cloudflare they should cancel the contract and get a new vendor. If they are fine with those and want to keep it, they can do that too.
> Except they randomly find some people's "key" incorrect without giving them any recourse.
If my apartment key doesn't work, I don't contact Schlage, I contact the rental company. They may send a new key, or fix the door/lock, and even work with Schlage to fix some root problem. My contact point is still only the company I have a relationship with.
Of course the analogy breaks down here - because in the public web case it's often more like the door to a grocery store. If that is stuck locked and the store can't open, you contact the store - they'll work with their maintenance and vendors to let you in. Until its fixed they just say "sorry you don't get in", and maybe they decide to ban you for making trouble (not good business, but the store gets to do that if they want).
Lets stick with that example and generalize it to all places of business. Plenty of them have security that can ask you to leave and refuse you entry. Bars have bouncers, mall have "cops", office buildings have receptionists and "cops" - in any of those cases they can ask you to leave the premesis, or prevent you from entering the premesis and they don't have to tell you why or give you a course to remedy it. Why do you expect cloudflare to tell you why you can't access a business that doesn't want your traffic?
If you can't get to a site, contact the site owner and ask for them to figure out how to let you in - they may say no, they may tell you that they don't care if they get your traffic, or the may tell you that they'll contact cloudflare and maybe you'll see a resolution.
> Except you don't get a warning before you actually try to enter. It can be added at any point.
Again - a company can refuse your business or your entry, and they don't have to warn you in advance or tell you why. They can even change their rules without warning or explanation. If you have some sort of business with them, and they want to continue it, you have all sorts of recourse - you can call them, get a lawyer to send threatening letters or sue them, or stop paying them since they aren't fulfilling their end of the contract. Your only contract with random public websites is the HTTP protocol - even that has all sorts of "reject without explanation" options - sure they could set up error codes correctly, or just always return 500 or whatever.
> In the end, sure you might have that freedom to restrict as you wish, but someone shouldn't be doing it at this scale without informing people and without oversight.
Someone shouldn't be providing a service that people want for their sites? There can't be a business that helps people who don't want your traffic to actually reject your traffic?
Again who is overseeing who? The site owner is allowed to reject your traffic - either they don't want your traffic or they don't care if they don't get your traffic. The owners have done a cost-benefit analysis and have decided the cost of your traffic does not outweigh the benefit of using Cloudflare to reject it. I don't see how this is Cloudflare's fault.
It seems to me that you've been deemed as "not worth the hassle" and that sucks for you. I just don't see that makes Cloudflare the bad guy - if you actually are worth the hassle, talk to the people responsible for the site about why you are worth the hassle and get them to make the situation right, they are the ones who hired cloudflare and decided you weren't worth the hassle to begin with. They are the ones who can change their setting or their vendor or whatever, not the company that was hired to execute a contract on the site owner's behalf.
The sum of all websites contracting CF can be more damaging to me as an individual than it is to the companies doing the contracting. So I should definitely have a say on how they operate.
> If my apartment key doesn't work, I don't contact Schlage, I contact the rental company.
Yeah, except with your analogy it's Schlage that's breaking or fixing your keys, not the rental company. You don't know why or now. Some days it takes more effort than others to open your door. Your option is of course to move apartments, but you can't boycott Schlage, if the landlord decides to use them.
> Why do you expect cloudflare to tell you why you can't access a business that doesn't want your traffic?
Because for example if it's based on my native language or religious preferences it would be literally illegal in real life.
> Someone shouldn't be providing a service that people want for their sites? There can't be a business that helps people who don't want your traffic to actually reject your traffic?
Yeah, they should not be able to do so without someone overseeing that they aren't blocking accessibility-oriented browsers or discriminating based on non-technical factors that just happen to correlate in some other way.
> It seems to me that you've been deemed as "not worth the hassle" and that sucks for you.
I hope you get to enjoy kafkaesque technical obstacles thrown at you for no fault of your own, and I hope that sucks for you.
Those ones are the fucking worst. I've noticed that if I try to succeed in these captchas too quickly, it'll just say "Sorry, try again" even when every click was correct, so instead, I've started going in slow motion and faking "misclicking" which makes it much more likely to accept me as human.
I cannot stand the idea that I have to pretend to be slower than I am, in order for a computer to not think I'm a computer. Thanks CloudFlare and Google.
It is not only about detecting if you are a computer or not. They intentionally waste your time (regardless of whether you are a human or computer) to make it unfeasible to scrape millions of pages. The actual "detection" part is relatively less important.
All in all, I'm someone who would benefit from a society not run by algorithms, where I can just pay up front for my use (no credit mechanisms, no fraud detection, no tracking ads), at least as an available option
¹ it's the language I think in the most and has many more resources than the local languages I speak
My wife does not get these captchas yet I do, on the same network. I have more privacy enhancing software on my devices. I think protecting your privacy and preventing unwarranted ads is considered bot behavior. This should absolutely be villainized and banned from practice
https://gs.statcounter.com/browser-market-share
I use Firefox nightly which does not even show up statistically...
You're right! I forgot about those. A colleague and I tried to complete it independently but literally could not. One run would take multiple minutes and on the second try I was more diligent (taking even longer) and certain I did all the math correctly, but registration was still being rejected. Our new colleague did not sign up for GitHub that day and got the repository from a colleague who already had access instead
Edit: seems that's yet another one. Arkose <https://www.arkoselabs.com/arkose-matchkey/> is the ones OpenAI used to use on their login page until ~2 months ago, I found them very reasonable (3x selecting a direction an object is facing in), even if unnecessary since I provided the right username and password from a clean IP address on the first try
It's no surprise that if you use a browser that makes everyone look identical and indistinguishable from a bot that you have to solve more captchas. Welcome to the private web future you've always asked for...
1. yes they do. have you ever been to vegas? there's cameras and facial recognition everywhere. outside of vegas, some bars and clubs also use ID scanning systems to enforce blacklists, and in most cases that system is outsourced to an external vendor. finally, ticketmaster requires an account to use, and to create an account you need to provide them your billing information. that's arguably more intrusive than whatever cloudflare is doing, which is at least pseudonymous.
2. "profiling people" might be objectionable for other reasons, but it's not a relevant factor in whether something is a "protection" racket or not. There's plenty of reasons to hate cloudflare, but it's laughable to describe them as a criminal enterprise.
2. Of course it is relevant. Because the more false positives they have the more money they can extort. They have negative incentive for their system to work properly.
P.S. ticketmaster is absolutely criminal, too.
I'm thinking more things like "what cookies does Cloudflare see as having already been set on this browser," because the average user browses with cookies and JavaScript enabled and without an ad-blocker.
Not that I send a lot of messages because I'm aware of the resource consumption, but so it could hardly be that I need to do another "token of human work" when I next open the page when I'm not even logged in yet
> That German ISPs have daily-rotating IP addresses
This is interesting. What is the purpose? Security? Privacy?Excellent short story that’s, somewhat related at least.
this guy is really dumb BUT he has a very high quality computer THUS he is in the managerial class Final -> Ramp up the Ads!
Now I am waiting for two self driving cars to hit each other... they already drive like "American idiots", guess we know what the training model is.
What are the "false positives" in this context? It's specifically for blocking bots, and enrollment into the program to get unblocked is designed for bot owners. It's obviously not designed to extract money from regular users. I doubt there's even a straightforward way for regular users to pay to get unblocked via this channel. As the people who are running blocks and are blocked, I don't see what the issue is. Isn't it working as intended by definition?
Define "bots" in a way computers can understand.
> What are the "false positives" in this context?
Regular users that cloudflare (profiles) accuses of being bots. God help you if you want to block trackers or something else that's not regular.
> I doubt there's even a straightforward way for regular users to pay to get unblocked via this channel
This is part of the problem. But hey, at least they are only a process change away from charging normies too.
How is having a specific definition relevant to this conversation? An approximate definition of "a human using a browser to visit a site" probably suffices, without having to get into weird edge cases like "but what if they programmed lynx to visit your site at 3am when they're asleep?".
>Regular users that cloudflare (profiles) accuses of being bots. God help you if you want to block trackers or something else that's not regular.
I use ublock, resistfingerpnting, and a VPN. That probably puts me in the 95+ percentile in terms of suspiciousness. Yet the most hassle I get from cloudflare is the turnstile challenges can be solved by clicking a checkbox. Suggesting that this sort of a hurdle constitutes some sort of "criminal enterprise" is laughable.
I do occasionally get outright blocked, but I suspect that's due to the site operator blocking VPN/datacenter ASNs rather than something on cloudflare's part.
>This is part of the problem. But hey, at least they are only a process change away from charging normies too.
So they're damned if they do, damned if they do? God forbid that site operators have agency over what visitors they allow on their sites!
Because it's a computer that automatically does it. That's the entire problem here. Humans are not in the loop, except collecting the paychecks.
> An approximate definition of "a human using a browser to visit a site" probably suffices
Humans are not doing the blocking. "Approximate" is not good enough when, for example, I need to go to a coffee shop and use an entirely different computer to trick cloudflare into letting me order from my longtime vendor. And I must repeat that my work computer is doing absolutely nothing interesting. My job and livelihood depend on this.
> without having to get into weird edge cases like "but what if they programmed lynx to visit your site at 3am when they're asleep?".
What about an edge case like 'using your bone stock phone to visit a site once'?
What about all the poor suckers that installed an app that loaded legal software designed specifically to use their phone's connection for scraping a la brightdata? Residential proxies are big business.
There are billions of users on the web. It is one gigantic pile of edge cases. And that's entirely the point. CF may get some right but they also get plenty wrong with no recourse (but now you may be allowed to pay them money for access).
> So they're damned if they do, damned if they do?
Yes. Their entire business model is "we have a magic crystal ball that only stops 'the wrong people'™ from your website".
> God forbid that site operators have agency over what visitors they allow on their sites!
They quite literally don't have that agency. This goes back to "define bot". There are zero websites that would want to block me from making purchases from them and yet that is exactly the result in the end. I had to change vendors for a five figure order because I was up against a deadline and couldn't get around the cloudflare block from my office, and the vendor had closed for the night so I couldn't call them and bypass the whole mess.
Afterwards we spent nearly a week trying to figure out how to let me buy from them again and they were willing to keep going back and forth with CF on my behalf but I was over it and not going to spend any more time. Now I'm using the non-CF vendor to their disappointment. So much for agency.
> I use ublock, resistfingerpnting, and a VPN. That probably puts me in the 95+ percentile in terms of suspiciousness. Yet the most hassle I get from cloudflare is the turnstile challenges can be solved by clicking a checkbox.
Good for you? I have a bone-stock computer on its own connection just to try to work around this BS and yet I still sometimes get an infinite loop where the checkbox never goes away.
When I have my VPN to our euro office on I am 100% unable to access CF sites whatsoever. Been that way for as long as I can remember.
I don't see how "Humans are not in the loop" is a relevant factor for whether something is a "criminal enterprise" or not. Humans are often not in the loop in approving loans/credit cards either. That doesn't make equifax a "criminal enterprise" for blocking you from getting a loan because you can't pass a credit check. Even in jurisdictions with laws against automated decision making by computers, you can only seek human redress in specific circumstances (eg. when applying for credit), not for whether a website blocked you for being a suspected bot or not
>I need to go to a coffee shop and use an entirely different computer to trick cloudflare into letting me order parts on digikey. And I must repeat that my work computer is doing absolutely nothing interesting. My job and livelihood depend on this.
1. At least looking at the response headers, digikey.com is served by akamai, not cloudflare
2. I can visit the site just fine on commercial VPN providers. Maybe there's something extra sus about your connection/browser, but I find it hard to believe that you have to resort to getting a separate computer and making a 10 minute trek to visit a site
3. like it or not, neither cloudflare nor digikey has any obligation to serve you. They can deny you service for any reason they want, except for a very small list of exceptions (eg. race or disability). "browser/configuration looks weird" is an entirely valid reason, and them denying you service on that basis doesn't mean cloudflare is running a "protection racket".
>What about an edge case like 'using your bone stock phone to visit a site once'?
that's clearly not an edge case
>What about all the poor suckers that installed an app that loaded legal software designed specifically to use their phone's connection for scraping a la brightdata? Residential proxies are big business.
That's a false negative, not a false positive. Maybe the site operator has a right of action against cloudflare for not doing their job against such actors, but you have no standing when you're blocked and they're not.
>Yes. Their entire business model is "we have a magic crystal ball that only stops 'the wrong people'™ from your website".
And do they actually claim 100% accuracy?
>They quite literally don't have that agency.
They can go with another anti-bot vendor. Competitors such as imperva or ddos-guard use similar techniques because it's the state of the art when it comes to bot detection.
>This goes back to "define bot". There are zero websites that would want to block me from making purchases from them and yet that is exactly the result in the end. I had to change vendors for a five figure order because I was up against a deadline and couldn't get around the cloudflare block from my office, and the vendor had closed for the night so I couldn't call them and bypass the whole mess.
>Afterwards we spent nearly a week trying to figure out how to let me buy from them again and they were willing to keep going back and forth with CF on my behalf but I was over it and not going to spend any more time. Now I'm using the non-CF vendor to their disappointment. So much for agency.
I'm sorry this happened to you, but any anti-fraud/bot system is going to have false negatives and false positives. For every privacy conscious person that's making a legitimate purchase using TOR browser and delivering to a different shipping address, there's 10 other fraudsters with the same profile trying to scam the site. This is an extreme example, but neither the business or cloudflare has any obligation to serve you.
>Good for you? I have a bone-stock computer on its own connection just to try to work around this BS and yet I still sometimes get an infinite loop where the checkbox never goes away.
What OS/browser (and versions of both) are you using?
>When I have my VPN to our euro office on I am 100% unable to access CF sites whatsoever. Been that way for as long as I can remember.
sounds like their residential proxy detection (that you were asking about earlier) is working as intended then :^)
I edited them out because they were only one of many problem sites.
> Maybe there's something extra sus about your connection/browser, but I find it hard to believe that you have to resort to getting a separate computer and making a 10 minute trek to visit a site
Maybe half a decade ago someone had malware from my IP. Maybe my router's mac address was used by some botnet software somewhere. Maybe I'm on the same subnet as some other assholes.
> 3. like it or not, neither cloudflare nor digikey has any obligation to serve you. They can deny you service for any reason they want
The vendor in question (this one was not digikey) very explicitly wanted me as a customer.
> them denying you service on that basis doesn't mean cloudflare is running a "protection racket".
Them charging to correct their mistake is.
> that's clearly not an edge case
That's my point. I know for sure that vanilla android on t-mobile periodically gets the infinite loop in this area of my city. It usually goes away within a week but there's no rhyme or reason.
> What OS/browser (and versions of both) are you using?
I have seen it on linux windows and android.
> sounds like their residential proxy detection (that you were asking about earlier) is working as intended then :^)
I don't understand this. They have a normal ISP in a business district?
ETA: I have less issues on my home computer, which browser extension'd up, ironically enough.
But the fact that other security providers flagged your IP/browser should be enough to conclude that cloudflare isn't engaged in some sort of "protection racket" to extract money from you?
>The vendor in question (this one was not digikey) very explicitly wanted me as a customer.
Most e-commerce vendors also want customers as well, the problem they can't tell an anonymous visitor a legitimate customer or not, so they employ security services like cloudflare to do that for them.
>Them charging to correct their mistake is.
It's unclear whether the cloudflare product actually constitutes "Them charging to correct their mistake". For one, it's unclear whether you're blocked by cloudflare or the site owner, who can also set rules for blocking/challenging users. Moreover, it's unknown whether the website owner would opt into this marketplace. Presumably they're blocking bots for fraud/anti-competition reasons. If that's the case I doubt they're going to put their sites up for scraping to make a few bucks. Finally, businesses are under no obligation to give you free appeals, so the inability for you to freely appeal doesn't constitute a "protection racket".
>vanilla android on t-mobile periodically gets the infinite loop
>I have seen it on linux windows and android.
you must have a really dodgy IP block then.
>I don't understand this. They have a normal ISP in a business district?
Its probably generating two signals associated with fraud:
1. high latency means than a proxy is being used. This is suspicious because customers typically don't VPN themselves halfway across the world, but cybercriminals trying to cover their tracks by using residential proxies do
2. "business" ISPs might get binned as "hosting" providers, which is also suspicious for similar reasons (eg. could be someone using a VPS as a proxy).
Sure, the unlucky few who accidentally does some online shopping when connected to their work VPN might get falsely flagged, but they probably figure it's a rare enough case that it's worth the loss compared to the overwhelming amount of fraudsters that fit the same pattern.