1. plenty of VPS with many IP addresses (this is easier with IPv6 subnet)
2. HTTP header rearranging
3. Fuzzing user-agent
4. Pseudo-PKBOE algorithm
5. office hours, break-time, lunch-time activity emulation
6. ????
7. profit
I am looking at you, SSH port bashers.
* Change your user-agent to a real user-agent, cycle it frequently.
* Done.
Put your email address in your User-Agent string so they can get in touch if needed.
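For what it's worth, a minimal sketch of what that could look like with Python's standard library. The bot name, URL and email address are placeholders I made up, not anything a particular site expects:

```python
import urllib.request


def build_request(url: str, bot_name: str, contact_email: str) -> urllib.request.Request:
    """Return a Request whose User-Agent names the bot and a contact address."""
    # The "name/version (+mailto:...)" layout is a common convention, not a standard.
    user_agent = f"{bot_name}/1.0 (+mailto:{contact_email})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical values, for illustration only. Nothing is sent over the network here.
req = build_request("https://example.com/rates", "rates-bot", "me@example.com")
print(req.get_header("User-agent"))
```

Note that urllib stores the header key as "User-agent", which is why the lookup uses that spelling.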
If this guy got to experience how systemically bad the credential stuffing problem is, he'd probably take down the whole repository.
None of these anti-bot providers give a shit about invading your privacy, tracking your every movement, or whatever other power fantasy can be imagined. Nobody pays those vendors $10m/year to frustrate web crawler enthusiasts, they do it to stop credential stuffing.
Here's my scenario: My electricity provider publishes the month's electricity rates on the first of the month, I want to scrape these so that I can update the prices in Home Assistant. This is a very simple task, and it's something that Home Assistant can do with a little configuration. Unfortunately this worked exactly once, after that it started serving up some JavaScript to check my browser.
The information I'm trying to get is public and can be accessed without any kind of authentication. I'm willing to bet that they flipped the anti-bot stuff on their load balancer on for the entire site instead of doing the extra work to only enable it for just electricitycompany.com/myaccount/ (where you do have to log in).
I also asked the company if they'd be willing/able to push the power rates out via the smart meters so that my interface box (Eagle-200) could pick it up, they said they have no plans to do so.
The next step is to scrape the web site for the provincial power regulator, which shows the power rates for each provider. Of course, the regulator's site has different issues (rounding, in particular), I haven't dug any further to see if I can make use of this.
All of this effort to get public information in an automated fashion.
Fortunately there are plenty of tools to handle this, and at a hobby level not particularly resource intensive. Something like this is simple and reliable in many cases: https://github.com/berstend/puppeteer-extra/tree/master/pack...
A product protecting against credential stuffing might as well prevent denial-of-service as well.
- committing clickfraud to game ad and referral revenue systems
- posting fake or spam reviews and comments
- generating fake behavioral signals to help bypass CAPTCHAs to help create accounts on other sites that can post spam comments
- validating stolen credit card details
- screwing with your metrics collection if you can't identify them as bots
All of that is enough reason for sites to use bot detection and blocking technology. The fact that the same tech also has some utility against accidental or malicious traffic-based DoS is also a bonus.
To be clear, "validating" is an industry euphemism for stealing, just for a different purpose. How do you validate the card is live? Run a real transaction through it and mark it based on the result. But what do you run for this real transaction? Well, whatever you want. Typically it'll be something to avoid suspicion as much as possible, but the thief gets to pick what they test it with, so why not pick something that they'll personally benefit from? There are so many online games with purchasable currency these days, it's hard to choose.
And yet if you dare browse the web with TOR or a VPN or sometimes just happen to be on a small ISP[0][1] then you're being punished immediately. You solve your cloudflare-supplied captcha because you may be a bot (you're not, and the dangerous bots will not be defeated by this anyway, but some humans will be), and then you get an error from the website itself because it runs a secondary bot detection thing. And you weren't even anywhere near anything "dangerous" like a checkout page.
[0] My parents use a regional small ISP (but locally very popular) that serves around 50K customers. My parents also use a regional bank (a Volksbank, and those are members of a national association that provides all kinds of services). Suddenly that bank would not even let them see the bank's front page. After some back and forth on the phone support line it turned out the bank had recently deployed some "advanced" bot detection, one that had a whitelist of residential-ISP-associated AS/IP ranges, and of course whoever compiled and maintained that list had forgotten to include that small local ISP. For that regional bank it meant they had just shut out a very significant number of their customers (and potential customers just trying to look up what the bank offers), as there was very likely a huge overlap of people using that regional ISP and that regional bank (both are regional, after all). It also was something they couldn't fix themselves, as the "online banking" stuff was not in-house but was run by the national association (which probably used some bot-detection-as-a-service provider). It took the bank (or rather, the national association) a few weeks to fix. Mind you, for the last few years that bank has been heavily marketing a cheaper "online only" account: only online banking, online support, and access to the self-serve ATMs and banking terminals, but no face-to-face or even ear-to-ear human interaction. Try contacting "online support" about "the website outright refuses me" when the "online support" is only available on that website. Kudos if you're smart enough to switch to their mobile app, as your phone uses a different ISP, unless you forget to turn off wifi. That's the advice my parents got from the phone support (they still have an account type that is not online-only).
[1] When I recently visited my parents, from the wifi [same ISP as in 0] I couldn't open the website of a bakery to look up if and when they would be open on a Sunday. Some error message about "this website is not available in your network" (English text, for a German bakery... suspicious :P). I could open it via my mobile, tho. I could open it from my regular ISP when back home (in another city) again. Mind you, that website is purely informational and has no "interactive" features, let alone lets you buy anything. It's just static text and some pictures.
Stuff like this is a pain beyond pain. I really hope that the clients you mention know that they piss off a proportion of their users with every move they take.
Nobody wants to spend time trying to stop these bots. It is, however, a very necessary thing to do.
I’ve noticed most sites won’t let you search business fares efficiently, so I made my own tool for Google Flights, which only worked for about 6 months until they added a bunch of changes that made it nearly impossible to scrape.
Well, because protection is not a binary thing - either being 100% safe or 100% not working - instead it's a proportion between the skill/effort/time needed to break in, and the reward you get for it.
To stop the majority of attacks you don't have to be absolutely unbreakable; you just need to make it hard enough that, for the majority of attackers, it doesn't pay off compared to the value of the data you're protecting. And that's where anti-bot SW has its place: it slows down spiders and global attacks, forcing custom-tailored scraping that has to be constantly fine-tuned plus infrastructure to hide your IPs, and that makes the operation way more expensive and harder to run continuously...
Now the best part... one division (big team) of our company worked for the (national carrier) airline, one division of our company worked for the resellers (we had a single grad allocated to web scraping). The airline threw ridiculous dollars at trying to stop it, and we just used a caffeine fueled nerd to keep it running. It wasn't all fun though, they'd often release their new anti scraping stuff on a Friday afternoon. They were less than impressed when they learnt who the 'enemy' was. Good times!
I disagree. Obviously there is no way to 100% stop scraping, but for a rather small amount of $ you can implement some measures that make it harder. Services like https://focsec.com/ offer ways to detect web scrapers using proxy/VPNs (one of the most common techniques) for little money.
> Nobody pays those vendors $10m/year to frustrate web crawler enthusiasts, they do it to stop credential stuffing.
Keep in mind that they may be legally or contractually forced to do this. Think of Netflix who are investing heavily into their Anti-VPN capabilities, most likely because they have contracts with content publishers & studios that force them to do so.
There is a vast amount of profit available in doing just that (see e.g. GOOG and FB market cap). Even companies that truly have no intention of exploiting data collected as a side-effect of whatever product line they run nearly always end up going for that profit line eventually. The extra income is too much of a temptation for a company to pass up on purely moral grounds in the long term.
How is there not an equilibrium here that cuts off credential stuffers? I'd naively imagine the residential IP providers have some measure of bad actors they themselves use to determine if a client is worth it, and that someone getting all your IPs blacklisted would get dropped pretty quickly.
I don't know about $10m/year, but many sites block bots just because they don't want competitors to access publicly available data. Which is bullshit.
I know how bad this issue is, and I wouldn't take down this repository. Anti-bot software does not work, anyone who pays 10m per year to have it simply has too much money.
2FA, especially app based, has been proven to work really really well.
2FA is a good security feature but it does not help against web scraping. Credential stuffing and other 3rd party attacks? Yes, it _can_ help. But it does not always help. There's a phishing group that has seemingly specialised on getting people to click the green confirm button in their Duo app... ¯\_(ツ)_/¯
Check https://github.com/revalo/duo-bypass for a python script that can be used to automate Duo tokens... Has some code from me. There are similar scripts for all the other well known OTP Apps...
It sounds great but it is a completely ignorant thing to say.
The solution to the automation problem is to do what most companies do and have registered API integrations.
Also, in my experience, most websites that block your bot, block your bot because your bot is too aggressive, or because you are fetching some resource that is expensive that bots in general refuse to lay off. Bots with seconds between the requests rarely get blocked even by CDNs.
You use this software at your own risk. Some of them contain malware, just fyi
LOL why post LINKS to them then? Flat-out irresponsible...
> you build a tool to automate social media accounts to manage ads more efficiently
If by "manage" you mean "commit click fraud"
This kind of indirect scraping can be useful for getting almost all the information you want from sites like LinkedIn that do aggressive scraping detection.
The creator of that plugin does mention it is very much a cat and mouse game, just like most of the “scraping industry”
https://www.npmjs.com/package/puppeteer-extra-plugin-stealth
I run a no-code web scraper (https://simplescraper.io) and we test against these.
Having scraped millions of webpages, I find dynamic CSS selectors a bigger time sink than most anti-scraping tech encountered so far (if your goal is to extract structured data).
CouchSurfing blocked me after I manually searched for the number of active hosts in each country (191 searches), and posted the results on Facebook. Basically I questioned their claim that they have 15 million users - although that may be their total number of registered accounts, the real number of users is about 350k. They didn't like that I said that (on Facebook) so they banned my CouchSurfing account. They refused to give a reason, but it was a month after gathering the data, so I know that it was retaliation for publication.
LinkedIn blocked me 10 days ago, and I'm still trying to appeal to get my account back.
A colleague was leaving, and his manager asked me to ask people around the company to sign his leaving card. Rather than go to 197 people directly, I intentionally wanted to target those who could also help with the software language translation project (my actual work). So I read the list of names, cut it down to 70 "international" people, and started searching for their names on Google. Then I clicked on the first result, usually LinkedIn or Facebook.
The data was useful, and I was able to find willing volunteers for Malay, Russian, and Brazilian Portuguese!
After finding the languages from 55 colleagues over 2 hours, LinkedIn asked for an identity verification: upload a photo of my passport. No problem, I uploaded it. I also sent them a full explanation of what I was doing, why, how it was useful, and a proof of my Google search history.
But rather than reactivate my account, LinkedIn have permanently banned me, and will not explain why.
"We appreciate the time and effort behind your response to us. However, LinkedIn has reviewed your request to appeal the restriction placed on your account and will be maintaining our original decision. This means that access to the account will remain restricted.
We are not at liberty to share any details around investigations, or interpret the terms of service for you."
So when the CAPTCHA says "Are you a robot?", I'm really not sure. Like Pinocchio, "I'm a real boy!"
LinkedIn has to deal with a lot of scummy recruiters and scammers; I don't blame them for being very strict.
Why is it so difficult to just respect robots.txt? Maybe there's an idea for a browser plugin that determines if you can easily scrape the data or not. If not, then the website is blocked and then traffic will drop. I know this is a naive idea...
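Respecting robots.txt doesn't take much code, at least. A rough sketch with Python's built-in `urllib.robotparser`; the robots.txt content, bot name and URLs below are invented for the example, and a real script would fetch the file from the site instead of hard-coding it:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, made up for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /myaccount/
Crawl-delay: 10
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)
# May the bot fetch the public page? And the account area?
print(parser.can_fetch("rates-bot", "https://example.com/rates"))
print(parser.can_fetch("rates-bot", "https://example.com/myaccount/"))
# How long the site asks bots to wait between requests (None if unspecified).
print(parser.crawl_delay("rates-bot"))
```

A script would check `can_fetch` before each request and sleep for at least the crawl delay between requests.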
I'm currently looking for ways to get real estate listings in a particular area and apparently the only real solution is to scrape the few big online listing sites.
That’s one of the bigger ones. Unfortunately recent events mean scraping is still a gray area.
https://en.m.wikipedia.org/wiki/Van_Buren_v._United_States
I think it only applies to systems that aren't available to the general public, which in this case was the GCIC. Anything that is available to the public, even if it requires some sort of registration, would I think be legal to scrape. YMMV though.
More related to the submission content -- at the time we used rotating proxies, both in-house & external (ProxyMesh - still exists & only good things to say about it); they allowed us to "pin" multiple requests to an IP or to fetch a new IP, etc...
This whole field of scraping and anti-bot technology is an arms race: one side gets better at something, the other side gets better at countering it. An arms race benefits no one but the arms dealers.
If we translate this behavior into the real world, it ends up looking like https://xkcd.com/1499
Nobody wants to scrape, it's messy and fickle and a general pain in the backside. But sometimes the data you need exists only in that form.
If you run a website and you have a problem with scrapers, then make all that data available through an API and say what acceptable rate limits are. If cost is an issue, then charge a proportionate fee, my time writing a scraper is worth much more than paying a few dollars for an API.
If you just say "No" to everything then you lose all control over the process and the only outcome will be such an arms race.
I am curious what the author means by automating social media accounts to manage ads more efficiently
I think a better solution is to implement 2FA/MFA (even bad 2FA/MFA like SMS or email will block the mass attacks, for people worried about targeted attacks let them use a token or software token app) or SSO (e.g. sign in with Google/Microsoft/Facebook/Linkedin/Twitter who can generally do a better job securing accounts than some random website). SSO is also a lot less hassle in the long term than 2FA/MFA for most users (major note: public use computers, but that's a tough problem to solve security wise, no matter what).
Better account security is, well, better, regardless of the bot/credential stuffing/etc problem.
Or to put it another way, naively, having api.example.com and realpeople.example.com separated out into separate sandboxes seems reasonable, but due to the aforementioned problem, it's not. But then it also turns out to be the wrong axis for this anyway, and you need your monitoring to work for you.
But for some websites, even residential IPs don't let you pass.
I noticed there is something like a premium reCAPTCHA service, which just works differently than the standard one and does not let you pass. It's mostly shown with a Cloudflare anti-bot page.
If you really care but don't want to spend the money, just block the subnet each time you see a Googlebot request. "whois w.x.y.z" returns an entire CIDR, and it seems unlikely to me that Google is scraping from a bunch of disconnected /24s.
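Once you have the CIDR from whois, checking whether a request falls inside it is a few lines with Python's `ipaddress` module. A minimal sketch; the ranges below are documentation-only example networks, not anyone's real allocation:

```python
import ipaddress


def in_blocked_subnet(ip: str, blocked_cidrs: list[str]) -> bool:
    """Return True if ip falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are fine.
    return any(addr in ipaddress.ip_network(cidr) for cidr in blocked_cidrs)


# Placeholder ranges reserved for documentation (RFC 5737 / RFC 3849).
BLOCKED = ["192.0.2.0/24", "2001:db8::/32"]

print(in_blocked_subnet("192.0.2.17", BLOCKED))
print(in_blocked_subnet("198.51.100.1", BLOCKED))
```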
- trying to disrupt business processes (eg: false referral listings, gift card scams, etc)
- trying to disrupt systems
I'm sure there are folks who use bots and scrapers for home automation, but these users generate marginal traffic in comparison. The real cost, aside from successfully achieving the points above, is the bandwidth and hardware costs that become overhead. Bots are usually coded with retry mechanisms and ways to change connection criteria on subsequent retries.
- Reselling aggregated data
- Competitive pricing and inventory data
- "Sniping", like with auctions, event tickets, or things like airline check-in processes that are first-come, first-serve
- Weird SEO stuff where people scrape content in the hopes that isn't indexed yet, and they can beat you to it.
- And, sort of in the space you mentioned, searching for existing vulnerabilities by various signatures, or trying to brute-force guess things like passwords.
I know this is off-topic, but I'm really curious. How does scraping the web help with home automation? Maybe downloading weather data could help, but crawling the web? I think I'm missing something about home automation.
Imagine if the goal is images and videos, now you've got yourself some heavy duty scraper that could cost the website owner lots of data fees.
So you do some thing which once a day scrapes a site and pulls off some data that you use in your thing. Maybe you talk about it to friends, or you have this thing as one of your github repositories. Some of your friends download the repository and also start using your thing. They talk to people about how cool your thing is, or what it does and the nice convenience of automating something that you used to have to do manually.
There are 86,400 seconds in a 24 hour period, probably folks won't change your code at all at first, and as it diffuses into the community some webmaster starts seeing this weird spike of queries that happen once a day at some time. Different addresses but always the same kind of request.
It's not a problem when it's like a 10 or 20 qps burst, but when it starts getting up to 100 - 200 or worse 1000 - 2000, it causes the system to perhaps spin up additional instances that it isn't going to need after the burst and waste money. So the webmaster starts denying those requests with a 404.
Now sometimes your code works and sometimes it doesn't but you don't know that the webmaster is fighting you yet. Maybe eventually you start randomly varying the request time, or the people who have copied your thing are in more varied time zones so it starts getting spread over the day.
Now the webmaster is seeing bursts of traffic nearly every hour on the hour and that is weird, so a more aggressive mitigation strategy is enacted.
People using your thing complain that it keeps breaking so you look into it and realize that the site is trying to block your requests. Perhaps you don't understand why this is, or perhaps you do and don't care, either way you come up with some strategies that avoid the block (maybe you rotate the user-agent or something).
Now the query traffic is spiking again and the webmaster is getting complaints that this 'bot traffic' is resulting in useless AWS fees because it isn't part of the revenue traffic and it is forcing the service to add more resources for their customers.
Not all scrapers are malicious, but my experience is that it is rare that a non-malicious scraper application isn't talked about and shared (amongst people who have a similar itch that the thing is scratching) and because it's all open source it spreads around.
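If you do publish a once-a-day scraper, one cheap way to avoid the synchronized spike described above is to give every install its own slot somewhere in the 86,400 seconds of the day. A sketch of the idea; `install_id` is a hypothetical per-installation string (a hostname or random UUID would do), not something from any particular project:

```python
import random

SECONDS_PER_DAY = 86_400


def daily_offset(install_id: str) -> int:
    """Return a stable per-installation offset (in seconds) into the day."""
    # Seeding with the install id means each copy keeps the same slot from
    # day to day, while different copies land at different times.
    rng = random.Random(install_id)
    return rng.randrange(SECONDS_PER_DAY)


offset = daily_offset("my-laptop")
print(f"run {offset // 3600:02d}:{offset % 3600 // 60:02d} after midnight")
```

The point of a stable offset rather than a fresh random time each day is that the site sees one request per install per day, spread evenly, instead of bursts on the hour.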
By the way great work on Marginalia search engine, I love it.
A small number of sites have blocked my crawler, but that's almost always been my own fault, and happened in a few instances when the crawler was misbehaving and actually fetching too aggressively (or repeatedly). In every case, just sending an email to the site explaining what happened and humbly asking for a second chance has been enough to be allowed back in.
Most website owners don't seem to mind small search engines at all, what they don't want is scrapers that aggressively scrape their entire site 10 times a day, ignoring robots.txt, and being a general nuisance.
"Legitimate uses" is what the site operator says it is, nothing more nothing less. There are no laws that say you can scrape a site and circumvent their protection against doing so.
Maybe respect user freedom? If I can access the data using my browser, why can't I access it using my script?
Why is Google the only one who can do it? Must they have yet another monopoly?
> Bots with seconds between the requests rarely get blocked even by CDNs.
I've had scripts that made 1 request per day get blocked for no reason. Not to mention the endless cloudflare javascript bullshit they made me support for it to even work.
I run a search engine and do my own crawling, and this does not correspond to my view of reality at all.
I have had almost no problems with getting blocked. If I have gotten blocked, it's usually been my own fault and I've been able to get unblocked by sending them an email explaining I run a search engine and asking for forgiveness because my bot wasn't behaving well.
The bots that do get blocked are in most cases bots that misbehave, ignore robots.txt, fetch the same resources repeatedly or with insane crawl-delays.
There are a few rare exceptions, but the whole "why does Google get a free pass when I don't?"-angle just doesn't hold water at all.
> I've had scripts that made 1 request per day get blocked for no reason. Not to mention the endless cloudflare javascript bullshit they made me support for it to even work.
Google doesn't repeat requests every day. I don't repeat requests every day. That's a weird thing to do, and it's well within a site owner's prerogative to block that nonsense.
> Why is Google the only one who can do it?
Because site operators explicitly allow them automated access. If you want the same treatment you have to ask for it.
Google can do it because most website operators want them to index their site. Plus, it is trivial to tell google to stop. That goes for all search engines.
either due to negligence (Apple Store developer console),
Security (Google Play Store accounting data),
Or financial gain (AppsFlyer "premium API").
Tell that to any Cloudflare site on security level High or "I'm Under Attack!" year round.
Obviously you haven't used Instagram recently.
Some of these so-called "advanced" techniques:
* We use our own mobile emulation software (similar to BlueStacks). Turns out, mobile helps with a lot of things (below).
* We use mobile IPs only. Mobile LTE data users are behind CGNAT for IPv4. You can't block one ip without possibly blocking hundreds of innocent IPs using the same exit point.
* All you need is a new useragent and browser fingerprint; combined with emulation + mobile IPs, there's really no easy way for companies to block this.
* With the advent and ease of virtualization; we avoid using any headless browsers. Seriously, if you can, never use headless. This should be close to rule number one for anyone looking to operate any kind of scrapers. All of our scrapers are run in isolated virtual instances with full mobile browsers.
* We can easily reset our device identifier, device carrier, simulated SIM information, and especially important is the Google advertising ID that is set per device; the list goes on. The key here is #1, our mobile emulation software.
* Our automation scripts are a combination of human-recorded sets of actions which we then perfected and can run in certain loops (for some of our data).
Also, does this work only for browsers or also for mobile apps? I have always assumed that it is always theoretically possible to get data from browsers (as a very extreme resort, save the browser page or use a screenshot plus computer vision), but that it can be impossible to get data from apps (especially iOS). Are my assumptions correct?
Also can you explain mobile IPs more? If they are such a big vulnerability, why is there no potential solution to them?
> You can't block one ip without possibly blocking hundreds of innocent IPs using the same exit point.
This is similar to how Tor is supposed to work in practice. Make everyone look like the same user so these companies can't tell who's who.
When treated like a puzzle it can be really interesting. So I thought I'd share a few tidbits.
1) We did a simple 'speed' test: how many queries per second were coming from an IP, with an auto-ban on the limit being exceeded. We started at 100qps and watched as the traffic moved down to 99.5qps. Pushed to 10qps and watched the traffic follow it down. Even at 3qps you would get traffic at a bit over 2 qps trying to limbo in under the limit.
2) At that time, lots of people who hijacked browsers with toolbars sold scraping as a service to third parties. Their toolbar would check in to see if it should do a query and it would launch a query and return the results without the user even knowing. One company, 80 legs, was pretty up front about their "service", SEO types would use it to scrape Google results to see how their SEO campaigns were doing.
3) The majority of the traffic had criminal intent, looking for metadata on web pages to indicate they were running an unpatched version of some store software or had sql injection bugs. These would often come from PCs that had been compromised for other purposes or "zombie" PCs. We could rapidly map out these networks when we got 100 queries from 100 different IPs looking for "joomla version x.y",p=1 through "joomla version x.y",p=100. We briefly played around with sending them official looking SERPs but all the links went to fbi.gov through an obfuscator.
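For anyone curious, the 'speed' test in 1) can be approximated with a sliding window per IP. This is only a sketch of the general technique, not the code we ran; the class name and the 3 qps figure in the usage lines are just for illustration:

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Ban an IP once it exceeds max_qps within a sliding window."""

    def __init__(self, max_qps: float, window_seconds: float = 1.0):
        self.max_requests = max_qps * window_seconds
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.banned = set()

    def allow(self, ip: str, now: float) -> bool:
        if ip in self.banned:
            return False
        recent = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_requests:
            self.banned.add(ip)  # auto-ban on exceeding the limit
            return False
        return True


limiter = PerIpRateLimiter(max_qps=3)
# Four requests inside one second from the same address: the fourth trips the ban.
burst = [limiter.allow("192.0.2.1", t) for t in (0.0, 0.1, 0.2, 0.3)]
# One request per second from another address stays under the limit.
slow = [limiter.allow("192.0.2.2", t) for t in (0.0, 1.0, 2.0, 3.0)]
print(burst, slow)
```

It also shows why the traffic just followed the limit down: anything that paces itself slightly under `max_qps` is never banned by this check alone.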
One of the more effective strategies was to field a "black hole" server, basically it was an http server that answered like you had gotten hold of it but then it never sent any data. With some simple kernel mods these TCP connections were silently removed on our end so they took no resources and the client would wait basically forever. We ack'd all keep alive packets with "Yup, we're here." so they just kept waiting and waiting.
It really was a never ending game. We mass banned an entire Ukrainian ISP because out of billions of queries not a single one was legitimate.
In this specific case, the people wanting to detect bots want to avoid having their signals burned, the scrapers don't want the defenders to know which signals they're able to cloak since it will spur new signals development. So what gets disclosed publicly is just the really simple stuff.
It's kind of sad. There's a huge pipeline for getting people up to speed on security engineering, since there's a lot of incentive for everyone in the ecosystem to share information as publicly as possible except for relatively short responsible disclosure windows.
In contrast, the only way to learn abuse engineering is to happen to work in an organization with an abuse problem (and live with the frustration of an abnormally long ramp up period), or to go black hat. And likewise it's quite hard for the good guys to actually learn from each other, since they're spread across so many companies and it's thus hard for them to exchange information on what works and what doesn't.
Social media platforms have made physical appearance a first-class citizen of the reputation economy. I'm not even talking about those platforms which outright bury content from the disabled as a policy; I'm talking about those platforms whose algorithms favor selfies and videos over text/URLs, thereby putting those with accessibility issues at a severe disadvantage.
Why would you use such platforms, one might say; do something which has nothing to do with the reputation economy, they might add. Well, have you looked at LinkedIn lately? LinkedIn has become synonymous with professional job search, and a 30-second video intro is the very first thing on the profile, not the skills, which is what the platform was meant to be about when it was launched. One must be naive to claim that physical appearance in that video or profile picture doesn't affect job prospects (several studies have stated otherwise).
It's not just physical appearance. The act of creating videos or posting photos is itself hard for a time-constrained person[1], and so I think it's reasonable to ask the platform to allow bots to post deep-fake videos of the user doing the silly things which these platforms expect from an average user.
Together, there's little to detect that is different from a regular user. The reason residential IPs are given heavy importance, though, is that it's the one part that costs a lot of money: if you need enough of them, you need to use a proxy service, and you transfer a lot of data. Entry level pricing is over $15/GB for high quality services.
At the bare minimum, TFA stops most attacks. That's a whole lot better than the current situation.
It seems that the Duo core app is a variant of HOTP?
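For reference, plain HOTP as specified in RFC 4226 is short enough to write out. This is the generic algorithm only; I'm not claiming it matches whatever Duo actually does, and the secret below is the RFC's published test secret, not a real key:

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Compute an RFC 4226 HOTP code for the given secret and counter."""
    # HMAC-SHA1 over the counter encoded as an 8-byte big-endian integer.
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte picks a 4-byte window.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# Test secret from RFC 4226 Appendix D.
RFC_SECRET = b"12345678901234567890"
print(hotp(RFC_SECRET, 0))  # 755224, the first RFC 4226 test vector
```

Both sides keep the counter in sync, which is also why a script holding the secret can generate the same codes as the phone app.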
What's the name of the phishing group and any details on them? There was a Defcon or Black Hat video where they would constantly send a push approval to the mobile which was not PIN protected and most people would click on it. Don't remember which OTP generator it was.
To find more IPs, make your own website, and wait until GoogleBot eventually shows up :)
That's totally understandable though. I insist on meeting in public as well... It's the real world, safety is most important.
The thing with websites is they already allow me to make thousands of requests from my browser. What harm does it do if I make a bunch of requests from a script? I don't see it.
> The solution to the automation problem is to do what most companies do and have registered API integrations.
Yeah, those are pretty great. I always use those whenever possible. Many of the sites I use lack those though. Some have APIs so badly designed that scraping their web site actually results in fewer requests and less overhead.
I generally appreciate registered API integrations, but the trouble is that the sites that are most problematic for benign automation usually don't have enough demand or revenue to justify well-maintained APIs.
I tend to think the solution is more to somehow make the market prefer more decentralized solutions, preferably federated. Not having one big target for bad actors means much less effort applied to attacking any one of the targets.
Dating might be a bit off topic, but I can see both sides as well. Women have genuine risks, and are very justified in taking precautions. But for the majority of decent guys, it can be tiresome to be constantly treated like you're an evil violent stalker. Maybe it needs a similar solution - a movement to local connections where people can have reputations that you can trust.
That's exactly my case though. I have a few scraper scripts that I've never published. So what if it's rare? Do I deserve to be treated like a botnet just because it's inconvenient for some webmaster or company to do otherwise? That's not fair at all.
I'm sure there are others, just from the top of my head:
* Electricity prices (as OP mentioned). Especially for people with solar panels or multiple options for heating.
* Watching for availability/prices of products or new homes one might be on the lookout for. Notifications at price drops/availability
* Public transport: next bus/trains from closest station, delays and interruptions
* IMDB/tvdb/etc for monitored shows and movies. Common with sonarr.
* Air quality, covid outbreaks, whatnot
I scrape one site whose content changes every second. How is it "nonsense" to make one request every 24 hours? I make hundreds, thousands when I browse their site normally using my browser.
We've got people in this very thread talking about bots making hundreds of requests per second. How is one request every 24 hours harming anyone? People told me to make one request per hour to avoid hammering their servers, I decided to wait a day instead. It boggles my mind that this generous interval could possibly be considered abuse. How long should the interval be then? A month? A year? Infinitely long so the scraper never makes requests?
Smaller ISPs are also more likely to have issues with CPE getting compromised and routers running botnets within the comfort of your home. There are services which invite people to sell their residential bandwidth in return for money, this can potentially have a disproportionate impact at smaller sample sizes.
Networks can declare themselves to be ISPs. You could check if your ISP shows up as an ISP in peeringDB.
https://networkengineering.stackexchange.com/questions/44585...
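A rough sketch of that PeeringDB check in Python, in case it helps. It assumes the public endpoint https://www.peeringdb.com/api/net?asn=<ASN> and that the self-declared network type sits in an "info_type" field (newer records may carry an "info_types" list instead); treat both field names as assumptions and check them against the current API docs. The helper names are made up for illustration.

```python
import json
import urllib.request

# Assumed public PeeringDB endpoint for looking up a network by AS number.
PEERINGDB_NET_URL = "https://www.peeringdb.com/api/net?asn={asn}"


def declared_types(payload: dict) -> list:
    """Collect the network type(s) declared in a PeeringDB 'net' response."""
    types = []
    for record in payload.get("data", []):
        # Assumption: a single string in "info_type", or a list in "info_types".
        single = record.get("info_type")
        if single:
            types.append(single)
        types.extend(record.get("info_types") or [])
    return sorted(set(types))


def looks_like_isp(payload: dict) -> bool:
    """True if any declared type mentions ISP, e.g. 'Cable/DSL/ISP'."""
    return any("ISP" in t for t in declared_types(payload))


def fetch_net(asn: int) -> dict:
    """Fetch the PeeringDB record for an AS number (performs a network call)."""
    url = PEERINGDB_NET_URL.format(asn=asn)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Usage would be something like looks_like_isp(fetch_net(64512)), with your ISP's real AS number in place of the placeholder. Keep in mind the type is self-declared, so it only tells you what the network says about itself.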
They are reportedly very proactive when it comes to CPE security, as well, up to giving customers a proactive phone call when they see somebody is using equipment with known vulnerabilities (customers are allowed to operate their own equipment as long as it is deemed compatible, most will use remotely managed equipment, tho, I believe; my dad used to use his own DSL router and once got such a call if I remember correctly. He switched over to their fiber now and managed equipment).
Their AS is indeed identified as an "ISP" in peeringDB.
While I would be extremely surprised if the company was doing shady things, you surely got a point that a small ISP like that could suffer more in reputation from some few customers being up to shady things, including sub-lending the line. I am pretty sure that is against the ToS, but enforcement is a problem of course. Especially detecting such traffic without violating German privacy laws is probably a difficult task, but it's not impossible.
--edit--
Nevermind, here it is: https://support.cloudflare.com/hc/en-us/articles/36003538743...
It should skip to "I run a good bot...", but if it doesn't, that's where you want to scroll.
Mailinator would do a similar thing with their custom email server hardware. Since they didn't really use sockets in the traditional sense, they were happy to give slooow replies and never disconnect "bad" connections.
If you don't mind, can you elaborate on the 2nd point? Specifically, what do you mean by hijacking browsers with toolbars?
Lots of people did it, even Blekko (although we stopped after we figured out it was just a scam), and the way it worked is toolbar company X would approach your internet site and say "we can send you a lot of traffic, just use this toolbar, we'll even pay you every time one gets installed."
Anyway, the toolbar would hook itself into the address bar and search selection hook in the browser and redirect every search to wherever it was told to. The nefarious part is that these toolbars were often shipped to the company as binaries, not source, so you didn't really know everything they were doing. Once we figured out they were scumbags we also found out that they did some really scumbaggish things.
We stopped using them but lots of people did and finally the browser makers re-wrote the browsers to make what they did either impossible or easy for the user to revert and those guys rolled up their shops and went on to become some other type of scam.
There's various captcha solving services where you pay in bulk per captcha and submit data via an api.
It's like writing a game bot with Java robots and pixel detection. It may be inefficient and may take longer to build than a network solution, but I have yet to be detected anywhere.
Primarily it is for automating web applications for testing purposes
Here was I thinking it was just a tool to scrape all the js I am too lazy to reverse engineer.
The majority of them want to do something completely benign like see a BBC show in the US, or watch an American football show in the UK, and the one defining feature of capitalism is its many contradictory faces.
One corporation wants to arbitrarily limit its customer base, and the other corporation wants to arbitrarily limit what data its customer base can see. In between the two is a space big enough to drive a Mack truck or a lorry through, depending on where you're from...
One might justify this until the one corporation merges with the other and then you have a situation where the same corporation wants to do two different things to the same pool of users.
I found out while sorting through business contributors to a non-profit once that a pretty big market for mobile relays is the "free VPN" offered on the app stores to high school kids looking to circumvent the outgoing blocks on the school's wifi.
In that case the school's administration could easily purge the "free VPN" of local users by removing the wifi restrictions. Instead, they serve more traffic to more nefarious places to maintain an illusion of control not for the kids in the school, but for themselves.
All of this is basically Dr. Strangelove but with spyware rather than nuclear bombs.
Not really. In a lot of cases websites use JavaScript to call some API along with some on the fly generated token to prevent abuse.
As long as that token isn't captcha you can reverse engineer the site to do scraping without javascript and that is so much faster than browser based scraping.
It's just a cat-and-mouse game. After a few years I think hardware attestation etc. will come into play, which can mitigate the bot issue somewhat.
Once we're all on IPv6 we can go back to blocking IPs. But then IPv6 creates its own problems.
Basically what's different about mobile and laptop IPs?
Your home or local coffee shop might have 10-20 users max behind a single IP. A mobile network, on the other hand, might put most of a small-to-medium city behind 3 or 4 IPs.
100% reliance on a phone which is easily lost, broken, stolen, etc. without backup is really bad IMO. My bank (Revolut) only had a mobile app, and no way to contact them outside of it (I tried...) I need to switch banks.
1. https://blog.revolut.com/introducing-the-revolut-web-app/
Other than for DDoS, the solution is to make the app better: caching, access control and limiting, etc.
In most cases you have very limited ability to decide what other people cannot do, and other people have mostly infinite choices of what they can do. I have never heard anything as broad as what you said. What you said is like a person standing on the street in a T-shirt that says "do not look at me more than twice" and claiming it is legally binding on the whole world.
It kind of does though; if I own a store and say that only people with hats can enter, then I'm free to do so. Silly? Yes. Legal? Also yes.
There are some circumstances where it's not legal, mostly centred around discrimination. Details on this differ per jurisdiction, but generally speaking you have a right to refuse customers.
To me it seems sending an HTTP request is somewhere in between looking (legal) and entering (illegal if not permitted).
However, most important is that the web as a system makes positive interactions easy and negative difficult. We have already found some set of constraints achieving this for interactions in public city streets. But it's not obvious the same rules (that we have internalized) have the same effect in another medium of communication.
Sure. At least then you're being honest. If you hate me, it doesn't matter what user agent I use to access your site. Browsers, scripts, they are all me.
I totally can though. If I sign a document saying another person can do such and such on my behalf, that person can totally do that. Yes, even at the bank. No different from a user agent, really.
Also see: https://www.ft.com/content/0e746280-e72c-4087-9c0d-df2a7af82...
Feb 2021 FT: "Bills mount in Texas power market after freeze sends prices soaring: Financial casualties emerge as grid operator Ercot requires billions in payments "
Thanks to deregulation, the electricity rate isn't the only thing you pay for though. There is also a Transmission Charge, Distribution Charge, and Local Access Fee. These are all per-kWh charges and change very rarely.
In October, my electricity rate is $0.10730/kWh, but my total cost is actually $0.16346/kWh plus the per-day charge ($0.202/day). Tomorrow the November rate will be published.
[0] https://ucahelps.alberta.ca/cost-comparison-tool-result.aspx...
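To make the arithmetic concrete, here is a small Python sketch using only the figures quoted above. The 600 kWh / 31 days in the usage note are made-up illustration values, and I'm assuming the gap between the two per-kWh numbers is the combined transmission, distribution, and local access charges.

```python
from decimal import Decimal

# Figures quoted in the comment above (October).
ENERGY_RATE = Decimal("0.10730")   # $/kWh, the published electricity rate
ALL_IN_RATE = Decimal("0.16346")   # $/kWh, rate plus the other per-kWh charges
DAILY_CHARGE = Decimal("0.202")    # $/day, fixed charge


def per_kwh_adders() -> Decimal:
    """Per-kWh charges on top of the energy rate (assumed to be the
    transmission, distribution, and local access charges combined)."""
    return ALL_IN_RATE - ENERGY_RATE


def monthly_cost(kwh, days) -> Decimal:
    """Total bill in dollars for the given usage and billing days, rounded to cents."""
    total = Decimal(str(kwh)) * ALL_IN_RATE + Decimal(str(days)) * DAILY_CHARGE
    return total.quantize(Decimal("0.01"))
```

So the non-energy per-kWh charges come to $0.05616/kWh, about a third of the all-in rate, and a hypothetical 600 kWh month of 31 days would be monthly_cost(600, 31), i.e. 600 x 0.16346 + 31 x 0.202 = $104.34. Only ENERGY_RATE would need updating from the scraped value each month.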
It seems to me from a life-safety angle that their energy would likely be far better spent on recommending smoke alarms, CO meters, and periodic cleaning of dryer vents than on recommendations against sleeping with washing/drying machines running.
In short - if there's a fire it's much better that you're awake and up already.
https://www.flickelectric.co.nz/pricing-and-plans
https://www.pauatothepeople.co.nz/cheap-as.html
Aside from this, peak / off-peak pricing is not unusual.
Of course if everyone did this, then all of your logins would have the same password (your email login).
Can you say more about this? What do you mean by "Once you get to selenium it's usually over", and how do you manage cold starts in Selenium and emulating heavy usage?
Say your program starts right now, I assume you don't go through "adding heavy usage" to "warm-up", then get down to business, correct?
Travel information is also one of those services where it’s not weird for a significant number of their users to use it quite heavily, making behavioral detection more difficult.
They won't cash out on them or buy items. Instead, they'll collect cards from a source (skimming, hacking, whatever), validate them by adding them to a website that does an authorization (those $1 checks that never get committed). They can then sell them wholesale for a premium compared to non-verified cards.
I've been away from that world for a while but remember that more serious operations will separate the cashing out part (either money or goods) from their acquiring / validating operation because the former carries more risk.
There's also an episode of the Darknet Diaries podcast (https://darknetdiaries.com/episode/85/) about card cloning which I found interesting.
Not necessarily. Some online services, particularly shops that do home delivery, may give their customers the possibility of adding a card to their wallet and perform a verification as part of the process. As a result, it becomes possible to validate a stolen card number without performing an actual transaction.
The most amusing to me are the ones from “Microsoft” to alert me that they have detected malware on my computer.
What do you mean?
This sketchy company lets mobile app developers monetize user base by letting other people pay $$ to route requests through random people’s mobile IPs: https://brightdata.com/
Similarly NordVPN owns Oxylabs (who mostly hack routers and cameras and sell those as residential IP’s).
Mostly to avoid hitting 'ddos protections' or other security bullshit that doesn't really make sense on 1 daily request or so.
I think these companies used to go after extension developers; now it seems they have found a new idea: implanting malware in apps, which is not easy to detect.
As a scraper operator on a mobile data connection all you need is a new useragent and browser fingerprint, there's no easy way for a scraper-blocker-operator to tell that you're not a totally new person.
With all due respect, if the tech can make a large impact on the problems mentioned above, I'm sure it's an easy decision for the big companies to take decimating bot activity over the tiny minority of users who proactively decide to disable JavaScript.
Said as someone who uses NoScript, FWIW.
I'm a linux user myself, so I know for a fact that neither my previous employer, nor other bot vendors, will block linux user agents in particular. Customers generally don't mind a universal requirement for JS execution, so that's just a fact of life. We generally did try to avoid blocking privacy focused browsers, though. We certainly monitored false positive rates and knew pretty well how we affected users.
Have you tried not using these things? Anonymity is exactly what bots want. They want to be able to post a spam message every single second and be impossible to ban since they are anonymous. The internet can't function if people are allowed to be anonymous.
You must have missed the first 20 or so years of its existence, if that's your position.
Stopping abuse has always been a game of trying to deanonymize users in order to try and ban the harmful ones.
All of these have independently caused me to get into endless ReCaptcha loops: firefox on android, smartphone with unusual screen resolution, clean browser profile with VPN.
It's so common that I now default to using duckduckgo, which never blocks me. I doubt DDG has a lower DDoS/Resources ratio than Google. Some companies are just lazier and less principled than others.
"Unusual" = not Chrome and doesn't allow tracking scripts.
Switch to Safari with an ad blocker for a week, see how many more ReCaptcha prompts you get.
Fortunately, almost all of the websites I visit with my anonymized browser aren't places that I wish to attempt to post a message. Unfortunately, I can easily run into defenses of an entire site when the problem is spam sending.
Parent poster trusts his bank, and his bank would trust him once it knows he's not a fraudster, so maybe it's in everyone's interest to just allow the JavaScript for that one site.
This is a scenario where you have a server explicitly saying "Stop! You are not permitted to access this computer!", and yet you persist in circumventing that by hiding your identity and accessing it anyway. Those are some murky waters.
It depends on who the server operator is. If it's your server, yeah, anyone I don't want to be there should go away. If it's your enemy's server, the argument that they're sending that page to the rest of the Internet turns out to be a decent one.
[0] https://www.eff.org/deeplinks/2018/04/scraping-just-automate...
Maybe we need a status code that means ‘lay off all the requests made from this entire system’?
You are both wrong: copyright law both says you can't (in some cases for some uses) and that you can (under implicit license, fair use, and other rules) in others.
For example:
curl "https://www.ryanair.com/api/booking/v4/en-gb/availability?ADT=1&CHD=0&DateIn=&DateOut=2021-11-15&Destination=BER&Disc=0&INF=0&Origin=MAN&TEEN=0&promoCode=&IncludeConnectingFlights=false&FlexDaysBeforeOut=2&FlexDaysOut=2&ToUs=AGREED" | jq
401 Unauthorized
to mean you are authorized to access the resource?
> Although the HTTP standard specifies "unauthorized", semantically this response means "unauthenticated". That is, the client must authenticate itself to get the requested response.
So it would seem that it actually doesn't positively imply that you're NOT authorized.
Which kind of makes sense; machines can't detect legality of things, just that certain procedural niceties haven't been observed.
> The client does not have access rights to the content; that is, it is unauthorized, so the server is refusing to give the requested resource.
Machines don't have any legal responsibility, bot-operators do. Which is why respecting these things is sort of important. At any rate, 40x does not mean "try again with a different user agent and another IP"
In practice there's of course nuance, like anyone will occasionally type in the wrong password on a log-in screen, maybe try again and then realize it was the wrong log-in prompt. That's mostly fine.
That's different from deliberately trying to circumvent a measure like this. If you are doing the stuff in the link, you are absolutely crossing a line and you know it.
There's a large difference between "I got a 403 so I hit F5 once" and "I got a 403 so I used a residential proxy and spoofed my user-agent".
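For what it's worth, the distinction being argued here can be written down as a policy in a script. Below is a minimal Python sketch of how a well-behaved scraper might react to these status codes; the Action names and the one-hour default are my own assumptions, not anything from a standard. On the earlier wish for a "lay off" status code: 429 Too Many Requests, optionally with a Retry-After header, is the closest existing thing.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    PROCEED = "proceed"            # use the response
    AUTHENTICATE = "authenticate"  # retry only with credentials you legitimately hold
    BACK_OFF = "back_off"          # wait, then try again later
    STOP = "stop"                  # do not retry, and do not disguise the client


def next_action(status: int) -> Action:
    """Map an HTTP status code to what a polite client should do next."""
    if 200 <= status < 300:
        return Action.PROCEED
    if status == 401:
        # "Unauthenticated": the server is asking who you are.
        return Action.AUTHENTICATE
    if status in (429, 503):
        # Rate limited or temporarily overloaded: slow down.
        return Action.BACK_OFF
    # 403 and anything else unexpected: treat as a refusal, not a puzzle to solve.
    return Action.STOP


def retry_delay(retry_after: Optional[str], default: int = 3600) -> int:
    """Seconds to wait, from a Retry-After header given in seconds.

    The HTTP-date form of Retry-After is not parsed here; it falls back
    to the default (an assumed one hour).
    """
    if retry_after is not None and retry_after.strip().isdigit():
        return int(retry_after.strip())
    return default
```

The point of the sketch is what is missing: there is no branch that swaps the user agent or routes through another IP after a 403.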