Avoiding bot detection: How to scrape the web without getting blocked? (github.com)

586 points by proszkinasenne2 4 years ago | 298 comments

bsamuels 4 years ago |

> I need to make a general remark to people who are evaluating (and/or) planning to introduce anti-bot software on their websites. Anti-bot software is nonsense. Its snake oil sold to people without technical knowledge for heavy bucks.

If this guy got to experience how systemically bad the credential stuffing problem is, he'd probably take down the whole repository.

None of these anti-bot providers give a shit about invading your privacy, tracking your every movement, or whatever other power fantasy that can be imagined. Nobody pays those vendors $10m/year to frustrate web crawler enthusiasts; they do it to stop credential stuffing.

Mister_Snuggles 4 years ago | |

I wish they'd limit it to just stopping credential stuffing.

Here's my scenario: My electricity provider publishes the month's electricity rates on the first of the month, and I want to scrape these so that I can update the prices in Home Assistant. This is a very simple task, and it's something that Home Assistant can do with a little configuration. Unfortunately this worked exactly once; after that it started serving up some JavaScript to check my browser.

The information I'm trying to get is public and can be accessed without any kind of authentication. I'm willing to bet that they flipped on the anti-bot stuff on their load balancer for the entire site instead of doing the extra work to enable it only for electricitycompany.com/myaccount/ (where you do have to log in).

I also asked the company if they'd be willing/able to push the power rates out via the smart meters so that my interface box (Eagle-200) could pick it up, they said they have no plans to do so.

The next step is to scrape the web site for the provincial power regulator, which shows the power rates for each provider. Of course, the regulator's site has different issues (rounding, in particular); I haven't dug any further to see if I can make use of this.

All of this effort to get public information in an automated fashion.

kevin_thibedeau 4 years ago | | |

At a minimum any scraper that doesn't execute JS needs to impersonate a screen reader user agent. Locking out disabled people has to be many levels of illegal in most countries.

xnyan 4 years ago | | |

Unfortunately, the days of reliable non-JavaScript capable scraping are over.

Fortunately there are plenty of tools to handle this, and at a hobby level not particularly resource intensive. Something like this is simple and reliable in many cases: https://github.com/berstend/puppeteer-extra/tree/master/pack...

judge2020 4 years ago | | |

> I wish they'd limit it to just stopping credential stuffing.

A product protecting against credential stuffing might as well prevent denial-of-service as well.

S_A_P 4 years ago | | |

I’m not sure what rate you are trying to get, but the electric market in the US has 5-minute settlement periods. So for your region you would need to grab the price for each period and average that to get a power rate. Take that rate, add transmission fees, taxes and various other fees your provider tacks on, then multiply that by usage. In Texas you can go directly to the ERCOT site and get these prices and not worry about countermeasures. I’m not sure where you are, but there is likely a similar wholesale site that you can access.
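As a rough illustration of the arithmetic described above, here is a minimal Python sketch. The function name, the fee model (a flat per-kWh adder plus a percentage tax) and all numbers are made up for the example; real tariffs are structured differently from provider to provider.

```python
from statistics import mean


def estimate_bill(interval_prices_per_kwh, usage_kwh, fees_per_kwh=0.0, tax_rate=0.0):
    """Estimate a bill from settlement-period prices (hypothetical fee model)."""
    # Average the per-period prices to get a single energy rate.
    energy_rate = mean(interval_prices_per_kwh)
    # Add per-kWh fees (transmission etc.), then apply tax to the combined rate.
    all_in_rate = (energy_rate + fees_per_kwh) * (1 + tax_rate)
    # Multiply the all-in rate by usage.
    return all_in_rate * usage_kwh


# Three made-up 5-minute prices in $/kWh; a real hour would have twelve.
prices = [0.10, 0.20, 0.30]
bill = estimate_bill(prices, usage_kwh=100, fees_per_kwh=0.05, tax_rate=0.10)
print(round(bill, 2))
```

This uses a plain average, as the comment describes; weighting each period by the usage in that period would be a different (and arguably more accurate) calculation.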

cebert 4 years ago | | |

Have you considered using Playwright to automate that instead?

walrus01 4 years ago | | |

Out of curiosity, how is it that you have electricity rates that change every month? Are you buying power through a third-party organization? The vast majority of places I've seen have a fixed tariff for residential use that changes no more often than every 12-24 months.

herbcso 4 years ago | | |

Check if your electricity provider offers Green Button Download. https://www.energy.gov/data/green-button

jameshart 4 years ago | |

Bots aren't just trying credential stuffing. They are:

- committing clickfraud to game ad and referral revenue systems

- posting fake or spam reviews and comments

- generating fake behavioral signals to help bypass CAPTCHAs to help create accounts on other sites that can post spam comments

- validating stolen credit card details

- screwing with your metrics collection if you can't identify them as bots

All of that is enough reason for sites to use bot detection and blocking technology. The fact that the same tech also has some utility against accidental or malicious traffic-based DoS is also a bonus.

fragmede 4 years ago | | |

> -validating stolen credit card details

To be clear, "validating" is an industry euphemism for stealing, just for a different purpose. How do you validate the card is live? Run a real transaction through it and mark it based on the result. But what do you run for this real transaction? Well, whatever you want. Typically it'll be something to avoid suspicion as much as possible, but the thief gets to pick what they test it with, so why not pick something that they'll personally benefit from? There are so many online games with purchasable currency these days, it's hard to choose.

rndgermandude 4 years ago | | |

All very valid reasons (except to a degree the "screwing with metrics" one). There are a lot of websites which do not really face any of the aforementioned issues, simply because they do not sell you anything, are not an ad network or run referral programs, do not even have user-generated content, etc. And even for the sites that do, it's usually a rather small part of the "surface" that needs such protections, e.g. the actual API call to make a checkout or post a comment.

And yet if you dare browse the web with TOR or a VPN or sometimes just happen to be on a small ISP[0][1] then you're being punished immediately. You solve your cloudflare-supplied captcha because you may be a bot (you're not, and the dangerous bots will not be defeated by this anyway, but some humans will be), and then you get an error from the website itself because it runs a secondary bot detection thing. And you weren't even anywhere near anything "dangerous" like a checkout page.

[0] My parents use a regional small ISP (but locally very popular) that serves around 50K customers. My parents also use a regional bank (a Volksbank, and those are members of a national association that provides all kinds of services). Suddenly that bank would not even let them see the bank's front page. After some back and forth on the phone support line it turned out the bank had recently deployed some "advanced" bot detection, one that had a whitelist of residential-ISP-associated AS/IP ranges, and of course whoever compiled and maintained that list had forgotten to include that small local ISP. For that regional bank it meant they had just shut out a very significant number of their customers (and potential customers just trying to look up what the bank offers), as there was very likely a huge overlap of people using that regional ISP and that regional bank (both are regional, after all). It also was something they couldn't fix themselves, as the "online banking" stuff was not in-house but was run by the national association (which probably used some bot-detection-as-a-service provider). It took the bank (or rather, the national association) a few weeks to fix. Mind you, the last few years that bank has been heavily marketing a cheaper "online only" account: only online banking, online support, and access to the self-serve ATMs and banking terminals, but no face-to-face or even ear-to-ear human interaction. Try contacting "online support" about "the website outright refuses me" when the "online support" is only available on that website. Kudos if you're smart enough to switch to their mobile app, as your phone uses a different ISP, unless you forget to turn off wifi. That's the advice my parents got from the phone support (they still have an account type that is not online-only).

[1] When I recently visited my parents, from the wifi [same ISP as in 0] I couldn't open the website of a bakery to look up if and when they would be open on a Sunday. Some error message about "this website is not available in your network" (English text, for a German bakery... suspicious :P). I could open it via my mobile, tho. I could open it from my regular ISP when back home (in another city) again. Mind you, that website is purely informational and has no "interactive" features, let alone lets you buy anything. It's just static text and some pictures.

oxymoron 4 years ago | |

Yeah, I used to work for one of the major anti-bot vendors. Customers weren't clueless. Nobody buys these solutions because they're so much fun; it's a cost center and they monitor their ROI quite closely. Credit card chargebacks, impact to infrastructure, extra incurred cost due to underlying APIs (like in the airline industry in particular) etc. are all reasons why bot mitigation is a better option than nothing for a lot of companies, even if it's not 100% effective.

azalemeth 4 years ago | | |

You very much missed the false positive rate! I'm fed up of being classed as a bot just because I browse with uMatrix, a Linux user agent, and a ton of ad filtering and anonymisation tech. I had to try to log in to my bank about ten times today because their js-crap website didn't like me (grumble why does it even need to ask for my desktop's accelerometer data via js...)

Stuff like this is a pain beyond pain. I really hope that the clients you mention know that they piss off a proportion of their users with every move they take.

spookthesunset 4 years ago | | |

Not to mention a lot of these bots are after scamming the company’s own customers. Breaking into accounts to commit fraudulent activity, to reach out and “recruit” people into whatever scam they are trying to run.

Nobody wants to spend time trying to stop these bots. It is, however, a very necessary thing to do.

dzhiurgis 4 years ago | | |

Do you know more about airline API pricing?

I’ve noticed most sites won’t let you search business fares efficiently, so I made my own for Google Flights, which only worked for like 6 months until they added a bunch of changes that made it near impossible to scrape.

ivanhoe 4 years ago | |

Saying that Anti-bot software is nonsense is like saying that door locks are snake oil too. We've all seen Lockpicking Lawyer on Youtube opening with ease any lock out there, so how come that all of us haven't got robbed yet?

Well, because protection is not a binary thing - either being 100% safe or 100% not working - instead it's a proportion between the skill/effort/time needed to break in, and the reward you get for it.

To stop the majority of attacks you don't have to be absolutely unbreakable, you just need to make it hard enough for the majority of attackers that it doesn't pay off for them compared to the value of the data you're protecting. And that's where anti-bot SW has its place: it slows down spiders and global attacks, forcing custom-tailored scraping that is constantly being fine-tuned and infrastructure to hide your IPs, and that makes the operation way more expensive and harder to run continuously...

melony 4 years ago | |

The gold standard is residential IP. It is not cheap but its effectiveness is indisputable.

northwest65 4 years ago | | |

Back when we had to scrape airline websites to get the deals they withheld for themselves, residential IP was indeed the way. Once they cottoned on to it and blocked it, you'd simply cycle the ADSL modem, get a new IP, and off you'd go again.

Now the best part... one division (big team) of our company worked for the (national carrier) airline, one division of our company worked for the resellers (we had a single grad allocated to web scraping). The airline threw ridiculous dollars at trying to stop it, and we just used a caffeine-fueled nerd to keep it running. It wasn't all fun though, they'd often release their new anti-scraping stuff on a Friday afternoon. They were less than impressed when they learnt who the 'enemy' was. Good times!

jonatron 4 years ago | | |

A residential IP would help for IP-based detection. As the Readme mentions, there's also JavaScript-based detection. If, for example, your browser has navigator.webdriver set incorrectly, then you can still get blocked even on a residential IP.

TedDoesntTalk 4 years ago | | |

Not anymore. Now it’s mobile IP addresses.

sparkling 4 years ago | | |

There are services that detect residential IPs being used for scraping nowadays. Plus there are other ways of detecting scraping: browser fingerprinting, aggressive rate-limiting and CAPTCHAs etc.

bredren 4 years ago | | |

Unless you use a residential ip proxy network.

hattmall 4 years ago | |

I've always thought credential stuffing and most password hacking attempts could be defeated by simply logging into randomly generated dummy accounts if the password is wrong. Just make it so that the same username / password combo takes you to the same random info. Real users should notice things were wrong immediately but bots would have no way to tell unless they already knew some of the real information.
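One way the "same wrong credentials always lead to the same fake account" idea could be sketched is to seed a random generator from a keyed hash of the submitted pair. This is only an illustration: `SERVER_SECRET`, `decoy_profile` and the profile fields are hypothetical, and a real deployment would have many more details to get right.

```python
import hashlib
import hmac
import random

# Hypothetical server-side secret; in practice it would come from secure config.
SERVER_SECRET = b"change-me"


def decoy_profile(username: str, password: str) -> dict:
    """Return fake account data that is stable for a given (username, password)."""
    # Keyed hash, so outsiders cannot reproduce the decoy data themselves.
    digest = hmac.new(
        SERVER_SECRET, f"{username}\x00{password}".encode(), hashlib.sha256
    ).digest()
    # Seeding the PRNG with the digest makes the fake data repeatable.
    rng = random.Random(digest)
    return {
        "display_name": username,
        "balance_cents": rng.randrange(0, 500_000),
        "last_login_days_ago": rng.randrange(1, 60),
    }


first = decoy_profile("alice", "wrong-guess")
second = decoy_profile("alice", "wrong-guess")
print(first == second)
```

Because nothing is stored per guess, the server does not need to remember which decoys it has handed out.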

sparkling 4 years ago | |

> Anti-bot software is nonsense. Its snake oil sold to people without technical knowledge for heavy bucks.

I disagree. Obviously there is no way to 100% stop scraping, but for a rather small amount of $ you can implement some measures that make it harder. Services like https://focsec.com/ offer ways to detect web scrapers using proxies/VPNs (one of the most common techniques) for little money.

> Nobody pays those vendors $10m/year to frustrate web crawler enthusiasts, they do it to stop credential stuffing.

Keep in mind that they may be legally or contractually forced to do this. Think of Netflix who are investing heavily into their Anti-VPN capabilities, most likely because they have contracts with content publishers & studios that force them to do so.

devit 4 years ago | |

If users using weak/reused passwords is your problem, just don't let users choose a password (generate it for them), or don't use passwords at all (send link by e-mail that adds a cookie), or use oauth login.
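For the "send a link by e-mail" option, a minimal sketch of a signed, expiring login token in Python might look like this. The token layout, `SIGNING_KEY` and both function names are assumptions made for the example; sending the e-mail and setting the cookie are left out.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

# Hypothetical signing key; a real site would persist it across restarts.
SIGNING_KEY = secrets.token_bytes(32)


def _sign(payload: str) -> str:
    return hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()


def make_login_token(email: str, ttl_seconds: int = 900, now: Optional[float] = None) -> str:
    """Build the token that would be embedded in the e-mailed link."""
    issued = time.time() if now is None else now
    payload = f"{email}|{int(issued + ttl_seconds)}"
    return f"{payload}|{_sign(payload)}"


def verify_login_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the e-mail address if the token is authentic and unexpired."""
    try:
        email, expires, sig = token.rsplit("|", 2)
    except ValueError:
        return None
    expected = _sign(f"{email}|{expires}")
    # Constant-time comparison to avoid leaking how much of the signature matched.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    current = time.time() if now is None else now
    if current > int(expires):
        return None
    return email


token = make_login_token("user@example.com", now=1000)
print(verify_login_token(token, now=1001))
```

On a successful verification the site would set its session cookie. This sketch does not make tokens single-use, which a real implementation would probably want.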

folmar 4 years ago | | |

Link-only login is the most underused security option, even more so for low-profile sites that need a minimal user account but do not really need full-on security.

jjav 4 years ago | |

> None of these anti-bot providers give a shit about invading your privacy, tracking your every movement, or whatever other power fantasy that can be imagined.

There is a vast amount of profit available in doing just that (see e.g. GOOG and FB market cap). Even companies that truly have no intention of exploiting data collected as a side-effect of whatever product line they run nearly always end up going for that profit line eventually. The extra income is too much of a temptation for a company to keep passing up on moral grounds in the long term.

chucksmash 4 years ago | |

The credential stuffing wiki page didn't exist the last time I thought about invalid traffic so I'm pretty out of date.

How is there not an equilibrium here that cuts off credential stuffers? I'd naively imagine the residential IP providers have some measure of bad actors they themselves use to determine if a client is worth it, and that someone getting all your IPs blacklisted would get dropped pretty quickly.

judge2020 4 years ago | | |

In reality residential US ISPs don't really care if their users are getting a sub-par experience since they're often the only fast/fiber provider in the area of their customers, meaning customers have no way to switch. Plus, when a website doesn't work, unless the page itself calls out the ISP (which they never do), customers will think it's an issue with the website and won't possibly attribute blame to their ISP until they're deep in forum threads with people telling them "it's probably your ISP not doing anything about bad customers" - the amount of users going so far to learn this information, then accepting it, is extremely low.

astatine 4 years ago | |

On a site I used to run, there was no content which needed protection. So, it was not much of a pain except that there would be a lot of bot-filled contact forms. Slowly the problem became severe enough that bandwidth fees started becoming an issue. Finally had to use Cloudflare in the front to reduce bandwidth usage. It worked, but the side-effect was that some valid users may now get blocked.

nextaccountic 4 years ago | |

> Nobody pays those vendors $10m/year to frustrate web crawler enthusiasts, they do it to stop credential stuffing.

I don't know about $10m/year, but many sites block bots just because they don't want competitors to access publicly available data. Which is bullshit.

krageon 4 years ago | |

> he'd probably take down the whole repository.

I know how bad this issue is, and I wouldn't take down this repository. Anti-bot software does not work, anyone who pays 10m per year to have it simply has too much money.

mindslight 4 years ago | |

If your password db is so broken that it's useful to create a term to abstract attacks ("credential stuffing"), then the right answer is to actually fix that security (e.g. pick users' passwords for them, or completely replace with email auth), rather than thinking you're raising the bar by requiring attackers to come from a residential IP.

Gigachad 4 years ago | |

2FA should be a requirement on everything now. And if your site can't for some reason or you don't want to deal with it, then limit your site to external login providers only.

2FA, especially app based, has been proven to work really really well.

ixs 4 years ago | | |

It does not. There are myriad ways of extracting the TOTP seed from these apps... Or you just reverse engineer the setup/confirmation process and then you can generate/trigger your own tokens from your automation workflow.
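To illustrate why possession of the seed is all an automation workflow needs, here is a minimal sketch of the standard TOTP algorithm (RFC 6238, HMAC-SHA1 variant) using only the Python standard library. It is the generic algorithm, not code from any particular authenticator app or from the repository linked below; the secret shown is the published RFC test secret.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, digits: int = 6, step: int = 30) -> str:
    """Compute a time-based one-time password (RFC 6238, HMAC-SHA1)."""
    # The moving factor is the number of time steps since the Unix epoch.
    counter = struct.pack(">Q", unix_time // step)
    mac = hmac.new(secret, counter, hashlib.sha1).digest()
    # Dynamic truncation as defined for HOTP in RFC 4226.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# RFC 6238 test secret (ASCII) and test time.
print(totp(b"12345678901234567890", 59, digits=8))
```

Apps usually store the seed Base32-encoded, so it would need decoding (for example with `base64.b32decode`) before being passed in as bytes.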

2FA is a good security feature but it does not help against web scraping. Credential stuffing and other 3rd party attacks? Yes, it _can_ help. But it does not always help. There's a phishing group that has seemingly specialised in getting people to click the green confirm button in their Duo app... ¯\_(ツ)_/¯

Check https://github.com/revalo/duo-bypass for a python script that can be used to automate Duo tokens... Has some code from me. There are similar scripts for all the other well known OTP Apps...

walrus01 4 years ago | | |

How do you propose to implement two-factor authentication, on something like the public facing homepage of an airline ticket price search website, where if you make people "sign in with google" or whatever, a sizeable proportion won't do it and will just go to the competition?

cultofmetatron 4 years ago | | |

That's great till you're in a foreign country and your phone suddenly decides to die, leaving you stranded and unable to access bank accounts or prove your identity. (Happened to me.)

selfhoster11 4 years ago | | |

Hard disagree. A recipe sharing website or food delivery service DOES NOT require as much security as my email or bank account, and never will.

5faulker 4 years ago | |

Same thing goes with ad blocking to a similar degree.

mdoms 4 years ago | |

If it was just about credential stuffing they would only put limits on POST requests.
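A limit that applies only to POST requests could be sketched as a per-client sliding window like the one below. The class name, the thresholds and the idea of keying on client IP are assumptions for illustration, not a description of what any vendor does.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class PostRateLimiter:
    """Sliding-window limiter that only counts POST requests (illustrative)."""

    def __init__(self, max_posts: int, window_seconds: float):
        self.max_posts = max_posts
        self.window = window_seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of recent POSTs

    def allow(self, client_ip: str, method: str, now: Optional[float] = None) -> bool:
        if method.upper() != "POST":
            return True  # reads are never throttled in this sketch
        now = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_posts:
            return False
        hits.append(now)
        return True


limiter = PostRateLimiter(max_posts=2, window_seconds=60)
results = [limiter.allow("192.0.2.1", "POST", now=t) for t in (0, 1, 2)]
print(results)
```

Attackers who spread attempts over many IPs would stay under a per-IP limit like this one, which is part of what other comments in this thread are arguing about.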

ChuckMcM 4 years ago |

I am always amazed when otherwise intelligent people assert without data that the marginal cost of serving web traffic to scrapers/bots is zero. It is kind of like people who say "Why don't they put more fuel in the rocket so it can get all the way into orbit with just one stage?"

It sounds great but it is a completely ignorant thing to say.

ufmace 4 years ago |

What I really enjoy about this thread is all of the completely different perspectives. Lots of people doing anti-abuse research bemoaning that this stuff exists, and lots of people working against what are from their perspective ham-handed anti-abuse tech blocking legitimate useful automation trading tips on how to do it better. I guess the other sides of those we don't see much. People doing actual black-hat work probably don't post about it on public forums, and most of the over-broad anti-abuse is probably a side effect of taking some anti-abuse tech and blindly applying it to the whole site just because that's simpler, often no tech people may be really involved at all.

matheusmoreira 4 years ago | |

When these companies endeavor to stop abuse, they trample all over our freedoms. Suddenly we can't have non-browser user agents anymore. Suddenly we can't root our smartphones anymore. They want nothing to do with us unless it's 100% on their terms with us completely under their control.

Spivak 4 years ago | | |

There seem to be wildly different perspectives on "bad actors means we can't have nice things" -- one group says that this is a fact of life, and the other says that this is an affront to freedom. A non-tech example is I've had guys on Tinder get legit angry at me for insisting that our first few dates have to be in public places where we drive separately -- "oh so you think I'm some creepy stalker?" And like I am totally empathetic to their hurt because I'm sure that they know they're good, but I don't and there's no way for me to tell in advance. Malicious actors don't exactly announce themselves and actively try to hide their intent.

The solution to the automation problem is to do what most companies do and have registered API integrations.

bryan_w 4 years ago | | |

Yeah, but you're not entitled to use their servers. If your use of their servers is something they don't like, their freedom is to blackhole your packets

marginalia_nu 4 years ago |

If someone is signalling to you that they do not want your bot on their site, then maybe respect that? Trying to circumvent it is, besides being legally questionable, a serious pain in the ass for the site owner and makes websites more prone to attempt to block bots in general.

Also, in my experience, most websites that block your bot, block your bot because your bot is too aggressive, or because you are fetching some resource that is expensive that bots in general refuse to lay off. Bots with seconds between the requests rarely get blocked even by CDNs.
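Keeping "seconds between the requests" can be as simple as a per-host throttle. The sketch below is one possible shape for it; the class name and the 5-second default are arbitrary choices for the example, and the injectable `clock` and `sleep` parameters exist only so it can be exercised without actually waiting.

```python
import time
from urllib.parse import urlsplit


class PoliteThrottle:
    """Enforce a minimum delay between requests to the same host (illustrative)."""

    def __init__(self, min_delay_seconds=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the most recent request

    def wait(self, url: str) -> float:
        """Sleep if needed before requesting `url`; return the seconds slept."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (now - last))
        if delay:
            self._sleep(delay)
        self._last[host] = now + delay
        return delay


# Fake clock so the example runs instantly.
fake_now = [0.0]
throttle = PoliteThrottle(
    5.0,
    clock=lambda: fake_now[0],
    sleep=lambda s: fake_now.__setitem__(0, fake_now[0] + s),
)
slept = [throttle.wait("https://example.com/a"), throttle.wait("https://example.com/b")]
print(slept)
```

In real use you would call `throttle.wait(url)` immediately before each fetch and leave the default clock and sleep in place.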

al2o3cr 4 years ago |

    You use this software at your own risk. Some of them contain malwares just fyi

LOL why post LINKS to them then? Flat-out irresponsible...

    you build a tool to automate social media accounts to manage ads more efficiently

If by "manage" you mean "commit click fraud"

abadger9 4 years ago |

I'm a lead engineer on the search team of a publicly traded company whose bread and butter is this domain. I was curious about this list; candidly, it misses the mark. The tech mentioned in this blog is what you might get if you hired a competent consultant to build out a service without having domain knowledge. In my experience, what's being used on the bleeding edge is two steps ahead of this.

curun1r 4 years ago |

There’s one technique that can be very useful in some circumstances that isn’t mentioned. Put simply, some sites try to block all bots except for those from the major search engines. They don’t want their content scraped, but they want the traffic that comes from search. In those cases, it’s often possible to scrape the search engines instead using specialized queries designed to get the content you want into the blurb for each search result.

This kind of indirect scraping can be useful for getting almost all the information you want from sites like LinkedIn that do aggressive scraping detection.

amelius 4 years ago | |

But won't the search engines block you after some limit has been reached?

curun1r 4 years ago | | |

Eventually, but they’re not very aggressive when it comes to bot detection. Simple IP rotation usually works.

tomrod 4 years ago | | |

Some.

janmo 4 years ago | |

Or, you can spoof the google bot or Bing bot user agent and try to scrape the site that way.

notriddle 4 years ago | | |

Impersonating Googlebot is a great way to get blocked. Real Googlebot only comes from certain IP addresses.
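On the site owner's side, the usual way to check a claimed Googlebot is a reverse DNS lookup on the client IP followed by a forward lookup on the resulting hostname, which is the approach Google documents. A minimal Python sketch follows; the function name and the injectable lookup parameters are assumptions for the example, and the hostname and IP in the demo are made up.

```python
import socket


def is_real_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Reverse-then-forward DNS check for a claimed Googlebot IP (IPv4 sketch)."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False
    # The PTR record must be under a Google-owned crawler domain.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The hostname must resolve back to the same IP, or the PTR could be forged.
    return ip in addresses


# Offline demo with fake resolvers and made-up data.
fake_reverse = lambda ip: ("crawl-192-0-2-10.googlebot.com", [], [ip])
fake_forward = lambda host: (host, [], ["192.0.2.10"])
print(is_real_googlebot("192.0.2.10", fake_reverse, fake_forward))
```

A spoofed user agent fails this check because the scraper's IP does not have a matching PTR record, which is why impersonation tends to get blocked quickly.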

rp1 4 years ago |

It's very easy to install Chrome on a linux box and launch it with a whitelisted extension. You can run Xorg using the dummy driver and get a full Chrome instance (i.e. not headless). You can even enable the DevTools API programmatically. I don't see how this would be detectable, and probably a lot safer than downloading a random browser package from an unknown developer.

xiamx 4 years ago | |

Try your technique on a few of these fingerprint testing sites https://github.com/niespodd/browser-fingerprinting#fingerpri... I'm pretty sure it's quite detectable

rp1 4 years ago | | |

Hmm maybe I will if I have time. We've been using this technique for user-initiated scraping. The only issue we've run into is we get rate-limited by IP sometimes. Changing the IP has solved the problem each time.

Lukabuz 4 years ago | | |

If I am correct in assuming the parent is talking about puppeteer, there is a plugin[1] that claims to evade most of the methods used to detect headless browsers. I have used it recently for just that purpose, and I can say that it worked with minimal setup and configuration for my use case, but I guess depending on the detection mechanisms you're evading YMMV.

The creator of that plugin does mention it is very much a cat and mouse game, just like most of the “scraping industry”

[1] https://www.npmjs.com/package/puppeteer-extra-plugin-stealth

walrus01 4 years ago |

Google "residential proxies for sale" if you want to see the weird shady grey market for proxies when you need your traffic to come from things like cablemodem operator ASNs' DHCP pools

spookthesunset 4 years ago | |

Wonder what fraction of that traffic is from p0wned IoT refrigerators, smoke detectors or WiFi enabled light bulbs… probably more than anybody cares to admit…

preinheimer 4 years ago | | |

Lots of them, the vast majority of the players in that space are absolutely terrible: https://medium.com/@xianghangmi/resident-evil-understanding-...

jrockway 4 years ago | | |

Some amount, probably, but some of these "residential IP" providers just buy IP blocks from small residential or business ISPs. (Or even less shadily, buy IP transit from them and get assigned a block of IPs that the databases think are normal residential/business users.)

walrus01 4 years ago | | |

Also a lot of people who've been tricked into installing malware on their Windows PCs, from shady "VPN" operators and others

cmauniada 4 years ago | |

I’ve been using some of these for a shoe bot that I’m working on. They are pretty hit or miss.

welanes 4 years ago |

Another great resource is incolumitas.com. A list of detection methods is here: https://bot.incolumitas.com/

I run a no-code web scraper (https://simplescraper.io) and we test against these.

Having scraped millions of webpages, I find dynamic CSS selectors a bigger time sink than most anti-scraping tech encountered so far (if your goal is to extract structured data).

kingcharles 4 years ago | |

Can your scraper be used to scrape images? I need to scrape some books from a paywalled site and they are presented a page at a time. The JS code is too complex for me to bother trying to figure out how it creates the unique tokens it applies to every image it displays to avoid a very simple scrape.

peterburkimsher 4 years ago |

2 of my social media accounts have fallen victim to bot detection, despite not using scripts. There are other websites for which I have used scripts, and sometimes ran into CAPTCHA restrictions, but was able to adjust the rate to stay within limits.

CouchSurfing blocked me after I manually searched for the number of active hosts in each country (191 searches), and posted the results on Facebook. Basically I questioned their claim that they have 15 million users - although that may be their total number of registered accounts, the real number of users is about 350k. They didn't like that I said that (on Facebook) so they banned my CouchSurfing account. They refused to give a reason, but it was a month after gathering the data, so I know that it was retaliation for publication.

LinkedIn blocked me 10 days ago, and I'm still trying to appeal to get my account back.

A colleague was leaving, and his manager asked me to ask people around the company to sign his leaving card. Rather than go to 197 people directly, I intentionally wanted to target those who could also help with the software language translation project (my actual work). So I read the list of names, cut it down to 70 "international" people, and started searching for their names on Google. Then I clicked on the first result, usually LinkedIn or Facebook.

The data was useful, and I was able to find willing volunteers for Malay, Russian, and Brazilian Portuguese!

After finding the languages from 55 colleagues over 2 hours, LinkedIn asked for an identity verification: upload a photo of my passport. No problem, I uploaded it. I also sent them a full explanation of what I was doing, why, how it was useful, and a proof of my Google search history.

But rather than reactivate my account, LinkedIn have permanently banned me, and will not explain why.

"We appreciate the time and effort behind your response to us. However, LinkedIn has reviewed your request to appeal the restriction placed on your account and will be maintaining our original decision. This means that access to the account will remain restricted.

We are not at liberty to share any details around investigations, or interpret the terms of service for you."

So when the CAPTCHA says "Are you a robot?", I'm really not sure. Like Pinocchio, "I'm a real boy!"

arp242 4 years ago | |

CouchSurfing is just shit, full stop. I love the concept and hosted many people, but the way the company has been run over the last few years is beyond atrocious. It's like AirBnB sent over some people to intentionally run it into the ground or something.

LinkedIn has to deal with a lot of scummy recruiters and scammers; I don't blame them for being very strict.

nocturnial 4 years ago |

I knew there was a reason why I used client certificates and alternate ports.

Why is it so difficult to just respect robots.txt? Maybe there's an idea for a browser plugin that determines if you can easily scrape the data or not. If not, then the website is blocked and then traffic will drop. I know this is a naive idea...
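Checking robots.txt before fetching takes only a few lines with Python's standard library. The sketch below parses a made-up robots.txt from a string so that it runs offline; the helper name, the rules and the `example-bot` user agent are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
print(rules.can_fetch("example-bot", "https://example.com/rates"))
print(rules.can_fetch("example-bot", "https://example.com/private/x"))
print(rules.crawl_delay("example-bot"))
```

Against a live site you would instead call `set_url(...)` and `read()` on the parser to download the file, then make the same `can_fetch` checks.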

remram 4 years ago | |

I don't understand what you recommend. Who would drop the traffic?

teeray 4 years ago |

Never underestimate the scraping technique of last resort: paying people on Mechanical Turk or equivalent to browse to the site and get the data you want

adinosaur123 4 years ago |

Are there any court cases that provide precedent regarding the legality of web scraping?

I'm currently looking for ways to get real estate listings in a particular area and apparently the only real solution is to scrape the few big online listing sites.

Grimm1 4 years ago | |

https://en.m.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

That’s one of the bigger ones. Unfortunately recent events mean scraping is still a gray area.

amelius 4 years ago | | |

Legal gray areas are perfect for growth hacking. Just look at Uber and AirBnb.

omgwtfbyobbq 4 years ago | | |

Do you mean this case?

https://en.m.wikipedia.org/wiki/Van_Buren_v._United_States

I think it only applies to systems that aren't available to the general public, which in this case was the GCIC. Anything that is available to the public, even if it requires some sort of registration, would I think be legal to scrape. YMMV though.

adanto6840 4 years ago | |

I was involved in a scraping-related case, though in my situation we were scraping public domain data/facts/public domain media. Email me if you'd like additional info. :)

More related to the submission content -- at the time we used rotating proxies, both in-house & external (ProxyMesh - still exists & only good things to say about it); they allowed us to "pin" multiple requests to an IP or to fetch a new IP, etc...
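
The "pin" versus "fetch a new IP" behaviour described above can be sketched roughly as follows. This is a minimal illustration, not ProxyMesh's actual API: the class name `StickyProxyPool`, its methods, and the proxy addresses (taken from a documentation-only IP range) are all assumptions made for the example.

```python
import itertools


class StickyProxyPool:
    """Round-robin proxy pool that can pin a session key to one proxy."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._pinned = {}

    def fresh(self):
        """Return the next proxy in the rotation."""
        return next(self._cycle)

    def pinned(self, session_key):
        """Return the same proxy for every call made with this session key."""
        if session_key not in self._pinned:
            self._pinned[session_key] = self.fresh()
        return self._pinned[session_key]

    def release(self, session_key):
        """Forget the pin; the next call takes the next proxy in rotation."""
        self._pinned.pop(session_key, None)


# Hypothetical proxy endpoints (192.0.2.0/24 is reserved for documentation).
pool = StickyProxyPool([
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
])
```

A multi-step flow (say, a login followed by several page fetches) would call `pool.pinned("some-session")` for each request, while independent one-off requests would call `pool.fresh()`.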

IceWreck 4 years ago |

Half of the short links to cutt.ly aren't working. Why use short links in markdown?

yamakadi 4 years ago | |

It’s most likely for tracking clicks. Better to just search for the company names instead of clicking on the links in case they lead to unexpected places.

ev1 4 years ago | | |

It's hidden affiliate spam without disclosure.

dpryden 4 years ago |

It always amazes me how people believe they have a right to retrieve data from a website. The HTTP protocol calls it a request for a reason: you are asking for data. The server is allowed to say no, for any reason it likes, even a reason you don't agree with.

This whole field of scraping and anti-bot technology is an arms race: one side gets better at something, the other side gets better at countering it. An arms race benefits no one but the arms dealers.

If we translate this behavior into the real world, it ends up looking like https://xkcd.com/1499

zarzavat 4 years ago | |

Because often that data is only available through scraping.

Nobody wants to scrape, it's messy and fickle and a general pain in the backside. But sometimes the data you need exists only in that form.

If you run a website and you have a problem with scrapers, then make all that data available through an API and say what acceptable rate limits are. If cost is an issue, then charge a proportionate fee, my time writing a scraper is worth much more than paying a few dollars for an API.

If you just say "No" to everything then you lose all control over the process and the only outcome will be such an arms race.

kingcharles 4 years ago | | |

God. This. The number of times I've spent 2 days of my very expensive time coding a scraper to get data I'll use once, when I would have paid a few dollars just to download it in a text file.

connectsnk 4 years ago |

For the row "Long-lived sessions after sign-in" the author mentions that this solution is for social media automation i.e. you build a tool to automate social media accounts to manage ads more efficiently.

I am curious about what the author means by automating social media accounts to manage ads more efficiently.

namdnay 4 years ago | |

Clickfraud

kseifried 4 years ago |

Trying to stop credential stuffing by blocking bots will not work, and can often severely impact people depending on assistive technologies.

I think a better solution is to implement 2FA/MFA (even bad 2FA/MFA like SMS or email will block the mass attacks; for people worried about targeted attacks, let them use a token or software token app) or SSO (e.g. sign in with Google/Microsoft/Facebook/Linkedin/Twitter, who can generally do a better job securing accounts than some random website). SSO is also a lot less hassle in the long term than 2FA/MFA for most users (major note: public use computers, but that's a tough problem to solve security wise, no matter what).

Better account security is, well, better, regardless of the bot/credential stuffing/etc problem.

softwaredoug 4 years ago |

A lot of web scraping is annoying, often because there’s *an explicit API built for the scraper's needs*. Instead of looking for an API, many reach for web scraping first. This in turn puts load and complexity on the user-facing web app, which must now tell scrapers from real users.

no_time 4 years ago | |

Using the API almost always has more "strings attached". Like you have to register and get an API token or something. Or even pay. If you want people to use your API, don't make it less convenient than scraping the page.

fragmede 4 years ago | |

But if there's an API, then the overall load is the same, no?

Or to put it another way, naively, having api.example.com and realpeople.example.com separated out into separate sandboxes seems reasonable, but due to the aforementioned problem, it's not. But then it also turns out to be the wrong axis for this anyway, and you need your monitoring to work for you.

kingcharles 4 years ago | | |

No, the load isn't the same because the web page might be a multi-megabyte monster piece of badly-coded HTML that returns only say 10 out of 1,000,000 results and needs to be paged through, where the API might return all the million results in a nice JSON chunk.
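
The request-count side of that comparison can be worked out directly from the figures in the comment (1,000,000 results, 10 per page). The function name `requests_needed` is made up for this sketch, and it counts requests only; it does not try to estimate bytes transferred, since the comment gives no exact page size.

```python
def requests_needed(total_results, page_size):
    """Number of requests required to fetch every result (ceiling division)."""
    return (total_results + page_size - 1) // page_size


# Scraping the paged HTML: 10 results per page.
paged = requests_needed(1_000_000, 10)

# Hypothetical bulk API call that returns everything in one response.
bulk = requests_needed(1_000_000, 1_000_000)

print(paged, bulk)  # prints "100000 1"
```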

greeklish 4 years ago |

Here's a good resource about web scraping: https://bot.incolumitas.com/#:~:text=more%20sources%2Finform...

kinderjaje 4 years ago |

I am running a no-code web automation and data extraction tool called https://automatio.co. From my experience, most of the time you will be fine when using quality residential proxies. But that comes at a cost, since they are far more expensive than data center proxies.

But for some websites, even residential IPs don't let you pass.

I noticed there is something like a premium reCAPTCHA service, which works differently from the standard one and does not let you pass. It's mostly shown with a Cloudflare anti-bot page.

intricatedetail 4 years ago |

By the way - is it possible to stop Google bot from scraping without maintaining a list of IP addresses? Google doesn't publish these and it's not good to run reverse DNS as it slows down legitimate clients. I know you can put a meta tag, but the bot still has to make a request to read it. I would like to completely cut off Google from scraping.

jrockway 4 years ago | |

You can buy databases of who owns which IP blocks.

If you really care but don't want to spend the money, just block the subnet each time you see a Googlebot request. "whois w.x.y.z" returns an entire CIDR, and it seems unlikely to me that Google is scraping from a bunch of disconnected /24s.
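
A rough sketch of the "block the whole subnet" idea, using Python's standard `ipaddress` module. The class name `SubnetBlocklist` is invented for the example, the CIDR would in practice be copied by hand from the `whois` output, and the addresses below come from documentation-only ranges rather than any real Google allocation.

```python
import ipaddress


class SubnetBlocklist:
    """Keeps a list of CIDR blocks and tests client addresses against them."""

    def __init__(self):
        self._networks = []

    def block(self, cidr):
        # strict=False accepts a CIDR whose host bits are set, e.g. 192.0.2.7/24.
        self._networks.append(ipaddress.ip_network(cidr, strict=False))

    def is_blocked(self, address):
        ip = ipaddress.ip_address(address)
        return any(ip in network for network in self._networks)


blocklist = SubnetBlocklist()
# Placeholder CIDR standing in for whatever "whois w.x.y.z" reported.
blocklist.block("192.0.2.0/24")

print(blocklist.is_blocked("192.0.2.77"))    # inside the blocked /24
print(blocklist.is_blocked("198.51.100.5"))  # outside it
```

The membership check is a local computation, so it avoids the per-request reverse DNS lookup the parent comment was worried about.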

drivebycomment 4 years ago | |

Just put up a robots.txt and block Googlebot from there. Google obeys robots.txt.
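
For illustration, a robots.txt that turns away Googlebot while allowing everyone else could look like the string below, checked here with Python's standard `urllib.robotparser`. The file contents and the `example.com` URL are assumptions for the sketch; it only shows how a compliant crawler would interpret the rules, not whether any given crawler actually honours them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Googlebot everywhere, allow all other agents.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

googlebot_allowed = parser.can_fetch("Googlebot", "https://example.com/page")
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/page")
print(googlebot_allowed, other_allowed)  # prints "False True"
```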

intricatedetail 4 years ago | | |

Not all Google crawlers obey it. Also, if Google has already indexed something, the only way is to let it crawl again and see the meta noindex. It's a mess.

rfraile 4 years ago |

Datadome, PerimeterX, has anyone tried one of them?

cmauniada 4 years ago | |

I’ve tried skirting Datadome, and generally you can get around it just by rotating IPs. Apparently there is also a way to de-obfuscate their apps (apps that use Datadome services) to retrieve Datadome cookies, but I haven’t been bothered to check it out yet.

lavezzi 4 years ago | |

Walmart uses PX and it's pretty easy to bypass.

ev1 4 years ago | | |

They also silently load Threatmetrix now under a walmart domain CNAME.

peterburkimsher 4 years ago | |

CouchSurfing uses PerimeterX for profiles.

navels 4 years ago |

I've had a lot of success just with Selenium and this custom version of Chromedriver: https://github.com/ultrafunkamsterdam/undetected-chromedrive...

Jenk 4 years ago |

In a previous venture my team successfully circumvented bot detection for a price comparison project simply by using apify.com. Wasn't that expensive, either. We were drilling sites with 500k+ hits per day for months.

janmo 4 years ago |

egberts1 4 years ago |

A couple of things for unblockable scraping

1. plenty of VPS with many IP addresses (this is easier with IPv6 subnet)

2. HTTP header rearranging

3. Fuzzing user-agent

4. Pseudo-PKBOE algorithm

5. office hours, break-time, lunch-time activity emulation

6. ????

7. profit

I am looking at you, SSH port bashers.

completelylegit 4 years ago |

* Scrape open proxy websites for open proxies, then use those proxies, cycle which proxies you use frequently.

* Change your user-agent to a real user-agent, cycle it frequently.

* Done.
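
The user-agent cycling step above can be sketched with `itertools.cycle`. The three strings are merely illustrative browser-style user agents, and `header_rotation` is a name made up for this example; a real setup would presumably keep its list current.

```python
import itertools

# Illustrative user-agent strings; swap in whatever list you maintain.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0",
]


def header_rotation(agents):
    """Yield request headers forever, moving to the next user agent each time."""
    for agent in itertools.cycle(agents):
        yield {"User-Agent": agent}


rotation = header_rotation(USER_AGENTS)
# Four draws from a list of three: the fourth wraps back to the first agent.
first_four = [next(rotation)["User-Agent"] for _ in range(4)]
```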

billpg 4 years ago |

You could ask first. The site's robots.txt file might have some information.

Put your email address in your User-Agent string so they can get in touch if needed.
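
One way to put a contact address into the User-Agent, sketched with the standard `urllib.request` module. The bot name, the `name/version (+mailto:...)` layout, the URL and the email address are all placeholders chosen for the example rather than any required format.

```python
import urllib.request


def build_request(url, bot_name, contact_email):
    """Build a request whose User-Agent tells the site who to contact."""
    user_agent = f"{bot_name}/1.0 (+mailto:{contact_email})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder values; nothing is sent over the network here.
req = build_request("https://example.com/rates", "rates-fetcher", "ops@example.com")
print(req.get_header("User-agent"))  # rates-fetcher/1.0 (+mailto:ops@example.com)
```

Passing `req` to `urllib.request.urlopen` would then send the request with that header.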

lavezzi 4 years ago |

The proxy service recommendations are pretty expensive. Does anyone have alternatives they suggest to keep costs down?

hk1337 4 years ago |

Not to forget the most important rule, don't be an asshole to the site hosting the content.

0xlwj 4 years ago |

Pretty useful crash course on what is out there in the web scraping universe

lifeisstillgood 4 years ago |

What if we solved it by replacing passwords with client HSMs?

firerfly 4 years ago |

plivo.com is good at anti-bot. I tried many methods and some residential proxies, and they still blocked me.

nuker 4 years ago |

Will I scrape faster with RTX 3080 Ti?

shapefrog 4 years ago | |

Absolutely, but to get one you will have to have RTX 3090 scraping speeds.

curl "https://www.ryanair.com/api/booking/v4/en-gb/availability?ADT=1&CHD=0&DateIn=&DateOut=2021-11-15&Destination=BER&Disc=0&INF=0&Origin=MAN&TEEN=0&promoCode=&IncludeConnectingFlights=false&FlexDaysBeforeOut=2&FlexDaysOut=2&ToUs=AGREED" | jq

* We use our own mobile emulation software (similar to BlueStacks). It turns out mobile helps with a lot of things (below).

* We use mobile IPs only. Mobile LTE data users are behind CGNAT for IPv4. You can't block one IP without possibly blocking hundreds of innocent users sharing the same exit point.

* All you need is a new user agent and browser fingerprint; combined with emulation + mobile IPs, there's really no easy way for companies to block this.

* With the advent and ease of virtualization, we avoid using any headless browsers. Seriously, if you can, never use headless. This should be close to rule number one for anyone looking to operate any kind of scraper. All of our scrapers are run in isolated virtual instances with full mobile browsers.

* We can easily reset our device identifier, device carrier, simulated SIM information, and, especially important, the Google advertising ID that is set per device; the list goes on. The key here is #1, our mobile emulation software.

* Our automation scripts are a combination of human-recorded sets of actions, which we then perfected and can run in certain loops (for some of our data).