The State of Web Scraping 2022

The State of Web Scraping 2022 (scrapeops.io)

291 points by Ian_Kerins 4 years ago | 144 comments

KieranMac 4 years ago |

As a lawyer whose primary focus is in web scraping, this article is in many ways misleading and inaccurate. While it is true that the Van Buren case is generally positive for web scraping, the overall legal landscape is still murky. The main battleground for web scraping legal issues is shifting from the CFAA to breach of contract and various state-law issues, including misappropriation, unjust enrichment, and trespass to chattels.

In my opinion, 2021 was a bad year for the law as it relates to web scraping. The Supreme Court remanded hiQ Labs, and many high-profile lower-court cases ended badly for web scrapers. It's a darker shade of gray than it was in 2020. It can be navigated, but it's tricky.

btown 4 years ago | |

Not a lawyer, but is it at least true that web scraping alone would now be significantly less likely to be a basis for federal criminal prosecution under the CFAA?

I'm often reminded of the fact that in https://en.wikipedia.org/wiki/United_States_v._Swartz the scraped party JSTOR did not desire to press civil charges, but due to the criminal component of the CFAA, this was out of their hands - and the story ended in the worst possible way.

If the current legal landscape at least better restricts disputes over web scraping to civil litigation, it may not be a huge change for how companies look at their risks, but it could make a huge difference for individuals caught in the crossfire.

KieranMac 4 years ago | | |

Yes, I would agree with that first sentence. After Van Buren, web scraping alone would now be significantly less likely to be a basis for federal criminal prosecution under the CFAA.

digitcatphd 4 years ago | |

Good take. IMO, ethically speaking, we should not penalize scrapers themselves but instead penalize based on how the scraped data is used.

Scraping Facebook to make a clone of profiles shouldn’t be held to the same scrutiny as scraping Facebook to do an internal analysis of user demographics for research purposes.

ForHackernews 4 years ago | | |

Why should either be discouraged?

RobSm 4 years ago | |

How many contracts does Google breach by scraping billions of pages every month?

Fatnino 4 years ago | | |

Google doesn't have to proactively try very hard to ingest sites. If something is difficult for Google to scrape, they don't spend loads of engineer hours on getting it to work. They just leave the site out, and the webmaster there will quickly bend over backwards to make sure Google can scrape them. When something gets scraped into Google inadvertently, it's because the website made not even the slightest effort to protect itself.

KieranMac 4 years ago | | |

Given the nuances of browsewrap contract enforceability, perhaps not as many as you suggest. The tricky part with navigating this gray area is knowing the likely circumstances when a contract of adhesion may give rise to an actual legal claim. There are patterns.

Ian_Kerins 4 years ago | |

Interesting! I'm not a lawyer, so the content for this piece was based on commentary in the article below. It was written by their lawyer, but I would love to hear your counterpoint to it. Always good to get multiple viewpoints on something.

https://www.zyte.com/blog/van-buren-a-victory-for-web-scrape...

KieranMac 4 years ago | | |

The Zyte article isn't inaccurate; it's just a simplified assessment of a complicated issue. If you'd like a more nuanced perspective on this, please read my guest post on Prof. Goldman's blog.

https://blog.ericgoldman.org/archives/2021/06/more-perspecti...

faizshah 4 years ago | |

Is there a good blog or something that tracks these cases?

KieranMac 4 years ago | | |

Prof. Eric Goldman's blog is probably the #1 site historically on scraping and the law. I've contributed to it a few times.

https://blog.ericgoldman.org/archives/2021/06/more-perspecti...

The name of my firm is McCarthy Garber Law. I write about scraping there when I have time (which I rarely do)!

samcrawford 4 years ago | |

Enjoyed reading your bio on your website. Sub 24 hour at Leadville is super impressive! (Coming from someone who has not managed 24 hours at Western States... Yet...)

KieranMac 4 years ago | | |

Leadville is just 45 minutes up the road for me, so I'm kind of cheating!

Seattle3503 4 years ago | |

Is there a good blog post or summary that I could read?

KieranMac 4 years ago | | |

https://mccarthygarberlaw.com/a-comprehensive-legal-guide-to...

ok_coo 4 years ago |

Time for me to advocate again for people to use Common Crawl. Please don't slam people's websites; look for alternatives before scraping. There are probably other, better options: APIs, data set downloads, etc.

https://commoncrawl.org/

dewey 4 years ago | |

I'd guess that for many popular scraping use cases this is not really useful, as it's usually about being quick and up to date (job postings, availability information, e-commerce, SERPs, ...), not about having a big corpus of historic data.

weird-eye-issue 4 years ago | |

Have you used this in real world scenarios? Or is it just a nice hypothetical that sounds great in theory but almost never works in practice?

LunaSea 4 years ago | |

Common Crawl is missing far too many URLs for it to be useful in a real world scenario.

Chris2048 4 years ago | | |

But can't you add to their index?

mycall 4 years ago | |

I wish web.archive.org had an index by someone like common crawl. There is lots of great stuff on archive.org

wumpus 4 years ago | | |

web.archive.org has a CDX index, similar to Common Crawl.

Since I use both of these archives together, I wrote this code to iron out the differences between them:

https://github.com/cocrawler/cdx_toolkit
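For anyone who wants to see what a raw CDX lookup looks like before reaching for the toolkit, here is a minimal Python sketch. It assumes the Wayback Machine CDX endpoint at `web.archive.org/cdx/search/cdx` with `output=json`, where the first row of the response is a header of field names. The helper names and the sample rows are made up for illustration; this is not how cdx_toolkit itself is implemented.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX server.
WAYBACK_CDX = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, limit=5):
    """Build a CDX query URL asking for JSON output (hypothetical helper)."""
    params = {"url": url, "output": "json", "limit": limit}
    return f"{WAYBACK_CDX}?{urlencode(params)}"


def parse_cdx_json(text):
    """Turn the JSON array-of-arrays into a list of dicts.

    The first row is treated as the header; an empty response yields [].
    """
    rows = json.loads(text)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(record):
    """Build the replay URL for one capture from its timestamp and original URL."""
    return f"https://web.archive.org/web/{record['timestamp']}/{record['original']}"


# Made-up sample response, shaped like the JSON the endpoint is assumed to return.
SAMPLE = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20210101000000", "http://example.com/", "text/html", "200", "AAAA", "1234"],
    ["com,example)/", "20220101000000", "http://example.com/", "text/html", "200", "BBBB", "1250"],
])

captures = parse_cdx_json(SAMPLE)
```

To run it against the live index, fetch `build_cdx_query("example.com")` with any HTTP client and pass the response body to `parse_cdx_json`.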

kevinsundar 4 years ago | | |

They do, and it's better than Common Crawl's by my testing.

joe_91 4 years ago | |

That looks like a great resource! How often is the data set "updated"?

I'd imagine most people's use cases need data which can change from day to day or week to week but I do think that this is fantastic if I was to have a project which was looking at data across a longer timeframe.

jimkri 4 years ago | |

That is too much data to parse for a simple website scrape.

I do think Common Crawl has a lot of potential for people to use instead of scraping, but I think it's for larger projects. It gave me the idea to look at the links to ID whether they are a business or non-business website.

joe_91 4 years ago |

I'm scraping about 30 sites for work at the moment, but have a few that are using Cloudflare which has been a b*tch to deal with. Tried numerous libraries and different proxy providers, but reliability is patchy. Previous fixes like https://github.com/Anorov/cloudflare-scrape don't seem to work anymore after Cloudflare updates, so I've switched to using a pretty optimised headless browser with good proxies instead.

Ian_Kerins 4 years ago | |

This has a lot of good info on how Cloudflare and others work, and more creative ways to bypass them if the easier options don't work: https://incolumitas.com/2021/05/20/avoid-puppeteer-and-playw...

nanna 4 years ago | |

I'm finding that Cloudflare is even blocking my RSS reader from requesting feeds behind their service. It's not even just scrapers at this point.

nsonha 4 years ago | |

> optimised headless browser with good proxies instead

Are you saying you only had problems because you didn't use a headless browser before, and that now, with both a headless browser and proxies, it generally suffices to not be seen as a scraper?

temp8964 4 years ago | |

I think it will eventually go the way of stock trading. If you have a good strategy, you don't want to share it with the world, because that will render your strategy useless.

emptysea 4 years ago | |

Is the “pretty optimized headless browser” an off the shelf thing, or something custom? Are you using playwright/puppeteer to drive it?

mycall 4 years ago | | |

Headless Chrome [0] and alpine-Chrome [1] are pretty popular. Some variations also include V2Ray, Shadowsocks and other VPNs.

[0] https://hub.docker.com/r/justinribeiro/chrome-headless/

[1] https://github.com/Zenika/alpine-chrome

mellosouls 4 years ago |

> With the right combination of proxies, user agents and browsers, you can scrape every website. Even those that seem unscrapable.

> This outcome was great news for web scrapers, as it means that so long as a website has made their data public you are not in violation of the CFAA when you scrape the data even if it is prohibited in some other way (T&Cs, robots.txt, etc).

Just because you can, doesn't mean you should. It would be better I think if there was a treatment of the ethics here, rather than a seemingly "ra-ra go bots" attitude, as though the only consideration is commercial.

Ian_Kerins 4 years ago | |

100% agree, when scraping it should always be done respectfully.

- If they provide an API, then use it.

- Don't slam a website; ideally spread it out over the hours of the day when their target audience is least active (night time).

- If you can get cached data from somewhere that works, then use that.
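As a rough illustration of the "don't slam a website" point, here is a minimal Python sketch that checks robots.txt rules and spaces requests out. It uses the standard library's `urllib.robotparser`. The `PoliteGate` class, the `example-bot` user agent, and the sample robots.txt lines are assumptions made up for this example, and it does not cover the night-time scheduling mentioned above.

```python
import time
from urllib import robotparser


class PoliteGate:
    """Decides whether a URL may be fetched and how long to wait between requests."""

    def __init__(self, robots_lines, user_agent, min_delay=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self.clock = clock
        self.sleep = sleep
        self.parser = robotparser.RobotFileParser()
        self.parser.parse(robots_lines)  # parse already-downloaded robots.txt lines
        # Honour Crawl-delay if the site declares one, but never go below min_delay.
        declared = self.parser.crawl_delay(user_agent)
        self.delay = max(min_delay, declared or 0)
        self._last_request = None

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Sleep until at least `delay` seconds have passed since the last request."""
        if self._last_request is not None:
            remaining = self.delay - (self.clock() - self._last_request)
            if remaining > 0:
                self.sleep(remaining)
        self._last_request = self.clock()


# Made-up robots.txt content for illustration.
ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

gate = PoliteGate(ROBOTS, "example-bot")
```

In a crawl loop you would call `gate.allowed(url)` before each request and `gate.wait_turn()` right before sending it. The `clock` and `sleep` parameters exist only so the timing can be tested without real waiting.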

Most developers are respectful and only scrape what they really need, not only from an ethical point of view but also a cost and resources point of view. Scraping data is resource intensive and proxy costs can quickly rise to $1,000-$10,000 per month. So most only scrape the minimum they need.

The other thing here as well, is that a lot of the most popular sites being scraped, are also massive scrapers themselves. The big ecommerce sites are being scraped, but they are also scraping their competitors too.

travisporter 4 years ago | | |

Don’t get my home address, name, family members' names, salary, cell phone number, aggregate and sell them and claim “it’s all publicly available anyway”

Terry_Roll 4 years ago | |

You don't even need to do that: go overt, in plain sight, in yer face, and call yourself a search engine!

joe_91 4 years ago | | |

Haha I love that people forget how google/bing are out there scraping everything and anyone who scrapes anything for any other reason is a "bad guy".

You can get around some web scraping blockers by just setting your user agent as Googlebot too which I find funny...

bryanrasmussen 4 years ago | |

This sort of implies that the 'ethics' would end up meaning that you shouldn't scrape if it is not wanted, although I suppose there can be ethical or other non-commercial requirements that mean you should.

NDizzle 4 years ago |

I still have a daily job running a web scraper I first wrote with Scrapy back in 2017. I think I've had to update it 3 times over the years for changes to the site and web standards.

Good old government sites - rarely change!

bobblywobbles 4 years ago |

Not a lawyer, but many terms of service prohibit interacting with their website in an automated fashion, as well as collecting their data. In my understanding, scraping a site with these terms already puts you in the wrong.

akersten 4 years ago | |

> many terms of service prohibit interacting with their website in an automated fashion,

Ignoring the fact that I didn't agree to anything just by virtue of requesting a page from a webserver (and, your server sent me the data!), that's such a meaningless phrase that it's certainly unenforceable. What is an automated fashion? Do I have to manually craft my HTTP request by hand-pulsing a voltage on an Ethernet cable, or do I have your permission to let Chrome automate that for me?

RobSm 4 years ago | | |

Exactly this. People do not realize that when they use Chrome to view a website, Chrome is their 'scraper'.

And the goal of web scraping is not to get illegal data, but to gain efficiency and performance by not doing something manually and letting a computer do the repetitive tasks. It's a productivity tool. You can't make something illegal just because it's an automation instead of a 'manual' operation.

tommek4077 4 years ago | |

Because those terms are the law and can't be ignored in almost all the rest of the world...

cblconfederate 4 years ago |

Cloudflare's blocks get in the way of many websites who are simply trying to get a "link preview" of the page, even if it is only a single request from a new IP. I wish they would offer some kind of alternative for the pages they serve instead of a captcha block.

fareesh 4 years ago |

My toolbox of choice for web scraping is either Nokogiri or puppeteer

Can someone sell me on beautiful soup or scrapy or any of the others? Do they provide any advantages or features that I'd be missing out on?

edmundsauto 4 years ago | |

One great Scrapy feature is caching the page content. So you can essentially write a crawler, and while that’s running, you write your extraction code. Then, if you want to go back, you can add more extractors and run them against your local copy.

fareesh 4 years ago | | |

Ah interesting, I end up doing this manually, i.e. File.write followed by what I want to scrape
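For illustration, the manual version of that crawl-then-extract split could look roughly like this in Python, using only the standard library. The `PageCache` class, the `extract_title` extractor, and the injected `fetch` callable are hypothetical names for this sketch. Scrapy's own mechanism is its HTTP cache, switched on with the `HTTPCACHE_ENABLED` setting, and is not what is shown here.

```python
import hashlib
import re
from pathlib import Path


class PageCache:
    """Stores raw page bodies on disk so extractors can be rerun without refetching."""

    def __init__(self, root, fetch):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # callable: url -> page body as str

    def _path(self, url):
        # Hash the URL so any URL maps to a safe, fixed-length file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.html"

    def get(self, url):
        """Return the cached body if present; otherwise fetch once and store it."""
        path = self._path(url)
        if path.exists():
            return path.read_text(encoding="utf-8")
        body = self.fetch(url)
        path.write_text(body, encoding="utf-8")
        return body


def extract_title(html):
    """Toy extractor: pull the <title> text, or None if there is none."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None
```

New extractors can then be added later and run over the files in the cache directory without touching the target site again.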

gmanis 4 years ago |

What does HN think of web scraping for the purpose of price comparison?

I’m asking this because I run a small side project to show prices across retailers for a very small niche. The users are very very happy. Even the vendors started contacting to be listed on the comparison.

But I am unable to make a business out of it, other than a few affiliate commissions.

magixx 4 years ago | |

I worked for a company that did exactly this many years ago. (They were even able to partner with some retailers.) Their product worked well, yet they still went out of business long ago. To be honest, I don't see much value in such a service. It's not that the value doesn't exist; it's just hard to justify paying for this data.

Ian_Kerins 4 years ago |

If anyone has anything else they think was missed or should be included then let me know!

coverj 4 years ago |

I have been interested in web scraping lately but have never really dived too deep. Does anyone have more in-depth resources (GitHub projects, blogs, forums, etc.) than the tutorials that are basically "install Beautiful Soup and get data from a tag"?

JimBlackwood 4 years ago | |

Genuine question but, what more do you need?

newsbinator 4 years ago |

Like most here, I am very good at web scraping and automated form fills. I keep trying to figure out a profitable side project or business idea to make out of it and keep coming up with nothing that works.

Any good ideas?

darepublic 4 years ago |

Separate from web scraping, there is the use of automation to perform normal, allowable user actions on the site. That should be considered distinct from large-scale data extraction, no?

JJxFile 4 years ago |

The web scraping ecosystem is growing, with more libraries, frameworks and products available than ever before to simplify our web scraping headaches so the future is looking bright.

slvrspoon 4 years ago |

for those in this thread with super-serious experience scraping and automating at scale, looking for work (ethical!) please contact me directly.

blantonl 4 years ago |

I fail to understand why Web Scraping isn't almost universally viewed as unethical and a terrible and nasty business practice.

In almost all cases I view Web scraping as people trying to build businesses on top of other people's innovation and data. I know this isn't a popular opinion, so change my mind, but at the same time, I'm one of those business owners who fights with Web scraping constantly, and my opinion is that those doing it to my platforms are doing so solely to steal data and build businesses on top of others' hard work.