The State of Web Scraping 2022 (scrapeops.io)
In my opinion, 2021 was a bad year for the law as it relates to web scraping. The Supreme Court remanded hiQ Labs, and many high-profile lower-court cases ended badly for web scrapers. It's a darker shade of gray than it was in 2020. It can be navigated, but it's tricky.
I'm often reminded of the fact that in https://en.wikipedia.org/wiki/United_States_v._Swartz the scraped party JSTOR did not desire to press civil charges, but due to the criminal component of the CFAA, this was out of their hands - and the story ended in the worst possible way.
If the current legal landscape at least better restricts disputes over web scraping to civil litigation, it may not be a huge change for how companies look at their risks, but it could make a huge difference for individuals caught in the crossfire.
Scraping Facebook to make a clone of profiles shouldn’t be held to the same scrutiny as scraping Facebook to do an internal analysis of user demographics for research purposes.
https://www.zyte.com/blog/van-buren-a-victory-for-web-scrape...
https://blog.ericgoldman.org/archives/2021/06/more-perspecti...
https://blog.ericgoldman.org/archives/2021/06/more-perspecti...
The name of my firm is McCarthy Garber Law. I write about scraping there when I have time (which I rarely do)!
Since I use both of these archives together, I wrote this code to iron out the differences between them:
I'd imagine most people's use cases need data which can change from day to day or week to week but I do think that this is fantastic if I was to have a project which was looking at data across a longer timeframe.
I do think Common Crawl has a lot of potential for people to use instead of scraping, but I think it's for larger projects. It gave me the idea to look at the links to ID if they are a business or non-business website.
Are you saying you only had a problem because you didn't use a headless browser before, and that now, with both a headless browser and a proxy, it's generally enough to avoid being seen as a scraper?
This outcome was great news for web scrapers, as it means that so long as a website has made its data public, you are not in violation of the CFAA when you scrape the data, even if it is prohibited in some other way (T&Cs, robots.txt, etc).
Just because you can, doesn't mean you should. It would be better I think if there was a treatment of the ethics here, rather than a seemingly "ra-ra go bots" attitude, as though the only consideration is commercial.
- If they provide an API, then use it.
- Don't slam a website, ideally spread it out over hours of the day when their target audience is least active (night time).
- If you can get cached data from somewhere that works, then use that.
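The "don't slam a website" rule can be sketched as a per-host throttle. This is a minimal illustration, not a definitive implementation: the `PoliteThrottle` class, its 5-second default delay, and the injectable `clock`/`sleep` parameters are all assumptions made for the example.

```python
import time
from urllib.parse import urlparse


class PoliteThrottle:
    """Enforce a minimum delay between requests to the same host (hypothetical helper)."""

    def __init__(self, min_delay=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay   # assumed default; tune per site
        self._clock = clock          # injectable so the logic can be tested
        self._sleep = sleep
        self._last_hit = {}          # host -> time of the most recent request

    def wait(self, url):
        """Pause until it is polite to hit this URL's host; return seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_hit.get(host)
        delay = 0.0
        if last is not None:
            # Only wait for the part of the interval that has not elapsed yet.
            delay = max(0.0, self.min_delay - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when the request is actually allowed to go out.
        self._last_hit[host] = now + delay
        return delay
```

Calling `throttle.wait(url)` before each request spaces out hits to one host while leaving requests to other hosts unaffected.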
Most developers are respectful and only scrape what they really need, not only from an ethical point of view but also a cost and resources point of view. Scraping data is resource intensive and proxy costs can quickly rise to $1,000-$10,000 per month. So most only scrape the minimum they need.
The other thing here is that a lot of the most popular sites being scraped are also massive scrapers themselves. The big ecommerce sites are being scraped, but they are also scraping their competitors too.
You can get around some web scraping blockers by just setting your user agent as Googlebot too which I find funny...
Good old government sites - rarely change!
Ignoring the fact that I didn't agree to anything just by virtue of requesting a page from a webserver (and, your server sent me the data!), that's such a meaningless phrase that it's certainly unenforceable. What is an automated fashion? Do I have to manually craft my HTTP request by hand-pulsing a voltage on an Ethernet cable, or do I have your permission to let Chrome automate that for me?
And the goal of web scraping is not to get illegal data, but to gain efficiency and performance by not doing something manually and letting a computer do the repetitive tasks. It's a productivity tool. You can't make something illegal just because it's automated instead of a 'manual' operation.
Can someone sell me on Beautiful Soup or Scrapy or any of the others? Do they provide any advantages or features that I'd be missing out on?
I’m asking this because I run a small side project to show prices across retailers for a very small niche. The users are very very happy. Even the vendors started contacting me to be listed on the comparison.
But I am unable to make a business out of it other than a few affiliate commissions.
Any good ideas?
In almost all cases I view Web scraping as people who are trying to build businesses on top of other people's innovation and data. I know this isn't a popular opinion, so change my mind, but at the same time, I'm one of those business owners that fights with Web scraping constantly and my opinion of it is that those that are doing it to my platforms are doing so solely to steal data and build businesses on top of other's hard work.
- build an on-demand data API for a specific type of data and charge a premium for it. Good example is https://serpapi.com/ who do Google data, charge ~10X markup on proxy costs
- proxy solutions make good money. To scrape at scale you need proxies, and lots of users pay $1-5k per month. Lots of proxy solutions doing +$100k per month.
- build a tool that uses web scraped data, analyses/filters it and displays it to users. Lots of the biggest web scrapers are doing this, ex. doing product monitoring products for e-commerce companies, etc. Lots of competition there, but you can do it in new markets, like NFTs, etc.
- hedge funds will pay huge money for web data, if you have 5 years of continuous data so they can backtest it.
Do you have any examples of such sites?
> hedge funds will pay huge money for web data, if you have 5 years of continuous data so they can backtest it.
What kind of web data would they be interested in?
With all that data you can do stuff like make heatmaps from pricing data, figure out the most attractive areas for certain profiles (singles, families, ...). You could then mash up that data to produce things like a "Walkscore" or let people indicate what's important for them (green areas, bars & restaurants, time & distance to other destinations, even crime levels) and then show real estate that meets their criteria.
Some sites in the US already show this but in other countries that's not the case, while the data's all there just to grab.
Most likely it wouldn't be legal and certainly not if you made money from it. But it's incredibly fun and hugely useful. Maybe that could get you started on some ideas!
They pay $500/mo for access to a bot that will allow them to make these purchases.
Most of the community lives on discord.
But what other than artificial scarcity drives people to spend hundreds of dollars on bots to snipe sneakers?!
- Scraping public information from government websites to do analysis: ethical, it's the public's data
- Scraping to help some company's customers more effectively use that company's product, for example scraping a medical office's insurance claims to help them automate their insurance remittance process: ethical
- Scraping faces to build a surveillance-tech company: disgusting
- Scraping your own website because your internal processes are so broken you can't get it any other way: ethical
- Scraping to just copy someone's data they worked hard to generate to go and resell: unethical
Political advocacy orgs rely a lot on scraping to collect political representative data that isn't available through any other means.
- Scraping photos to create deep learning VQGAN+CLIP art generator: ethical
.. we can go on and on, but we should all agree scraping is a useful tool that should never be outlawed.
Wanted to include a slightly different application:
- Scraping multiple websites and organizing data in a new and useful way for customers: To me this would be ethical since it produces new value and does not just copy someone else's data as-is
You do not want information to be public and/or free? Put it under login and charge for it.
You want to prevent people from reusing the data you publish to build other (potentially competitive) products? Then use licensing and copyright, and the law.
However, banning a technological means because of what a minority could potentially do with it? Then make the internet illegal and the problem is fixed altogether.
And, I imagine, lots of A/B testing geared toward exactly that...keeping them on Google-owned properties.
Scraping is simply a way to get data. I used to run a team that was paid by large government contractors in the US to scrape their job posts from their career portals, and then deliver those posts via email, fax and snail mail to veteran's service officers near the job opening. It was required by regulation, and the only way to get the job data was to scrape. Many enterprise applicant tracking systems did not have a good way to automatically deliver that data or wanted $millions for that capability. Scraping was the best way and in some cases, the only way.
By the way, search engines like Google scrape data and index it.
However, there are a lot of web scraping use cases which are beneficial to the site being scraped and actually add value. Two examples:
- Google: Ahrefs & SEMRush scrape Google so they can provide SEO analytics to companies looking to grow their companies. Google's keyword analytics aren't great, so Google has effectively outsourced providing a good analytics tool to Ahrefs & SEMRush, whose products increase the value of the Google SERPs ecosystem.
- Amazon + Other E-Commerce: Amazon wants brands and 3rd party stores to list products on their site, and the companies scraping Amazon to provide product placement tools to their users make it easier and more profitable to list products on Amazon, leading to more and more companies listing products there.
Archiving is unethical?
The best way to stop someone trying to make a buck on your hard work is to go direct to their customers and do a better job. If you can't, what they're selling is something on top of your offering and you aren't serving that market, and you either should start serving it, or make a deal so the scrapers can continue to do it without impacting your service.
As someone that had to do scraping in the past, and went through having a free open API that served our needs perfectly replaced with an account based one that required we make 100x the queries, it was really frustrating that the company refused to even respond to queries for specific business accommodations to data.
- There is no external API for getting scheduled streams or when they have gone live AFAIK. This lets me be notified of new stuff to watch.
- The API for getting a channel's members is locked down. I applied for access to it 6 months ago and haven't heard anything about it from YouTube so I just scrape it to give members perks.
Why even bother having the API there - so much value can be added by people building on top of YouTube and other large sites, it's a shame that most of these large sites do nothing to provide API access and people have to go out of their way to scrape them...
Let's say you made a recipes website and I would like to build an app that will order the ingredients for a meal.
It would be useful to extract the recipes, so that I can create experiences like users picking a meal and have the ingredients delivered.
I guess I can't show your recipes as it can be copyright infringement but I can link it to you and sell the tomatoes.
Also, although copying someone's work is unethical and likely illegal, there is nothing unethical or illegal about using computers to analyse the data out there. I should be able to analyse recipe publications just as I can measure the air pollution. Web scraping comes in because the semantic web never happened.
I think we all should be able to use other people's work to build something else on top of it. Of course I do not advocate outright taking it and reselling it as our own.
For example, I would like to be able to create an app with Netflix content but obviously I don't expect to be able to stream their content as if it is mine. What I should be able to do is to create an app with an experience designed by me that lets you stream their movies if you pay them.
How would scraping, say, reddit, differ from the business model of Reddit itself?
> those that are doing it to my platforms are doing so solely to steal data
What kind of data are you talking about?
However, the disgusting data brokers that employ most of the custom scrapers, are usually unethical. That's why I don't trust any person or company that admits being involved professionally in "scraping", because most of the time that means "we collect personal information that got leaked elsewhere and sell them on".
I work for an ecommerce company and we scrape competitors for price information. Should this automated process using APIs not be okay, we’ll have humans do it. Less efficient for us, more traffic for a competitor. Should they provide a paid API with price information available, I’m sure we’d pay.
Examples include rotating proxies, rotating user agent headers. Hooks to add in middleware for processing pipelines. CLI switches to change your data output format. Nice debugging and logging.
Other large scale features include distributed crawlers. Scheduling. Monitoring UI so you can see progress via a web UI.
It’s what I reach for first, because you can be up and running with your first scraper in an hour. By hand, that’s maybe 10 minutes - but if you want to iterate, and your first scraper is a v1 rather than final effort… I think it’s definitely worth it.
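To make the "processing pipelines" idea above concrete, here is a minimal plain-Python sketch of chaining item-processing stages. It does not use any framework's actual API; the `run_pipeline`, `strip_whitespace`, and `parse_price` names and the item fields are hypothetical and only illustrate the pattern.

```python
def strip_whitespace(item):
    """Stage 1: trim surrounding whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}


def parse_price(item):
    """Stage 2: turn a price string such as '$9.50' into a float."""
    out = dict(item)
    out["price"] = float(item["price"].lstrip("$"))
    return out


def run_pipeline(items, stages):
    """Pass each scraped item through every stage, in order."""
    for item in items:
        for stage in stages:
            item = stage(item)
        yield item


raw_items = [{"name": " Widget ", "price": " $9.50 "}]
clean_items = list(run_pipeline(raw_items, [strip_whitespace, parse_price]))
```

A framework mostly saves you from writing and wiring up this kind of glue yourself, which is where the iteration benefit comes from.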
You think Google's bots read contracts before scraping a website? Really? :) If you had any experience in creating websites and launching them online, you would know how fast and often they arrive and how they do not care about your TOS. So the real 'violation' numbers might be very scary...for you.
https://ironcladapp.com/journal/contract-management/are-brow...
Suppose I point a scraper at site S1, which has terms of service that say scraping them is OK, and my scraper finds a link on S1 to S2 and follows that, and follows a link from S2 to S3, and so on.
At some site Sn far enough down that chain is it really possible to use the scraper accessing that site to infer my intent to accept Sn's contract? The connection between me and Sn seems tenuous enough that it might be hard to even argue that I intended to visit Sn, let alone use that to infer acceptance of their contract.
It isn't clear to me that there is. The difference seems to lie in intent.
You could maybe nail a group making many requests without using the data for anything as making many spurious requests and hence having ill intent, I suppose. Maybe having dedicated servers for such a task proves it even more?
Another danger is when public but not easily accessible data is able to deanonymize datasets which is probably the norm rather than the exception for anonymized datasets. Sure there are technical measures to make it better, but at the end of the day I think a lot of privacy is about respecting social boundaries and not breaking these protection measures even if technically possible. Most of the time, these measures are really about keeping honest people honest and not about stopping dedicated attackers.
I think we can't make broad statements saying that web scraping is ethical or unethical; it isn't that black or white. It really depends on what is being scraped, how it is being used, and the intention of the scraper.
It would be interesting to know if that data can be used in a court case against a government agency though.
If all the pages to be retrieved are known a priori, before retrieval begins, then one would likely call that "scraping". Whereas if not all pages are known before retrieval begins, then one would likely call that "crawling".
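The distinction above can be sketched in a few lines. This is an illustrative example under stated assumptions: `SITE` is a hypothetical in-memory stand-in for a website, and `fetch`, `scrape`, and `crawl` are names invented for the sketch rather than any library's API.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page content, outgoing links).
SITE = {
    "/": ("home", ["/a", "/b"]),
    "/a": ("page a", ["/b", "/c"]),
    "/b": ("page b", []),
    "/c": ("page c", ["/"]),
}


def fetch(url):
    """Stand-in for an HTTP request; returns (content, links)."""
    return SITE[url]


def scrape(urls):
    """Scraping: the full list of pages is known before retrieval begins."""
    return {url: fetch(url)[0] for url in urls}


def crawl(start):
    """Crawling: pages are discovered by following links during retrieval."""
    seen = {start}
    queue = deque([start])
    pages = {}
    while queue:
        url = queue.popleft()
        content, links = fetch(url)
        pages[url] = content
        for link in links:
            if link not in seen:  # avoid revisiting pages and looping forever
                seen.add(link)
                queue.append(link)
    return pages
```

Here `scrape(["/a", "/b"])` touches exactly the two pages it was given, while `crawl("/")` ends up visiting every page reachable from the start URL.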
There's a whole sneaker collecting subculture. Some buy and wear while others just collect. The big names in sneakers do release limited production models or limited runs of certain color combinations.
Similar to any other collecting subculture.
For me, this kind of product is part of the "bullshit economy" - similar to "bullshit jobs", this kind of product has no reason to exist other than vanity, as almost all of these "collectibles" won't ever be used. We are using up valuable, finite resources to create and distribute this kind of useless "bullshit product", we are using up valuable human time and IT resources on developing websites capable of resisting (D)DoS attacks and on developing snipers to bypass the anti-bot technologies employed by the shops, and we are creating a lot of demand for all kinds of sneaker-related crime - and there's a lot of that: theft and robberies from stores, theft and robberies in the supply chain, ebay/classifieds scams, credit card fraud, robberies in broad daylight [1].
Seriously, fuck all that shit. No one needs hundreds of dollars worth of sneakers that only incentivize crime and bullshit.
[1]: https://www.google.com/search?q=man+robbed+because+of+sneake...
No, that causes the problem. That encourages people to use bots to be the first one to purchase the moment the inventory is released.
I don't understand how a random raffle would ever not be fair (with the assumption that one person gets only one entry)
There's also a sort of diminishing returns effect here. If google trains people that the snippet is good enough, less traffic goes to the site. Eventually, enough to shutter the site, for some sites. Then nobody has the info.
The pattern has already affected Google referral traffic to Wikipedia. Pageviews for Wikipedia are roughly flat from 2012 to today, where they had marked growth prior. 2012 is when Google started rolling out their knowledge graph that presented Wikipedia data directly.
Any other scraping, especially when ignoring robots.txt, is unsolicited. And if said website takes additional advanced anti-scraping measures, and you persist in bypassing that too, then to me you're clearly unethical, even if it's technically legal.
"It's public" is a legal defense, not an ethical one. It's public for readers, not for scrapers. It's public within the original context of the website, which may include monetization.
Photographing every page of a book and then reading it that way may be legally allowed, but it's still unethical.
I have somebody in our neighborhood that instead of paying for private trash, takes tiny bags of his private trash to the park and dumps it into the public trash cans.
Legal? Yes. Parasitic behavior? Also yes.
You failed to make a meaningful counterpoint; the legal/ethical distinction was made clear in the parent post.
I suppose it just comes down to your own morals, but I see nothing at all unethical about scraping a site for personal use provided that it's done gently enough to avoid DoS or disruption. The idea that saving webpages to read later is parasitic or unethical if a website uses robots.txt to discourage commercial scrapers and data-mining goes way too far.
The article talks of large scale scraping, which includes all kinds of bypassing tools, proxies, hardware, or commercial services that abstract this away.
This industrial scale level of scraping is not the same thing as you saving a local copy of 3 web pages. The scale is a million times bigger and for sure it will not be for personal use.
The first mover advantage is so huge in this case that without allowing scraping, it's hard to understand how anyone could ever compete with these monoliths.
They don't.
I wrote a web server myself, albeit a specialised one, and out of curiosity I also created a few pages that were in no way accessible unless you knew their web addresses. There were no links to these pages from the home page or anywhere else, and I didn't tell anyone about them, yet in my logs I could see those pages being spidered!
My robots.txt was set up as an instruction to proceed no further, so I think there are other feedback mechanisms guiding the spiders, but I haven't worked out if it's from the web browser or from actual infrastructure like switches or routers.
Admittedly this was before HTTPS became common.
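For reference, an instruction to "proceed no further" is usually written as a blanket Disallow rule. The sketch below uses Python's standard `urllib.robotparser` to show how a compliant crawler would interpret such a file. The rule text, the `example-bot` user agent, the `example.com` URLs, and the `can_crawl` helper are assumptions for illustration, not the poster's actual setup. The file is only advisory: nothing stops a crawler that never checks it.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents: one blocks everything, one blocks a single directory.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

BLOCK_PRIVATE = """\
User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt, url, user_agent="example-bot"):
    """Return True if a crawler honouring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With `BLOCK_ALL`, every path is off limits to a well-behaved bot; with `BLOCK_PRIVATE`, only URLs under `/private/` are.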
> [...] Googlebot and other respectable web crawlers obey the instructions in a robots.txt file [...]
If you're saying this is a lie, please provide sources