[1] https://github.com/yoavaviram/python-amazon-simple-product-a...
I wrote an app that is basically a new UI for the Amazon products. It runs entirely on the client. The Amazon API simply didn't work in that setup.
Doesn't that require you to have a quota of affiliate sales to keep using it? I can't find where they state this requirement, but I remember they were very sneaky about disclosing it. If you don't have any affiliate sales after X months, your API key will stop working.
Why do I know this? Because I'm the CTO at Nazdeeq.com where we let users buy Amazon products from countries where they don't ship easily, like Pakistan.
Edit: totally open to partnerships in more countries
+ There's no direct way to buy 90% of products from Amazon since they don't ship to Pakistan
+ Our service is the only one in the country that gives a fixed price at checkout in PKR
+ Our customer service is excellent
+ We're one of the cheapest options available, as long as the competition imports products legally.
My protest against such a ridiculous heuristic was to not fix it.
This means that, unfortunately, all the traffic has to go through our own servers.
At a former employer we scraped Amazon many millions of times per day with very simple old tools that rarely needed updating.
I have not scraped a ton of actual individual product pages, though, so I can't testify about scraping those. I do remember it might have been harder.
Don’t really see that as a dealbreaker. So the library will need maintenance. Normal for libraries to need updates in order to keep up with changes. It works today, and it will work whenever it’s updated. Better than nothing and for many use cases that’s good enough.
What's more difficult is product page scraping, because there you have hundreds of different variations. Some from A/B testing and a lot just being specific things that show up for certain product categories (e.g. video games).
We brand it as an ordering API, but we also offer retrieving the product data (item details/pricing.) We put a LOT of engineering resources into data quality and maintenance, as the API is core to our flagship product, PriceYak. If you have questions or want a token, email adam@zinc.io and mention this post.
1. requests.Session() is a class. IDK what requests.session() invokes (see https://github.com/tducret/amazon-scraper-python/blob/master...).
2. Isn't one of the points of using Session() that it'll persist stuff like cookies and headers? So why is it re-defining the headers multiple times? (e.g. both GET and POST in the same session have their own respective but identical headers).
3. Is the use of `arg=""` idiomatic? For example in https://github.com/tducret/amazon-scraper-python/blob/master...
4. Using raw list indices without some kind of helper function to catch index and other errors when parsing is not really a good idea in scraping (e.g. `selection[0].text.strip()`).
Also, interestingly, only Alibaba's bots are completely blocked from crawling: https://www.amazon.com/robots.txt
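To make point 4 concrete, here is a minimal sketch of the kind of helper I mean. Everything in it is hypothetical (`safe_extract`, `first_or_none` and `FakeNode` are made-up names, and `FakeNode` only stands in for whatever element type the parser returns); it is not code from the library.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def first_or_none(items: Sequence[T]) -> Optional[T]:
    """Return the first element, or None when the selection is empty."""
    return items[0] if items else None


def safe_extract(getter: Callable[[], str],
                 default: Optional[str] = None) -> Optional[str]:
    """Run one small extraction step; return `default` if it fails."""
    try:
        return getter()
    except (IndexError, AttributeError, TypeError):
        # Empty selection, element without `.text`, or `.text` being None.
        return default


class FakeNode:
    """Hypothetical stand-in for a parsed HTML element exposing `.text`."""

    def __init__(self, text):
        self.text = text


selection = [FakeNode("  Example title  ")]
empty_selection = []

title = safe_extract(lambda: selection[0].text.strip())
missing = safe_extract(lambda: empty_selection[0].text.strip(), default="")
```

The trade-off is that a swallowed error turns into a default value, so the caller has to decide whether a missing field is acceptable or should be reported.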
The scraping itself may not be (although I'm pretty sure here in Belgium there is a law against collecting other people's data), but what you do with it may not be legal.
You could make a case for making any kind of profit generated from scraping data illegal. Don't get me wrong, I love scraping things myself.
Also find it amazing there are companies out there like Crawlera that can do serious scraping work and openly flaunt deploying tech to get around whatever scraping blockers are out there.
For example, Scrapinghub's Crawlera (the guys behind the Scrapy Python lib)
LinkedIn had multiple layers of scraping detection systems deployed, and went to significant efforts to block their data from getting scraped[1].
Last year, they were ordered by a Federal court explicitly to allow scraping of content and remove systems that were designed to impede and block scraping efforts[2].
There's no clear law (in the US) directly aimed at scraping, and repurposing anti-hacking laws brought up the murky definition of what counts as unauthorized access. If a judge clarifies explicitly that scraping is not unauthorized access (which happened in this case, although it still needs to stand up to appeal[3]), then entities that are interested in preventing scraping have lost one of their core legal underpinnings. It demonstrates why companies like Crawlera have been able to flaunt the type of serious scraping work they do: it's hard to bully people with a legal argument that has been debunked and affirmed as debunked on appeal. So it's better to avoid the risk of setting that precedent entirely until you can't avoid it.
[1] https://techcrunch.com/2016/08/15/linkedin-sues-scrapers/
[2] https://www.reuters.com/article/us-microsoft-linkedin-ruling...
[3] https://www.courthousenews.com/linkedin-takes-data-scraping-...
This raises an interesting question: if someone had a product on Amazon and had product photos they took, does Amazon still allow other sellers to use the same listing? In other words, does the seller agreement allow Amazon to reuse your (potentially expensive to produce) product photos on your competitor's product listing?
2. It doesn't really matter, and maybe it's so the headers are kept close to where they are modified. Session merges the headers you pass in a call with its own (yours take precedence), and it still handles the cookies if you provide your own headers. Session also has the benefit of connection pooling, so it's quicker to make more than one request with it[2] (the plain get, post, etc. in the requests module all go through the request function in the end, which actually constructs a Session for that single request).
3. What's wrong there? It's just a default argument. Strings aren't mutable, so it avoids this pitfall[4]. Is the " quote a problem here? It's a matter of taste/style. PEP8 is silent on it[3] and just says to pick a convention, and there seems to be one here. Some people (me too) also use single quotes for non-human-readable strings and double quotes for human-readable strings.
4. If you mean here[5], then there is a len check just above it to catch the 'expected' error (a missing element); only the .text part is unchecked. As for the general lack of checks: I don't put them into my GreaseMonkey or random one-off Python scraping code either. A given version of a scraper script assumes a fixed site layout, so if some field is missing, then the website layout must have changed and the entire script probably needs to be reworked (or the field is not always present in the first place, so the script is also useless in that particular case), and it might as well crash (or, if it's used by someone else, they can catch the exception). Either way (crash or catch), when something you expected to surely be there is missing, the results are not coming or might be wrong. The code as it is now anticipates that the element might be missing, but assumes that if it is there, it has the expected field. If the site has been observed to always work like that, then the script guards against the first possibility (missing element) but not the second (missing field): if the layout or the way data is stored in elements changed significantly, the script needs changes too, or it risks producing bad or incomplete output (arguably a worse default than a loud failure, though it also depends on what you're doing and what the scrape is for).
I'd assume most users and programmers would rather get an error than have the script return an empty list (despite there being content on the page) just because the layout changed. The only other solutions (other than returning a wrong result by design and hiding the errors, or logging them somewhere no one cares to read anyway) would be to either catch such exceptions somewhere high up and wrap them in a new exception with more information (which URL, what exactly the element content was, the entire response text, etc.), which is probably too much care/work for such a one-off script, OR throw entirely your own exceptions from somewhere low. But that's vanity, because Python exceptions point really clearly to where they were thrown and in what call stack, so it's just as clear what broke without adding lots of checks yourself and throwing an "element X is missing field Y that should always be there" message.
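For reference, the "catch high and wrap with more information" option could look roughly like this. It's only a sketch: `ScrapeError`, `parse_title` and `scrape_title` are hypothetical names, and the selection is faked with plain dicts so it runs without any parser.

```python
class ScrapeError(Exception):
    """Hypothetical wrapper that carries scrape context with the failure."""

    def __init__(self, message, url, snippet):
        # Keep the message short: only the first 80 chars of the response.
        super().__init__(f"{message} (url={url!r}, snippet={snippet[:80]!r})")
        self.url = url
        self.snippet = snippet


def parse_title(selection):
    # Deliberately unchecked, in the spirit of the code under discussion.
    return selection[0]["text"].strip()


def scrape_title(url, html, selection):
    try:
        return parse_title(selection)
    except (IndexError, KeyError, AttributeError) as exc:
        # `raise ... from exc` keeps the original traceback attached,
        # so the low-level location is still visible.
        raise ScrapeError("title not found", url, html) from exc


try:
    scrape_title("https://example.com/item", "<html></html>", [])
except ScrapeError as err:
    print(err)
```

Whether the extra class is worth it for a one-off script is exactly the judgment call above; the original exception is still reachable through `__cause__` either way.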
[1] - https://github.com/requests/requests/blob/master/requests/se...
[2] - http://docs.python-requests.org/en/master/user/advanced/#ses...
[3] - https://www.python.org/dev/peps/pep-0008/#string-quotes
[4] - http://www.effbot.org/zone/default-values.htm
[5] - https://github.com/tducret/amazon-scraper-python/blob/master...
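A quick illustration of the pitfall from [4] that point 3 refers to, and why a `""` default sidesteps it. The function names here are made up for the example.

```python
def append_bad(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # and is then shared by every call that omits `bucket`.
    bucket.append(item)
    return bucket


def append_good(item, bucket=None):
    # Usual workaround: use None as a sentinel and build a fresh list.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


def tag(value, prefix=""):
    # str is immutable, so the "" default cannot accumulate state.
    return prefix + value
```

Calling `append_bad` twice without a second argument returns the same growing list both times, while `append_good` and `tag` behave the same on every call.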
1. I never realized there was a function that just returned an instance of the class. Should've just looked it up myself.
2. I was wrong and misread the header stuff.
3. There's nothing wrong with it. It's just a convention I'm not accustomed to seeing or using. Admittedly, there are lots of ways to skin a cat with optional and default args.
4. Yeah I understand what you're saying. I guess it's a fine "greasemonkey" approach. Just more susceptible to DOM changes and code errors than I'm comfy with even for a rudimentary implementation.
I agree you'd usually rather get an error than an empty list, but I think it's a little more complicated than just choosing between the two. Sometimes you want the error but don't get it, which is why I tend to write more code around checking stuff and catching exceptions. I think the best example is that you might not see an index error, but the list item returned for your specified index isn't actually what you wanted, because the DOM changed or you wrote code against one page that broke on another one you thought would be identical.
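Here's a small sketch of what I mean by checking stuff, for the case where the index is valid but the item is the wrong one. The price format (US dollars with comma thousands separators) and the names `PRICE_PATTERN` and `extract_price` are assumptions for the example only.

```python
import re

# Assumed format: "$19.99" or "$1,299.00". Purely illustrative.
PRICE_PATTERN = re.compile(r"^\$\d{1,3}(,\d{3})*(\.\d{2})?$")


def extract_price(cells, index=0):
    """Return cells[index] only if it looks like a price, else raise."""
    value = cells[index].strip()  # an IndexError here is still a loud failure
    if not PRICE_PATTERN.match(value):
        # No IndexError happened, but the item is not what we expected.
        raise ValueError(f"expected a price at index {index}, got {value!r}")
    return value
```

With `["$1,299.00", "In stock"]` this returns the price; with the two cells swapped it raises ValueError instead of silently returning "In stock".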
A lot of our customers have similar horror stories where their goods get stuck in customs because they didn't realize clearing is a thing.