Amazon2csv: Amazon products scraper to CSV (no API token required) (github.com)

129 points by tducret 7 years ago | 56 comments

yoaviram 7 years ago |

Why not use the API? Disclaimer: I'm the author of python-amazon-simple-product-api [1]

[1] https://github.com/yoavaviram/python-amazon-simple-product-a...

k__ 7 years ago | |

Sometimes this isn't possible.

I wrote an app that is basically a new UI for the Amazon products. It runs entirely on the client. The Amazon API simply didn't work in that setup.

AznHisoka 7 years ago | |

Are you referring to the Product Advertising API?

Doesn't that require you to have a quota of affiliate sales to keep using it? I can’t find where they state this requirement, but I remember they were very sneaky about disclosing it. If you don't have any affiliate sales after X months, your API key will stop working.

ZoomStop 7 years ago | | |

Currently you have to be a member of their affiliate program to get API access. To become a full "member" you have to be a prospect who generates three referral sales (iirc) within a 30 day period. So once in you have the API, but getting in isn't as easy as filling out a form. From there you can get your API rate limits increased from the default 1x call per second up to 10 based on your prior 30 day affiliate sales.
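
To stay under the default limit mentioned above (one call per second), a client can space out its own requests. The following is a minimal sketch; the `RateLimiter` class and its `clock`/`sleep` parameters are hypothetical names chosen for illustration and are not part of any Amazon SDK.

```python
import time


class RateLimiter:
    """Spaces calls at least `min_interval` seconds apart (hypothetical helper)."""

    def __init__(self, calls_per_second=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / calls_per_second
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last_call = None   # time at which the previous call was allowed

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
            if delay > 0:
                self._sleep(delay)
        self._last_call = now + delay
        return delay
```

Calling `limiter.wait()` before each API request keeps the client at or below the configured rate; raising `calls_per_second` to 10 would correspond to the higher tier described above.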

raitucarp 7 years ago | |

Man, looks great. I also built something similar in Node.js. I implemented everything the documentation describes (a complete implementation). ICYMI:

https://github.com/Ribhnux/piranhax

wdr1 7 years ago | |

The API comes with a TOS that severely restricts what you can do with the data.

amingilani 7 years ago |

Scraping Amazon is fun and all, but when you start overdoing it they rate-limit your IP and show you my worst nightmare: the Dogs of Amazon (a 500 page with pictures)

Why do I know this? Because I'm the CTO at Nazdeeq.com where we let users buy Amazon products from countries where they don't ship easily, like Pakistan.

Edit: totally open to partnerships in more countries

jeanlucas 7 years ago | |

I'm from Brazil and what you said made me curious. I'm not sure why, but Amazon didn't catch on here. How did you solve problems like logistics and interest from the public?

amingilani 7 years ago | | |

I'm sorry, I have trouble understanding your question, but if you mean how we ship from Amazon to Pakistan and how we got people to use our service: we worked out a pipeline to get products from the US to Pakistan, and we relied on advertising + word-of-mouth. Also:

+ There's no direct way to buy 90% of products from Amazon since they don't ship to Pakistan

+ Our service is the only one in the country that gives a fixed price at checkout in PKR

+ Our customer service is excellent

+ We're one of the cheapest options available, as long as the competition imports products legally.

yasoob 7 years ago | |

Hi Amin, your platform seems nice. Just wanted to give you a heads-up that your website is being classified as ["phishing" by Avast](https://i.imgur.com/SmuuRfD.png). I think if you replace "Amazon" in the url with something else it should work fine. Best of luck!

always_good 7 years ago | | |

Reminds me of how nobody could see one of my user's avatars because the url (a hash) had started with an "ad" segment (for bucketing), as in "/avatars/ad/ad3adb33f". So adblockers blocked it.

My protest against such a ridiculous heuristic was to not fix it.

amingilani 7 years ago | | |

Thank you Yasoob! Dammit, again? I already had them whitelist our site once but I'll look into this again. Thank you!

jploh 7 years ago | |

In the Philippines there's something quite similar called Galleon. They were recently acquired, but I think they might be open to partnering. They've expanded to Thailand, if I'm not mistaken.

dewey 7 years ago | |

Are you using the API or web scraping? We never really had problems with IP banning if the traffic looks like a real user.

amingilani 7 years ago | | |

Neither, actually, we're using a heavily configured reverse-proxy.

This means that, unfortunately, all the traffic has to go through our own servers.

Jdam 7 years ago |

The issue with those tools is that Amazon changes the product layout very often and runs A/B tests heavily. I’ve once even heard that computer vision is the most stable way to scrape Amazon. I guess this library will stop working rather soon.

RhodesianHunter 7 years ago | |

>I’ve once even heard that computer vision is the most stable way to scrape Amazon

At a former employer we scraped Amazon many millions of times per day with very simple old tools that rarely needed updating.

mxvzr 7 years ago | | |

Are you able to share some details? How often did you have to get new IP addresses? What about user agent? Were the scrapers "straight to the point" like amazon2csv (i.e. make a request directly to the search page) or did they have randomized behavior (e.g. re-use sessions from time to time; click a random link on the page; start from the homepage...)? Did the scrapers ever go against amz's robots.txt directives (e.g. interacting with the cart page)? Ever heard from amz itself about your employer's activities on their site?

AznHisoka 7 years ago | | |

Same here. Scraping their search results page is easy if you have a bunch of IPs. No manipulation or workarounds needed (i.e. headless browsers, or ensuring your HTTP headers look like a real user's).

I have not scraped a ton of actual individual product pages though, so I can't speak to scraping those. I do remember it might have been harder.

mygo 7 years ago | |

> I guess this library will stop working rather soon.

Don’t really see that as a dealbreaker. So the library will need maintenance. Normal for libraries to need updates in order to keep up with changes. It works today, and it will work whenever it’s updated. Better than nothing and for many use cases that’s good enough.

hobofan 7 years ago | |

Search results scraping on Amazon is fairly stable.

What's more difficult is product page scraping, because there you have hundreds of different variations. Some from A/B testing and a lot just being specific things that show up for certain product categories (e.g. video games).

bufferoverflow 7 years ago |

I remember trying to build a scraper for Amazon. I quickly discovered that there are many types of item pages, and they change over time too. A/B testing probably. Just getting the price of the product out of their HTML markup reliably was a nightmare; I had to build a huge tree of if-this-then-maybe-that logic.
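
One common way to tame that kind of if-this-then-maybe-that tree is an ordered list of fallback patterns that are tried in turn. The sketch below uses only the standard library; the patterns in `PRICE_PATTERNS` are illustrative placeholders and do not reflect Amazon's actual markup.

```python
import re

# Ordered from most to least preferred. These patterns are illustrative
# placeholders (assumptions), not Amazon's real markup.
PRICE_PATTERNS = [
    re.compile(r'id="price-main"[^>]*>\s*\$([\d,]+\.\d{2})'),
    re.compile(r'class="price-deal"[^>]*>\s*\$([\d,]+\.\d{2})'),
    re.compile(r'data-price="([\d,]+\.\d{2})"'),
]


def extract_price(html):
    """Return the first price found as a float, or None if no layout matches."""
    for pattern in PRICE_PATTERNS:
        match = pattern.search(html)
        if match:
            # Drop thousands separators before converting, e.g. "1,299.00".
            return float(match.group(1).replace(",", ""))
    return None
```

Supporting a new page variant then means appending one pattern to the list rather than adding another branch, and a `None` result makes an unrecognized layout easy to log.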

AdamRoberts 7 years ago |

The company I work for (zinc.io) has this: https://zincapi.com/

We brand it as an ordering API, but we also offer retrieving the product data (item details/pricing.) We put a LOT of engineering resources into data quality and maintenance, as the API is core to our flagship product, PriceYak. If you have questions or want a token, email adam@zinc.io and mention this post.

ikeboy 7 years ago |

If you're using this for anything serious, it's probably better to sign up for the keepa API at about $50/month and they scrape Amazon for you. Worth it to not need to deal with the complexities.

AdamM12 7 years ago |

Nice. From my experience I've found Parsel [1] (used by scrapy) to be an easier to use HTML parsing library than Beautiful Soup. That's just imo.

[1] https://github.com/scrapy/parsel

microdrum 7 years ago |

Hm, another no-API option (at least if you are on WordPress) is: https://wpcommission.com

alex_sp 7 years ago |

So how many calls is one allowed before getting banned? Any guidelines on how to use this without breaching T&Cs?

staticautomatic 7 years ago |

Am I the only one who thinks this is rather weird, or at least unconventional code for a scraper in Python?

dec0dedab0de 7 years ago | |

I just took a glance, but nothing seemed too off. Do you care to elaborate?

staticautomatic 7 years ago | | |

Sure. I'm not really trying to criticize the code, it's just that a lot of this looks foreign and unconventional to me.

1. requests.Session() is a class. IDK what request.session() invokes (see https://github.com/tducret/amazon-scraper-python/blob/master...).

2. Isn't one of the points of using Session() that it'll persist stuff like cookies and headers? So why is it re-defining the headers multiple times? (e.g. both GET and POST in the same session have their own respective but identical headers).

3. Is the use of `arg=""` idiomatic? For example in https://github.com/tducret/amazon-scraper-python/blob/master...

4. Using raw list indices without some kind of helper function to catch index and other errors when parsing is not really a good idea in scraping (e.g. `selection[0].text.strip()`).
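
For point 4, the kind of helper meant here could look like the sketch below. `first_text` is a hypothetical name, not a function from the amazon-scraper-python repository; it assumes only that a selection is a list-like of elements exposing a `.text` attribute.

```python
def first_text(selection, default=None):
    """Return the stripped text of the first element in `selection`.

    Falls back to `default` when the selection is empty or None, or when
    the first element has no usable text.
    """
    try:
        return selection[0].text.strip()
    except (IndexError, AttributeError, TypeError):
        # IndexError: empty selection; AttributeError: `.text` is missing
        # or None; TypeError: the selection itself is None.
        return default
```

A call such as `first_text(selection, default="")` then replaces the bare `selection[0].text.strip()`, so a layout change yields a default value instead of an exception. On point 2, a `requests.Session` object does carry a `headers` mapping that is sent with every request made through that session.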

RobLach 7 years ago | |

If it works...

kull 7 years ago |

It is also illegal to scrape AZ, since if you scrape it, it means you don’t own this content and you are just stealing product data added to the site by the products' proper owners.

zeusk 7 years ago | |

why aren't Larry and Sergey behind bars, then? Scraping publicly available information is far from illegal.

Also, interestingly, only Alibaba's bots are completely blocked from crawling: https://www.amazon.com/robots.txt
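
Checking what a robots.txt file allows for a given user agent can be done with Python's standard `urllib.robotparser`. The sketch below parses an inline sample so that it runs without network access; `SAMPLE_ROBOTS_TXT`, `ExampleBot`, `my-scraper`, and the example.com URLs are made up for illustration and are not taken from Amazon's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT a copy of Amazon's file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/

User-agent: ExampleBot
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A generic agent may fetch search pages but not the cart.
search_ok = parser.can_fetch("my-scraper", "https://example.com/s?k=laptop")
cart_ok = parser.can_fetch("my-scraper", "https://example.com/cart/view")

# An agent that is disallowed from "/" may fetch nothing.
blocked_ok = parser.can_fetch("ExampleBot", "https://example.com/s?k=laptop")
```

To check a live site instead, `RobotFileParser.set_url(...)` followed by `read()` fetches and parses the real file.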

stef25 7 years ago | | |

> Scraping publicly available information is far from illegal.

The scraping itself may not be (although I'm pretty sure here in Belgium there is a law against collecting other people's data), but what you do with it may not be legal.

You could make a case for making any kind of profit generated from scraping data illegal. Don't get me wrong, I love scraping things myself.

Also find it amazing there are companies out there like Crawlera that can do serious scraping work and openly flaunt deploying tech to get around whatever scraping blockers are out there.

kull 7 years ago | | |

Check the Amazon API T&C. Also try to do the same with Craigslist and see how long they will let you do it. Scraping data is always a shady business if you do it without the permission of the content owner.

smt88 7 years ago | |

Why would the owner of a product want to keep their product info a secret?

kull 7 years ago | | |

For example, people take product data and copy it to eBay, then try to dropship by getting products from your FBA. People pay big money for nice product photos, and then somebody just comes and takes them as their own.