An open source API for web scraping

An open source API for web scraping (github.com)

19 points by owainlewis 11 years ago | 10 comments

owainlewis 11 years ago

An example showing how to grab all the stories from the Hacker News homepage

https://falkor-api.herokuapp.com/api/query?url=https://news....
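A minimal sketch of how such a query URL might be assembled. The endpoint and the `url` parameter come from the link above; the link is truncated before any further parameters, so the `selector` parameter name and the example values are assumptions made for illustration only.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the links in this thread.
BASE = "https://falkor-api.herokuapp.com/api/query"


def build_query(page_url, selector):
    """Build a query URL for a page and a CSS selector.

    "url" is visible in the thread's links. "selector" is a hypothetical
    parameter name, because the links are cut off before it.
    """
    return BASE + "?" + urlencode({"url": page_url, "selector": selector})


# Example values are placeholders, not taken from the thread.
print(build_query("https://example.com/", "a.title"))
```

`urlencode` percent-encodes the target URL, which matters here because the page address is itself passed as a query-string value.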

_jomo 11 years ago

The title should probably contain 'Show HN:'?

Very interesting though. Just tried scraping Twitter and it works great: https://falkor-api.herokuapp.com/api/query?url=https://twitt...

Edit: works great as long as there are no quotes, hashtags, or links in the tweets. Is it possible to include sub-elements?

So basically this is a DOM API in JSON. Simple, but I like it.

Any plans to add JSONP support?
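For context, JSONP support usually amounts to wrapping the JSON response in a call to a function named by the client. The sketch below shows the general technique, not the project's implementation; `to_jsonp` and the callback name are hypothetical.

```python
import json
import re

# Conservative pattern for a JavaScript identifier or dotted path.
# Validating the name keeps clients from injecting arbitrary script.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][\w$]*(\.[A-Za-z_$][\w$]*)*")


def to_jsonp(payload, callback):
    """Wrap a JSON-serialisable payload in a call to `callback`."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid callback name")
    return f"{callback}({json.dumps(payload)});"


print(to_jsonp({"text": "hello"}, "handleResult"))
# handleResult({"text": "hello"});
```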

owainlewis 11 years ago

Hey. Thanks. Yeah, I will add a ton of features over the next few days. JSONP should be an easy one. Feel free to add an issue on GitHub and I'll get it done for you.

I only started hacking around on the idea the other day, so it's early stages. I want to add filters so you can say "grab me only the text" or "grab me just the class names". Obviously another step would be to grab multiple elements in one request.
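One way such filters could work, sketched under the assumption that each matched element is represented as a dictionary with `tag`, `attrs`, and `text` keys. That shape, the `apply_filter` name, and the filter names are all hypothetical; the thread does not show the actual response format.

```python
def apply_filter(elements, only):
    """Reduce each element to a single field.

    `elements` is a list of dicts in an assumed shape:
    {"tag": str, "attrs": dict, "text": str}.
    """
    if only == "text":
        return [el["text"] for el in elements]
    if only == "class":
        # Elements without a class attribute yield None.
        return [el["attrs"].get("class") for el in elements]
    raise ValueError(f"unknown filter: {only}")


elements = [
    {"tag": "a", "attrs": {"class": "story", "href": "/1"}, "text": "First"},
    {"tag": "a", "attrs": {"href": "/2"}, "text": "Second"},
]
print(apply_filter(elements, "text"))   # ['First', 'Second']
print(apply_filter(elements, "class"))  # ['story', None]
```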

getriver 11 years ago

A better error message would be helpful. For example, I tried https://falkor-api.herokuapp.com/api/query?url=https://kodin... and all I got was "Request failed".

owainlewis 11 years ago

That's a good point. I pretty much wrote this in an evening or two, so I haven't had time to refine it much. But yeah, error messages will definitely be improved. The failure comes from the way URLs are handled in the underlying web app, so it will be an easy fix.

Jake232 11 years ago

Cool idea. This could easily be extended to support something like a proxy pool, so you can rate limit and rotate proxies for a given domain globally at the server level. It would then apply across all your projects, rather than having to be done on a per-project basis.
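A rough sketch of the proxy-pool idea: one shared object that rotates proxies round-robin per domain and refuses requests that arrive too soon after the previous one for that domain. The class name, the proxy addresses, and the one-second interval are illustrative assumptions, and a real service would also need locking and persistence.

```python
import itertools
import time
from urllib.parse import urlsplit


class ProxyPool:
    """Per-domain proxy rotation with a minimum interval between requests."""

    def __init__(self, proxies, min_interval, clock=time.monotonic):
        self._proxies = list(proxies)
        self._min_interval = min_interval  # seconds between requests per domain
        self._clock = clock
        self._cycles = {}  # domain -> round-robin iterator over proxies
        self._last = {}    # domain -> time of the last granted request

    def acquire(self, url):
        """Return the next proxy for the URL's domain, or None if rate limited."""
        domain = urlsplit(url).hostname
        now = self._clock()
        last = self._last.get(domain)
        if last is not None and now - last < self._min_interval:
            return None
        self._last[domain] = now
        if domain not in self._cycles:
            self._cycles[domain] = itertools.cycle(self._proxies)
        return next(self._cycles[domain])


# Hypothetical proxy addresses.
pool = ProxyPool(["proxy-a:8080", "proxy-b:8080"], min_interval=1.0)
print(pool.acquire("https://example.com/page1"))  # proxy-a:8080
print(pool.acquire("https://example.com/page2"))  # None, same domain too soon
```

Keeping the state keyed by domain is what makes the limit global: every project that goes through the same pool shares one budget per target site.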

Adding XPath support as well as CSS selectors would be a good addition.
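To illustrate what an XPath-based query looks like, here is a small sketch using the limited XPath subset in Python's standard `xml.etree.ElementTree`. It only handles well-formed markup, and the document and the `select_xpath` helper are made up for the example; real-world HTML would need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a scraped page.
doc = ET.fromstring(
    "<html><body>"
    "<a class='story' href='/1'>First</a>"
    "<a class='other' href='/2'>Second</a>"
    "<a class='story' href='/3'>Third</a>"
    "</body></html>"
)


def select_xpath(root, path):
    """Return the text of every element matching an XPath expression."""
    return [el.text for el in root.findall(path)]


# Roughly equivalent to the CSS selector a.story.
print(select_xpath(doc, ".//a[@class='story']"))  # ['First', 'Third']
```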

owainlewis 11 years ago

Will definitely do something with caching and rate limiting when I get some time. These queries are quite expensive, so it definitely needs a bit of work in those areas.
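Since each query fetches and parses a remote page, a short-lived cache is one common way to cut the cost. The following is a minimal in-memory sketch, not the project's approach; `TTLCache`, the 60-second lifetime, and `fetch_page` are assumptions for illustration.

```python
import time


class TTLCache:
    """Cache results for a fixed number of seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # key -> (time stored, value)

    def get_or_fetch(self, key, fetch):
        """Return a fresh cached value, or call `fetch` and store its result."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = fetch()
        self._entries[key] = (now, value)
        return value


calls = []


def fetch_page():
    # Stand-in for the expensive fetch-and-parse step.
    calls.append(1)
    return {"status": "ok"}


cache = TTLCache(ttl=60.0)
cache.get_or_fetch("https://example.com/", fetch_page)
cache.get_or_fetch("https://example.com/", fetch_page)
print(len(calls))  # 1, the second lookup is served from the cache
```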

owainlewis 11 years ago

An example query that extracts all the images from the Digg.com homepage.

https://falkor-api.herokuapp.com/api/query?url=http://digg.c...

curiously 11 years ago

Pretty interesting. Last year I wrote a web scraping API that you could paste into your browser to download results, but I took it down to work on another project. You can take a look at what a URL could look like.

https://web.archive.org/web/20140420162639/http://scrape.ly/

For example, if you wanted the profiles of the authors of today's stories:

    http://scrape.ly/s/{http://news.combination.com}
    {'ueoma87'}*{'next':'Next Page'}{'karma':'331', 
    'username':'ueoma87'}

That would have returned the profile of each story's author for today, yesterday, and so on.

owainlewis 11 years ago

Thanks. This looks really interesting. I may well borrow some ideas ; )