Ask YC: Blog parsing (WordPress, Typepad, Blogger)

I'm trying to develop a crawler that can tell whether the page sent to it is actually a post page, and not an index, search, tag, or calendar (e.g. November 2008) page.

I want pages like this -> http://1vibe.net/music/jim-jones-ft-lil-wayne-noe-twista-jackin-swagga-from-us/
Not this -> http://1vibe.net/category/behind-the-scenes/
Not this -> http://1vibe.net/2008/11/
Not this -> http://1vibe.net/tag/50-cent/

From the blog post page I want to grab the title and date of that post.

The way I was trying to do it was to look through the DOM of each site and look for consistency. I found consistency in Blogger and Typepad, but WordPress formatting was all over the place from site to site. So I figure I must have been doing it wrong, and that there is a more intelligent way of doing it using the XML, RDF, or feeds.

I'd appreciate it if anyone could help (I'm also doing this in PHP).
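
For illustration, here is a minimal sketch of the two ideas in the question: filtering out non-post URLs by their shape, and reading titles and dates from a feed instead of the DOM. It is written in Python for brevity; the same logic should map onto PHP's `parse_url`, `preg_match`, and SimpleXML. The pattern list, the function names, and the sample feed are all assumptions made up for this example. They are derived only from the URL shapes listed above and will not cover every blog.

```python
import re
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from urllib.parse import urlparse

# Hypothetical patterns for pages that are NOT single posts, based on the
# URL shapes in the question (category, tag, and date-archive pages).
NON_POST_PATTERNS = [
    re.compile(r"^/$"),                                   # site index
    re.compile(r"^/(category|tag|author|page|search)/"),  # listing pages
    re.compile(r"^/\d{4}(/\d{1,2}){0,2}/$"),              # /2008/ or /2008/11/
]

def is_probable_post_url(url):
    """Guess whether a URL points at a single post, using its path only."""
    path = urlparse(url).path or "/"
    if not path.endswith("/"):
        path += "/"
    return not any(p.search(path) for p in NON_POST_PATTERNS)

def posts_from_rss(xml_text):
    """Extract link, title, and date from each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        link = item.findtext("link")
        title = item.findtext("title")
        pub_date = item.findtext("pubDate")
        posts.append({
            "link": link.strip() if link else None,
            "title": title.strip() if title else None,
            # RSS 2.0 dates use the RFC 822 format, which email.utils parses.
            "date": parsedate_to_datetime(pub_date) if pub_date else None,
        })
    return posts

# Made-up sample feed, standing in for what a blog's feed URL might return.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Example blog</title>
<item>
<title>First post</title>
<link>http://example.com/music/first-post/</link>
<pubDate>Mon, 10 Nov 2008 12:00:00 +0000</pubDate>
</item>
</channel></rss>"""

if __name__ == "__main__":
    print(is_probable_post_url("http://1vibe.net/tag/50-cent/"))
    for post in posts_from_rss(SAMPLE_RSS):
        print(post["link"], post["title"], post["date"])
```

The feed route avoids per-theme DOM differences because the feed lists each post's permalink, title, and date in a fixed structure. Blogger feeds are typically Atom rather than RSS, so a real crawler would likely need a second branch for namespaced `entry` elements; that part is not shown here.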