Show HN: Days Since Last Elon (dayssincelastelon.com) A toy project I created to track the appearance of the text "Elon" on the front pages of various news sites.
Also, maybe you shouldn't be counting news aggregators like Google News? It's basically double counting, since the story is already on some other site.
I'm basically scanning for <a> tags and searching the text within. Inspecting Google News, it appears that their links actually have no text, but are sibling elements of an <h#> tag. So, I need to figure out how to parse that correctly...
I just checked Google News myself, and you are correct that the sibling <h#> tag has the text. However, the <a> tag with the link has it too, but as an attribute instead of being nested inside. Unless I am mistaken about the use case of that attribute here, you can just extract the text from the aria-label attribute of the <a> tag.
And in case you want to proceed with parsing text from the sibling <h#> tag instead, you can just get the list of the parent <article> tag's child nodes (yourAnchorTagNode.parentNode.parentNode.children; I had to do a double .parentNode, because the <a> tag is wrapped in a single <div> tag) and then search for the only <h#> tag there. That will be your target tag with the text.
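In case it helps, here is a rough Python sketch of that parent/parent/children walk. It assumes the markup has already been turned into well-formed XML (real pages usually are not) and that the layout is exactly <article> > <div> > <a> with the heading as a direct child of <article>, as described above. The function name heading_for_anchor is made up for illustration. ElementTree has no .parentNode, so the sketch builds a child-to-parent map first.

```python
import re
import xml.etree.ElementTree as ET

# Matches heading tags h1 through h6 (the "<h#>" in the discussion).
HEADING = re.compile(r"^h[1-6]$")


def heading_for_anchor(root, anchor):
    """Return the text of the heading that is a sibling of the anchor's wrapper.

    Mirrors anchor.parentNode.parentNode.children: go up two levels from the
    <a> element, then scan the direct children for the first heading tag.
    Returns None if the expected structure is not there.
    """
    # ElementTree elements do not know their parent, so map child -> parent.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    wrapper = parent_of.get(anchor)    # the single <div> around the <a>
    article = parent_of.get(wrapper)   # the element two levels up
    if article is None:
        return None

    for child in article:
        if HEADING.match(child.tag):
            return "".join(child.itertext()).strip()
    return None
```

This only covers the one layout described in the thread; any site that nests things differently would fall through to None.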
What seemed off to you?
Can't open Twitter without one of Elon's tweets on top of it.
I was _hoping_ to get away with the same XML parsing for each site, but I guess I'll need to customize.
My logic is that it is very unlikely that another website will copy the exact HTML layout of Google News, so the <h#> approach is only going to work there. But I bet that Google News is far from the only website that has the article title text inside the aria-label attribute of the <a> tag.
So you can cover a large majority of the websites you care about (if not all of them) by just checking both the inner text and (in case the inner text is absent) the aria-label attribute. No need for any custom logic implemented just for Google News, as it would likely solve this issue for a lot of other sources.
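A minimal sketch of that "inner text first, aria-label as fallback" check, using only Python's standard-library html.parser. I don't know what the project actually uses, so the names LinkTextParser, link_texts, and mentions are hypothetical, and the sketch ignores edge cases such as text hidden by CSS.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects one string per <a> element: its inner text, else its aria-label."""

    def __init__(self):
        super().__init__()
        self.link_texts = []
        self._depth = 0          # greater than 0 while inside an <a> element
        self._chunks = []        # text pieces seen inside the current <a>
        self._aria_label = ""

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1
            if self._depth == 1:
                self._chunks = []
                # Attribute values can be None for valueless attributes.
                self._aria_label = dict(attrs).get("aria-label") or ""

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace; fall back to aria-label if nothing is left.
                inner = " ".join("".join(self._chunks).split())
                self.link_texts.append(inner or self._aria_label.strip())


def link_texts(html):
    """Return the effective text of every link in the given HTML string."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.link_texts


def mentions(html, needle="Elon"):
    """True if any link's effective text contains the needle."""
    return any(needle in text for text in link_texts(html))
```

With this, a link like `<a href="/y" aria-label="Elon again"></a>` counts the same as `<a href="/x">Elon again</a>`, without any Google News specific branch.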