Words growing or shrinking in Hacker News titles: a tidy analysis

Words growing or shrinking in Hacker News titles: a tidy analysis (varianceexplained.org)

118 points by var_explained 9 years ago | 14 comments

flavio81 9 years ago |

A frequent post then would be:

"Using VR to train a deep learning neural network on driving and react correctly to unexpected conditions, a bot implemented via a microservices stack using AWS as a container and of course connected with cars and related traffic devices via the IoT, logging unexpected events into a blockchain."

forgot-my-pw 9 years ago | |

Missing "HN" somewhere.

Mz 9 years ago | |

Just by eyeballing it, I am pretty sure that exceeds 80 chars.

robteix 9 years ago |

I'm surprised both "NSA" and "surveillance" are two of the fastest shrinking words. I thought we saw more now than ever. Shows how perception doesn't always match reality.

jdminhbg 9 years ago | |

When the Snowden leaks first dropped, the front page was absolutely overwhelmed with NSA news, to the exclusion of nearly everything else. It would not be possible to keep that level of interest up without making this an exclusively NSA/surveillance-driven site.

Houshalter 9 years ago | | |

IIRC the mods also soft banned it because of that. So posts with the word "NSA" in the title get penalized and ranked much lower than other posts. Hence the shrinking.

minimaxir 9 years ago |

Hmm, the BigQuery HN dataset is now updated daily and contains comments as well as stories? That's new, and I'll certainly give it another look for my projects.

With the bigrquery R package (https://github.com/rstats-db/bigrquery), you can access the HN dataset directly from R, using dplyr syntax too (for simple queries at least; you can pass raw SQL for complex queries).

As noted, the resulting dataset of words is large, so mapping the words in BigQuery itself may be more practical (using a combo of SPLIT and UNNEST with standard SQL), although of course you can't do complex operations like logistic regression or splines there.
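A rough sketch of the SPLIT + UNNEST idea, in Python only to keep it self-contained. The table name and the `title`, `type`, and `timestamp` columns are assumptions about the public dataset's schema, not something confirmed in the post, and both function names are hypothetical. The second function is a local stand-in that performs the same word counting on a handful of rows, so the logic can be checked without a BigQuery account.

```python
from collections import Counter


def build_word_count_query(table="bigquery-public-data.hacker_news.full"):
    """Return a standard SQL query counting title words per year.

    The default table name and the column names (title, type, timestamp)
    are assumptions; adjust them to the actual schema.
    """
    return f"""
        SELECT EXTRACT(YEAR FROM `timestamp`) AS year,
               word,
               COUNT(*) AS n
        FROM `{table}`,
             UNNEST(SPLIT(LOWER(title), ' ')) AS word  -- one row per word
        WHERE type = 'story' AND title IS NOT NULL AND word != ''
        GROUP BY year, word
    """


def count_words_by_year(rows):
    """Local equivalent of the query: rows is an iterable of (year, title)."""
    counts = Counter()
    for year, title in rows:
        # Splitting on a single space mirrors SPLIT(title, ' ').
        for word in title.lower().split(" "):
            if word:  # skip empty strings produced by repeated spaces
                counts[(year, word)] += 1
    return counts


if __name__ == "__main__":
    sample = [(2014, "NSA surveillance NSA"), (2015, "Rust  1.0 released")]
    for (year, word), n in sorted(count_words_by_year(sample).items()):
        print(year, word, n)
```

Doing the aggregation in BigQuery means only the (year, word, n) counts come back to R or Python, and the regression or spline fitting can then run locally on that much smaller table.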

SippinLean 9 years ago |

>I don’t currently have a guess for why “million” and “billion” had sudden dropoffs in 2014. Is it some artifact of the Hacker News policy, with the word becoming edited or deleted in newer posts? Or is it a real change in what the site discusses?

Any guesses on this one?

aswanson 9 years ago |

A more interesting analysis would be comment length.

minimaxir 9 years ago | |

In my old analysis (http://minimaxir.com/2014/10/hn-comments-about-comments/), it's not that interesting.

Comments are getting longer over time on average (http://minimaxir.com/img/hn-comments/monthly_average_words.p...), and there is a slight positive correlation between comment score and comment length (http://minimaxir.com/img/hn-comments/distribution_comment_po...), but that can't be remade with the BigQuery dataset since comment scores are no longer public.

drenvuk 9 years ago |

It would be nice to see a comparison of fastest growing words between the last 5 years vs 10 years ago. I'm wondering about the demographics of this site and if they've changed.

joecool1029 9 years ago |

I am extremely surprised rust wasn't included in here.

var_explained 9 years ago | |

I've got a followup coming about what words lead to upvotes, and rust features quite prominently there!