Ask PG: HN Ngram Viewer?

Since writing a scraper to discover and parse all historical comments/submissions on HN would obviously get me in trouble, would the HN admins be willing to provide a dump of the historical text/metadata from all comments and [local] submissions so I can make an HN Ngram Viewer for the HN public?

I work in an academic lab where I'm one of the developers of a system that generates ngram viewers from large corpora of text. We call these viewers "Bookworms". Here are a few Bookworms we've created:

arXiv scientific publications: http://bookworm.culturomics.org/arxiv/

US Congress legislation: http://bookworm.culturomics.org/congress/

Open Library books: http://bookworm.culturomics.org/OL/

Chronicling America historical newspapers: http://bookworm.culturomics.org/ChronAm/

Social Science Research Network research paper abstracts: http://bookworm.culturomics.org/ssrn/

We have more Bookworms in the pipeline, including historical legislation in the UK and a massive corpus of texts (70MM+ documents) from the National Library of Australia (Trove) spanning multiple centuries. A new GUI for all our Bookworms will also be rolling out shortly. (Preview: http://bookworm.culturomics.org/new_gui_teaser.png)

In my opinion, HN would be an awesome candidate for an ngram viewer because so many topics come, go, or stay here. You could track the frequency of discussions about web technologies, programming languages, companies/services, the NSA, etc.

If this is something the HN admins would be interested in, I'd be happy to put it together. If a privacy agreement is desired before passing off any bulk data, that is not a problem. We've gone this route before, albeit only for private ngram viewers we've created for companies, like the NYT, to use internally.