Latent Semantic Analysis Tutorial (puffinwarellc.com)
Random projection is something you should be aware of if you do any kind of high-dimensional modeling. It is magic.
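For anyone who hasn't seen it, here is a minimal sketch of the idea, assuming a plain Gaussian projection matrix (one common variant, not the only one). The helper name `random_projection` and the toy sizes are made up for illustration. It multiplies the data by a random matrix and checks that the distance between two points roughly survives the trip to fewer dimensions.

```python
import numpy as np

def random_projection(X, n_components, seed=0):
    """Project rows of X down to n_components dims (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Gaussian entries with variance 1/n_components, so squared distances
    # are preserved in expectation.
    R = rng.normal(0.0, 1.0 / np.sqrt(n_components),
                   size=(X.shape[1], n_components))
    return X @ R

# Toy data: 50 points in 2,000 dimensions (sizes chosen arbitrarily).
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2000))
Y = random_projection(X, n_components=500)

# Compare one pairwise distance before and after projecting.
orig_dist = np.linalg.norm(X[0] - X[1])
proj_dist = np.linalg.norm(Y[0] - Y[1])
ratio = proj_dist / orig_dist
print(f"distance ratio after projection: {ratio:.3f}")
```

The ratio should land close to 1, and it gets tighter as you keep more components.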
The biggest problem I then run into is choosing a "k" (the number of dimensions kept in the truncation). I have had some thoughts about supervising this otherwise unsupervised method (providing labeled data for what "oughta" be the top nearest neighbors for a given entity, and optimizing toward that) or building an ensemble method on top of many truncated SVD vector spaces (though the combination method is unclear to me: pick kNN from a linear combination of each model's outcomes? Pick the intersection of each model's k nearest neighbors?)
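To make the "intersection" idea concrete, here is a rough sketch under my own assumptions: cosine similarity for the neighbors, a random toy count matrix, and an arbitrary list of truncation sizes. The helpers `embed` and `top_neighbors` are hypothetical names, and this is one possible way to combine the models, not a tested recipe.

```python
import numpy as np

def embed(U, s, k):
    """Row embeddings from the top-k singular triplets (hypothetical helper)."""
    return U[:, :k] * s[:k]

def top_neighbors(E, idx, n):
    """Indices of the n rows most cosine-similar to row idx, excluding idx."""
    norms = np.linalg.norm(E, axis=1)
    norms[norms == 0] = 1.0          # avoid dividing by zero for empty rows
    unit = E / norms[:, None]
    sims = unit @ unit[idx]
    sims[idx] = -np.inf              # never return the query row itself
    return set(np.argsort(-sims)[:n].tolist())

# Toy "document x term" count matrix (random data, for illustration only).
rng = np.random.default_rng(0)
A = rng.poisson(0.3, size=(40, 200)).astype(float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

ks = [5, 10, 20]                     # assumed candidate truncation sizes
neighbor_sets = [top_neighbors(embed(U, s, k), idx=0, n=10) for k in ks]

# Keep only the neighbors that every truncation agrees on.
consensus = set.intersection(*neighbor_sets)
print(sorted(consensus))
```

The consensus set can come back small or even empty, which is itself a hint that the neighbor lists are sensitive to k.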
To novices looking at this tutorial: NumPy's a wonderful tool for small toy examples, but at a certain scale you will depend heavily on the sparse matrix formats that SciPy provides. (Those and random projections should curb most memory problems you'll run into on vector space-based problems, unless you are operating at Google/Yahoo scale or your target is TBs of logging data.)
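As a small illustration of the sparse route, here is a sketch using `scipy.sparse` with `scipy.sparse.linalg.svds` for the truncated SVD. The matrix is random filler standing in for a real term-document matrix, and the sizes, density, and k=20 are arbitrary assumptions.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# A 1000 x 5000 matrix with ~0.1% nonzeros, stored in CSR format.
# (Random values as a stand-in for real document-term counts.)
A = sparse_random(1000, 5000, density=0.001, format="csr", random_state=0)

# Truncated SVD that never builds the dense matrix.
U, s, Vt = svds(A, k=20)

# svds does not promise descending order, so sort the triplets ourselves.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order]

# Rows of U scaled by the singular values act as 20-d document vectors.
doc_vectors = U * s
print(A.nnz, "stored values instead of", A.shape[0] * A.shape[1])
```

The dense version of the same matrix would hold five million floats; the CSR version stores only the few thousand nonzeros plus their indices.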
In some cases what I get isn't much better than simply using PCA, but overall t-SNE is superior, although it is dreadfully slow. Below is a link to an implementation used for text, and I can highly recommend the original paper on t-SNE:
http://homepage.tudelft.nl/19j49/t-SNE.html
What are the differences among latent semantic analysis (LSA), latent semantic indexing (LSI), and singular value decomposition (SVD)?
http://stats.stackexchange.com/questions/4735/what-are-the-d...