Large scale topic modeling made practical
Bjarne Ørum Wahlgreen and Lars Kai Hansen
In: Machine Learning for Signal Processing (MLSP), 2011 IEEE International Workshop on, 18-21 Sep 2011, Beijing, China.
Topic models are of significant theoretical and practical interest. They are used for query expansion and retrieval organization in information retrieval, and may be important components in services such as recommender systems and user-adaptive advertising. With very large document databases, storing entire corpora in system memory may be impossible, and distribution is paramount. We propose parallel computation of a Probabilistic Latent Semantic Indexing (PLSI) model estimated by Non-negative Matrix Factorization (NMF), with
distribution along the document dimension. Liu et al. apply a fine-grained MapReduce implementation to set an NMF decomposition 'world record', which we challenge both in the size of the problem solved and in time per iteration.
A term list based on WordNet reduces the memory footprint on each compute node while maintaining accuracy on par with a much larger, case-specific term list.
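The paper's contribution is the distributed, document-partitioned implementation; that system is not reproduced here. As a non-distributed illustration of the underlying factorization only, the following is a minimal sketch of NMF with multiplicative updates under the generalized Kullback-Leibler objective, the divergence under which NMF and PLSI estimation coincide. The function name, parameters, and the small smoothing constant `eps` are our own choices, not from the paper.

```python
import numpy as np

def nmf_kl(V, k, n_iter=100, eps=1e-9, seed=0):
    """Factor a nonnegative term-document matrix V (n_terms x n_docs)
    as V ~ W @ H with k topics, using Lee-Seung multiplicative updates
    for the generalized KL divergence. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps  # term-topic weights
    H = rng.random((k, m)) + eps  # topic-document weights
    for _ in range(n_iter):
        # Update H: H_ab *= sum_i W_ia V_ib/(WH)_ib / sum_i W_ia
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        # Update W: W_ia *= sum_b V_ib/(WH)_ib H_ab / sum_b H_ab
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```

In a distributed setting of the kind the abstract describes, partitioning V by columns (documents) lets each node update its slice of H locally, with only the much smaller W and its column sums exchanged between nodes.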