PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Large scale topic modeling made practical
Bjarne Ørum Wahlgreen and Lars Kai Hansen
In: Machine Learning for Signal Processing (MLSP), 2011 IEEE International Workshop on, 18-21 Sep 2011, Beijing, China.

Abstract

Topic models are of significant theoretical and practical interest. They are used for query expansion and retrieval organization in information retrieval, and may be important components in services such as recommender systems and user-adaptive advertising. With very large document databases, storing entire corpora in system memory may be impossible, and distributing the computation is paramount. We propose parallel computation of a Probabilistic Latent Semantic Indexing (PLSI) model estimated by Non-negative Matrix Factorization (NMF), with distribution along the document dimension. Liu et al. apply a fine-grained MapReduce implementation to set an NMF decomposition ’world record’, which we challenge both in the size of the problem solved and in time per iteration. A term list based on WordNet reduces the memory footprint on each compute node while maintaining accuracy on par with a much larger case-specific vocabulary.
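
The idea of distributing NMF along the document dimension can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the Kullback-Leibler form of NMF with Lee-Seung multiplicative updates (the variant commonly linked to PLSI), a term-by-document count matrix split column-wise into shards, and a plain Python loop standing in for the compute nodes. All function and variable names are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def update_shard(X_s, W, H_s):
    """Work done on one (hypothetical) node holding a block of documents.

    X_s: terms x docs_in_shard counts, W: terms x topics (shared),
    H_s: topics x docs_in_shard (local to this shard).
    """
    # Local update of the document loadings; needs only W and the shard.
    R = X_s / (W @ H_s + EPS)
    H_s = H_s * (W.T @ R) / (W.sum(axis=0)[:, None] + EPS)
    # Partial sums for the shared W update, to be reduced across shards.
    R = X_s / (W @ H_s + EPS)
    return H_s, R @ H_s.T, H_s.sum(axis=1)


def nmf_kl_sharded(X_shards, n_topics, n_iter=50, seed=0):
    """KL-NMF where only W-sized statistics cross shard boundaries."""
    rng = np.random.default_rng(seed)
    n_terms = X_shards[0].shape[0]
    W = rng.random((n_terms, n_topics)) + 0.1
    H_shards = [rng.random((n_topics, X_s.shape[1])) + 0.1 for X_s in X_shards]
    for _ in range(n_iter):
        num = np.zeros_like(W)
        den = np.zeros(n_topics)
        # In a real cluster this loop would run in parallel, one shard per node.
        for s, X_s in enumerate(X_shards):
            H_shards[s], num_s, den_s = update_shard(X_s, W, H_shards[s])
            num += num_s  # "reduce" step: sum the partial statistics
            den += den_s
        W = W * num / (den[None, :] + EPS)
    return W, H_shards


def kl_divergence(X, W, H):
    """Generalized KL divergence between X and its approximation W @ H."""
    WH = W @ H + EPS
    return float(np.sum(X * np.log((X + EPS) / WH) - X + WH))
```

In this sketch each shard keeps its own documents and loadings in memory, and only the terms-by-topics statistics are summed across shards, which is one reason a smaller vocabulary would reduce the footprint and traffic per node.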

EPrint Type: Conference or Workshop Item (Poster)
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Theory & Algorithms; Information Retrieval & Textual Information Access
ID Code: 9225
Deposited By: Bjarne Ørum Wahlgreen
Deposited On: 21 February 2012