PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Scaling the iHMM: Parallelization versus Hadoop
Sebastien Bratieres, Andreas Vlachos, Jurgen van Gael and Zoubin Ghahramani
10th IEEE International Conference on Computer and Information Technology (CIT 2010), 2010.

Abstract

This paper compares parallel and distributed implementations of an iterative, Gibbs-sampling machine learning algorithm. The distributed implementations run under Hadoop on facility computing clouds. The probabilistic model under study is the infinite HMM [1], whose parameters are learnt using an instance of blocked Gibbs sampling, with one step consisting of a dynamic program. We apply this model to learn part-of-speech tags from newswire text in an unsupervised fashion. However, our focus here is on runtime performance (iteration duration, ease of development, deployment and debugging) rather than on NLP-relevant scores.
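The "blocked" Gibbs update mentioned above resamples an entire hidden state sequence at once, and the dynamic program involved is forward filtering followed by backward sampling. As a rough illustration (not the authors' implementation), the sketch below assumes a finite number of states K, standing in for whatever finite problem each iHMM iteration reduces to; the function name `ffbs` and all parameter names are our own.

```python
import numpy as np

def ffbs(obs, pi0, A, B, rng):
    """Forward-filtering backward-sampling: jointly resample a hidden
    state sequence given HMM parameters (a blocked Gibbs update).

    obs: observation indices, length T
    pi0: initial state distribution, shape (K,)
    A:   transition matrix, shape (K, K)
    B:   emission matrix, shape (K, V)
    rng: numpy random Generator
    """
    T, K = len(obs), len(pi0)

    # Forward pass: filtered distributions p(z_t | obs_{1..t}),
    # normalized at each step for numerical stability.
    alpha = np.zeros((T, K))
    alpha[0] = pi0 * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()

    # Backward pass: sample z_T, then each z_t conditioned on z_{t+1}.
    states = np.zeros(T, dtype=int)
    states[-1] = rng.choice(K, p=alpha[-1])
    for t in range(T - 2, -1, -1):
        p = alpha[t] * A[:, states[t + 1]]
        p /= p.sum()
        states[t] = rng.choice(K, p=p)
    return states
```

Because each call touches every time step of a sequence, iteration cost grows with corpus size, which is what makes the parallel-versus-Hadoop comparison of per-iteration runtime meaningful.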

EPrint Type: Article
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation
Natural Language Processing
ID Code: 8059
Deposited By: Jurgen van Gael
Deposited On: 17 March 2011