PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

A Stochastic Memoizer for Sequence Data
Frank Wood, Cedric Archambeau, Jan Gasthaus, Lancelot James and Yee Whye Teh
In: ICML 2009, 14-18 Jun 2009, Montreal, Canada.

Abstract

We propose an unbounded-depth, hierarchical, Bayesian nonparametric model for discrete sequence data. This model can be estimated from a single training sequence, yet shares statistical strength between subsequent symbol predictive distributions in such a way that predictive performance generalizes well. The model builds on a specific parameterization of an unbounded-depth hierarchical Pitman-Yor process. We introduce analytic marginalization steps (using coagulation operators) to reduce this model to one that can be represented in time and space linear in the length of the training sequence. We show how to perform inference in such a model without a truncation approximation, and we introduce the fragmentation operators necessary for predictive inference. We demonstrate the sequence memoizer by using it as a language model, achieving state-of-the-art results.

EPrint Type: Conference or Workshop Item (Paper)
Subjects: Computational, Information-Theoretic Learning with Statistics; Learning/Statistics & Optimisation; Natural Language Processing; Theory & Algorithms
ID Code: 6734
Deposited By: Yee Whye Teh
Deposited On: 08 March 2010