PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

The Sequence Memoizer
Frank Wood, Jan Gasthaus, Cedric Archambeau, Lancelot James and Yee Whye Teh
Communications of the ACM, Volume 54, Number 2, pp. 91-98, 2011.

Abstract

Probabilistic models of sequences play a central role in most machine translation, automated speech recognition, lossless compression, spell-checking, and gene identification applications, to name but a few. Unfortunately, real-world sequence data often exhibit long-range dependencies which can only be captured by computationally challenging, complex models. Sequence data arising from natural processes also often exhibit power-law properties, yet common sequence models do not capture such properties. The sequence memoizer is a new hierarchical Bayesian model for discrete sequence data that captures long-range dependencies and power-law characteristics, while remaining computationally attractive. Its utility as a language model and general-purpose lossless compressor is demonstrated.

EPrint Type: Article
Project Keyword: UNSPECIFIED
Subjects: Computational, Information-Theoretic Learning with Statistics; Learning/Statistics & Optimisation; Natural Language Processing; Theory & Algorithms
ID Code: 8129
Deposited By: Yee Whye Teh
Deposited On: 24 April 2011