PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

The Sequence Memoizer
Frank Wood, Jan Gasthaus, Cedric Archambeau, Lancelot James and Yee Whye Teh
Communications of the ACM, Volume 54, Number 2, pp. 91-98, 2011.


Probabilistic models of sequences play a central role in most machine translation, automated speech recognition, lossless compression, spell-checking, and gene identification applications, to name but a few. Unfortunately, real-world sequence data often exhibit long-range dependencies that can only be captured by complex, computationally challenging models. Sequence data arising from natural processes also often exhibit power-law properties, yet common sequence models do not capture them. The sequence memoizer is a new hierarchical Bayesian model for discrete sequence data that captures long-range dependencies and power-law characteristics while remaining computationally attractive. Its utility as a language model and general-purpose lossless compressor is demonstrated.

EPrint Type: Article
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Natural Language Processing
ID Code: 7900
Deposited By: Jan Gasthaus
Deposited On: 17 March 2011