PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Lossless Compression based on the Sequence Memoizer
Jan Gasthaus, Frank Wood and Yee Whye Teh
In: DCC 2010, 24-26 Mar 2010, Snowbird, Utah, USA.


In this work we describe a sequence compression method that combines a Bayesian nonparametric sequence model with entropy encoding. The model, a hierarchy of Pitman-Yor processes of unbounded depth previously proposed by Wood et al. (2009) in the context of language modelling, captures long-range dependencies by conditioning on contexts of unbounded length. We show that incremental approximate inference can be performed in this model, thereby allowing it to be used in a text compression setting. The resulting compressor reliably outperforms several PPM variants on many types of data, and is particularly effective on data that exhibits power-law properties.
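The abstract describes the standard split between a predictive model and an entropy coder: at each position the model supplies p(symbol | context), the coder charges -log2 p bits, and the model is updated incrementally. The sketch below illustrates that interface with a simple fixed-order adaptive count model standing in for the Sequence Memoizer (the actual model is a hierarchy of Pitman-Yor processes over unbounded-length contexts); the class and function names are illustrative, not from the paper, and the ideal code length is computed rather than emitting an actual bitstream.

```python
import math
from collections import defaultdict

class AdaptiveContextModel:
    """Order-k adaptive byte model with add-one smoothing.

    A hypothetical stand-in for the Sequence Memoizer: it exposes the
    same predict-then-update interface an entropy coder needs, but uses
    fixed-order counts instead of a Pitman-Yor hierarchy.
    """
    def __init__(self, order=3, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> symbol counts
        self.totals = defaultdict(int)                       # context -> total count

    def predict(self, context, symbol):
        """Smoothed probability of `symbol` after `context`."""
        ctx = context[-self.order:]
        c = self.counts[ctx][symbol] + 1            # add-one smoothing
        t = self.totals[ctx] + self.alphabet_size
        return c / t

    def update(self, context, symbol):
        """Incremental update after the symbol is coded."""
        ctx = context[-self.order:]
        self.counts[ctx][symbol] += 1
        self.totals[ctx] += 1

def ideal_code_length(data, model):
    """Bits an ideal entropy coder would emit: sum over t of -log2 p(x_t | context)."""
    bits = 0.0
    for i in range(len(data)):
        context = data[max(0, i - model.order):i]
        p = model.predict(context, data[i])
        bits += -math.log2(p)
        model.update(context, data[i])
    return bits

data = b"abracadabra" * 20
model = AdaptiveContextModel(order=3)
print(ideal_code_length(data, model))  # well below the raw 8 bits/byte once contexts repeat
```

The model adapts online, so no statistics need to be transmitted; a decoder running the identical model in lockstep recovers the same predictions, which is what makes the incremental inference result important for compression.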

EPrint Type: Conference or Workshop Item (Paper)
Subjects: Computational, Information-Theoretic Learning with Statistics
Learning/Statistics & Optimisation
Natural Language Processing
Theory & Algorithms
ID Code: 6695
Deposited By: Yee Whye Teh
Deposited On: 08 March 2010