PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Lossless compression based on the Sequence Memoizer
Jan Gasthaus, Frank Wood and Yee Whye Teh
In: Data Compression Conference 2010 (2010) IEEE Computer Society, Los Alamitos, CA, USA, pp. 337-345.


In this work we describe a sequence compression method that combines a Bayesian nonparametric sequence model with entropy encoding. The model, a hierarchy of Pitman-Yor processes of unbounded depth previously proposed by Wood et al. [16] in the context of language modelling, captures long-range dependencies by permitting conditioning contexts of unbounded length. We show that incremental approximate inference can be performed in this model, which allows it to be used in a text compression setting. The resulting compressor reliably outperforms several PPM variants on many types of data, and is particularly effective on data that exhibits power-law properties.
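As a rough illustration of how a predictive sequence model pairs with entropy encoding, the sketch below computes the number of bits an ideal entropy coder would spend when each symbol costs -log2 of its predicted probability. It is not the model described above: a fixed-order context model with add-one smoothing stands in for the unbounded-depth Pitman-Yor hierarchy, and the function name `ideal_code_length_bits` and its parameters are hypothetical, chosen only for this example.

```python
import math
from collections import defaultdict


def ideal_code_length_bits(data: bytes, order: int = 2, alphabet_size: int = 256) -> float:
    """Bits an ideal entropy coder would spend under an adaptive context model.

    Assumption: a fixed-order, add-one smoothed model is used purely as a
    stand-in; it is NOT the hierarchical Pitman-Yor model from the abstract.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for i, sym in enumerate(data):
        # Conditioning context: up to `order` preceding bytes.
        ctx = data[max(0, i - order):i]
        # Predictive probability before seeing the symbol (add-one smoothing).
        p = (counts[ctx][sym] + 1) / (totals[ctx] + alphabet_size)
        # An ideal entropy coder spends -log2(p) bits on this symbol.
        bits += -math.log2(p)
        # Incremental update: the decoder can mirror this after decoding.
        counts[ctx][sym] += 1
        totals[ctx] += 1
    return bits


if __name__ == "__main__":
    sample = b"ab" * 200
    for k in (0, 1):
        print(k, round(ideal_code_length_bits(sample, order=k), 1))
```

Because the model is updated only after each symbol is coded, an encoder and a decoder can keep identical statistics. On the repetitive sample, the order-1 context yields a shorter code length than order 0, a small-scale version of the benefit that longer conditioning contexts are meant to provide.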

EPrint Type: Book Section
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Natural Language Processing
ID Code: 6335
Deposited By: Jan Gasthaus
Deposited On: 17 March 2011