Lossless Compression based on the Sequence Memoizer
Jan Gasthaus, Frank Wood and Yee Whye Teh
In: DCC 2010, 24-26 Mar 2010, Snowbird, Utah, USA.
In this work we describe a sequence compression method based on combining a Bayesian nonparametric sequence model with entropy encoding. The model, a hierarchy of Pitman-Yor processes of unbounded depth previously proposed by Wood et al (2009) in the context of language modelling, allows modelling of long-range dependencies by allowing conditioning contexts of unbounded length. We show that incremental approximate inference can be performed in this model, thereby allowing it to be used in a text compression setting. The resulting compressor reliably outperforms several PPM variants on many types of data, but is particularly effective in compressing data that exhibits power law properties.