PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Building compact language models incrementally
Vesa Siivola
In: Second Baltic Conference on Human Language Technologies, Tallinn, Estonia (2005).


In traditional n-gram language modeling, we collect the statistics for all n-grams observed in the training set up to a certain order. The model can then be pruned down to a more compact size with some loss in modeling accuracy. One of the more principled methods for pruning the model is the entropy-based pruning proposed by Stolcke. In this paper, we present an algorithm for constructing an n-gram model incrementally. During model construction, our method uses less memory than the pruning-based algorithms, since we never have to handle the full unpruned model. When carefully implemented, the algorithm achieves a reasonable speed. We compare our models to entropy-pruned models in both cross-entropy and Finnish speech recognition experiments. The entropy experiments show that neither method is optimal and that entropy-based pruning is quite sensitive to the choice of the initial model. The proposed method seems better suited to creating complex models. Nevertheless, even the small models created by our method perform on a par with the best of the small entropy-pruned models in speech recognition experiments, and the more complex models created by the proposed method outperform the corresponding entropy-pruned models in our experiments.
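To make the entropy-pruning baseline concrete, the sketch below shows the core idea of Stolcke-style pruning under simplifying assumptions: the model is reduced to a dictionary of (history, word) probabilities with a single lower-order backoff distribution, and the relative-entropy increase from dropping an n-gram is approximated as P(history) * p(w|h) * (log p(w|h) - log p_bo(w)). The function name, data layout, and toy probabilities are all illustrative, not the paper's implementation; the full criterion also accounts for re-normalised backoff weights.

```python
import math

def entropy_prune(ngram_probs, backoff_probs, history_prob, threshold):
    """Keep only n-grams whose removal would raise entropy by >= threshold.

    Approximates Stolcke's relative-entropy criterion: an n-gram is
    worth storing only if its probability differs enough from the
    lower-order (backoff) estimate, weighted by how often its history
    occurs. (Simplified: backoff-weight re-normalisation is ignored.)
    """
    kept = {}
    for (history, word), p in ngram_probs.items():
        p_bo = backoff_probs[word]  # lower-order estimate for this word
        delta = history_prob[history] * p * (math.log(p) - math.log(p_bo))
        if delta >= threshold:      # informative n-gram: keep it
            kept[(history, word)] = p
    return kept

# Toy example with made-up probabilities (purely illustrative):
# "the cat" is far more likely than the backoff predicts, so it is kept;
# "the dog" adds little over its backoff estimate, so it is pruned.
ngram_probs = {("the", "cat"): 0.20, ("the", "dog"): 0.05}
backoff_probs = {"cat": 0.01, "dog": 0.04}
history_prob = {"the": 0.1}
kept = entropy_prune(ngram_probs, backoff_probs, history_prob, threshold=0.01)
```

The incremental method proposed in the paper turns this around: instead of first building the full model and then discarding uninformative n-grams, a candidate n-gram is only ever added when it earns its place, so the unpruned model never has to fit in memory.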

EPrint Type: Conference or Workshop Item (Oral)
Subjects: Natural Language Processing
ID Code: 1767
Deposited By: Vesa Siivola
Deposited On: 28 November 2005