Building compact language models incrementally
In: Second Baltic Conference on Human Language Technologies, Tallinn, Estonia (2005).
In traditional n-gram language modeling, we collect the statistics
for all n-grams observed in the training set up to a certain order.
The model can then be pruned down to a more compact size with some
loss in modeling accuracy. One of the more principled methods for
pruning the model is the entropy-based pruning proposed by
Stolcke. In this paper, we present an algorithm
for incrementally constructing an n-gram model. During the model construction, our method uses less memory than the pruning-based algorithms, since we never have to handle the full unpruned model.
When carefully implemented, the algorithm achieves a reasonable
speed. We compare our models to the entropy-pruned models in both
cross-entropy and speech recognition experiments in Finnish. The
entropy experiments show that neither method is optimal and that
entropy-based pruning is quite sensitive to the choice of the initial
model. The proposed method seems better suited to creating complex
models. Nevertheless, even the small models created by our method
perform on par with the best of the small entropy-pruned models in
speech recognition experiments. The more
complex models created by the proposed method outperform the
corresponding entropy-pruned models in our experiments.
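For readers unfamiliar with the baseline, Stolcke's entropy-based pruning removes an explicit n-gram estimate whenever doing so changes the model distribution only slightly. A compact statement of that criterion, in our own notation (p is the full model, p' the model with the estimate p(w|h) removed and the backoff weight of history h recomputed, and theta a user-chosen threshold; this sketch is not taken from the paper itself), is
\[
D\bigl(p \,\|\, p'\bigr)
  = \sum_{h} P(h) \sum_{w} p(w \mid h)
    \log \frac{p(w \mid h)}{p'(w \mid h)},
\qquad
\text{prune } (h, w) \text{ if } D\bigl(p \,\|\, p'\bigr) < \theta .
\]
Evaluating this criterion requires the full unpruned model as a starting point; the incremental method described in the abstract never materializes that model, which is where its memory savings come from.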