Distances Between Distributions: Comparing Language Models
Thierry Murgue and Colin de la Higuera
In: SSPR 2004, 18-20 Aug 2004, Lisbon, Portugal.
Language models are used in a variety of fields to support tasks
such as classification, next-symbol prediction, and pattern
analysis. To compare language models, to measure the
quality of an acquired model with respect to an empirical
distribution, or to evaluate the progress of a learning process, we
propose to use distances based on the L2 norm, or quadratic
distances. We prove that these distances can not only be estimated
through sampling, but can also be computed exactly when both
distributions are represented by stochastic deterministic finite
automata.
We report experiments showing fast convergence of the sampled
estimate and good scalability, which allows the distance to be used to decide whether two distributions are equal when only samples are available, or to classify texts.
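As a minimal sketch of the sampling-based estimate mentioned above (not the paper's exact computation on automata), the L2 distance between two distributions can be approximated from finite samples via empirical string frequencies; the function name and the example samples below are hypothetical:

```python
from collections import Counter
from math import sqrt

def l2_distance_from_samples(sample_a, sample_b):
    """Estimate the L2 (quadratic) distance between two distributions
    from finite samples, using empirical string frequencies."""
    freq_a = Counter(sample_a)
    freq_b = Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    # Sum squared differences of empirical probabilities over the
    # union of the two observed supports (unseen strings contribute 0).
    support = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a[x] / n_a - freq_b[x] / n_b) ** 2
                    for x in support))

# Two small samples, as if drawn from two language models:
s1 = ["ab", "ab", "ba", "aab", "ab"]
s2 = ["ab", "ba", "ba", "aab", "ba"]
print(l2_distance_from_samples(s1, s2))  # sqrt(0.4**2 + 0.4**2)
```

The estimate converges to the true L2 distance as the sample sizes grow; the exact computation when both distributions are given as stochastic deterministic finite automata is the subject of the paper itself.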