PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Distances Between Distributions: Comparing Language Models
Thierry Murgue and Colin de la Higuera
In: SSPR 2004, 18-20 Aug 2004, Lisbon, Portugal.

Abstract

Language models are used in a variety of fields to support other tasks such as classification, next-symbol prediction, and pattern analysis. To compare language models, to measure the quality of an acquired model with respect to an empirical distribution, or to evaluate the progress of a learning process, we propose distances based on the L2 norm (quadratic distances). We prove that these distances can not only be estimated through sampling but also computed effectively when both distributions are represented by stochastic deterministic finite automata. We report experiments showing fast convergence of the sampled distance and good scalability, which allows the distance to be used to decide whether two distributions are equal when only samples are available, or to classify texts.
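
As an illustration of the sampling-based estimate mentioned in the abstract, the sketch below computes the L2 distance between the empirical distributions of two samples of strings, i.e. the square root of the sum over all observed strings w of (p(w) - q(w))^2. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, and it does not cover the exact computation on stochastic deterministic finite automata.

```python
from collections import Counter
from math import sqrt


def empirical_distribution(sample):
    """Map each distinct string in the sample to its relative frequency."""
    if not sample:
        raise ValueError("sample must contain at least one string")
    counts = Counter(sample)
    total = len(sample)
    return {w: c / total for w, c in counts.items()}


def l2_distance(p, q):
    """L2 distance between two finite-support distributions given as dicts."""
    support = set(p) | set(q)
    # Strings missing from one distribution have probability 0 there.
    return sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in support))


def sampled_l2_distance(sample_a, sample_b):
    """Estimate the L2 distance from two samples (hypothetical helper)."""
    return l2_distance(
        empirical_distribution(sample_a),
        empirical_distribution(sample_b),
    )


if __name__ == "__main__":
    # Toy samples; real samples would be drawn from the two language models.
    sample_a = ["a", "a", "b", "b"]
    sample_b = ["a", "b", "b", "b"]
    print(sampled_l2_distance(sample_a, sample_b))
```

With larger samples drawn from the two models, the estimate would be expected to approach the true L2 distance between the underlying distributions; the rate of that convergence is what the experiments in the paper examine.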

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: Speech; Theory & Algorithms
ID Code: 89
Deposited By: Thierry Murgue
Deposited On: 18 May 2004