PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Emerge and spread models and word burstiness
Peter Sunehag
In: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (2007), Society for Artificial Intelligence and Statistics, San Juan, Puerto Rico, pp. 540-547. ISBN 0-9727358-2-8


Several authors have recently studied the problem of creating exchangeable models for natural languages that exhibit word burstiness. Word burstiness means that a word that has appeared once in a text should be more likely to appear again than it was to appear in the first place. In this article, the existing methods are compared theoretically through a unifying framework. Within this framework, we introduce new models that do not satisfy the exchangeability assumption but whose probability revisions depend only on the counts of the words that have previously appeared. We refer to these models as two-stage conditional presence/abundance models since, just like some recently introduced models for the abundance of rare species in ecology, they separate the issue of presence from the issue of abundance when present. We show that the widely used TF-IDF heuristic for information retrieval follows naturally from these models by calculating a cross-entropy. We also discuss a connection between TF-IDF and file formats that separate presence from abundance given presence.
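The definition of word burstiness above can be made concrete with a small sketch. It uses a Pólya urn (Dirichlet compound multinomial), a standard exchangeable model that shows burstiness. This is an illustration chosen here, not the two-stage presence/abundance models the paper introduces. The vocabulary, the pseudo-counts in `alpha`, and the function names are hypothetical.

```python
from collections import Counter
from math import isclose


def polya_predictive(word, history, alpha):
    """P(next word = word | history) under a Polya urn with pseudo-counts alpha."""
    counts = Counter(history)
    total_alpha = sum(alpha.values())
    return (alpha[word] + counts[word]) / (total_alpha + len(history))


def sequence_probability(words, alpha):
    """Probability of a whole word sequence, built up one word at a time."""
    prob = 1.0
    for i, word in enumerate(words):
        prob *= polya_predictive(word, words[:i], alpha)
    return prob


# Hypothetical three-word vocabulary; "urn" is a rare word a priori.
alpha = {"the": 5.0, "model": 0.5, "urn": 0.1}

# Burstiness: one occurrence of "urn" makes a second one far more likely.
p_first = polya_predictive("urn", [], alpha)
p_again = polya_predictive("urn", ["urn"], alpha)

# Exchangeability: the probability does not depend on word order.
p_order_a = sequence_probability(["urn", "the"], alpha)
p_order_b = sequence_probability(["the", "urn"], alpha)
```

With these pseudo-counts, `p_again` exceeds `p_first`, and `p_order_a` equals `p_order_b` up to floating-point rounding. The non-exchangeable models described in the abstract would, by definition, not satisfy the second property in general.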

EPrint Type: Book Section
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Information Retrieval & Textual Information Access
ID Code: 4047
Deposited By: S V N Vishwanathan
Deposited On: 25 February 2008