Modeling Word Burstiness Using The Dirichlet Distribution
Rasmus Elsborg Madsen, Charles Elkan and David Kauchak
In: ICML 2005, Bonn, Germany (2005).
Multinomial distributions are often used to
model text documents. However, they do
not capture well the phenomenon that words
in a document tend to appear in bursts: if
a word appears once, it is more likely to
appear again. In this paper, we propose
the Dirichlet compound multinomial model
(DCM) as an alternative to the multinomial.
The DCM model has one additional degree
of freedom, which allows it to capture burstiness.
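To make the contrast between the two models concrete, the following is a minimal sketch (not the authors' code) that compares the log-probability of a word-count vector under a multinomial and under a DCM, using the standard Dirichlet compound multinomial (Polya) formula. The function names, the two-word vocabulary, and the parameter values are illustrative assumptions. In this parameterization, the sum of the DCM parameters can be read as the extra degree of freedom: small values favor bursty documents, while large values make the DCM approach a multinomial.

```python
from math import lgamma, exp, log

def log_multinomial_coeff(counts):
    """Return log( n! / prod_w x_w! ) for a vector of word counts."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)

def multinomial_log_prob(counts, theta):
    """Log-probability of a count vector under a multinomial with word probabilities theta."""
    # Words with zero count contribute nothing, so they are skipped
    # (this also avoids log(0) when such a word has zero probability).
    return log_multinomial_coeff(counts) + sum(
        x * log(p) for x, p in zip(counts, theta) if x > 0
    )

def dcm_log_prob(counts, alpha):
    """Log-probability of a count vector under a DCM with parameters alpha (all > 0)."""
    n = sum(counts)
    a = sum(alpha)
    return (
        log_multinomial_coeff(counts)
        + lgamma(a) - lgamma(n + a)
        + sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    )

# Hypothetical example: a "bursty" document in which one of two words appears four times.
bursty = [4, 0]
print(exp(multinomial_log_prob(bursty, [0.5, 0.5])))  # multinomial with equal word probabilities
print(exp(dcm_log_prob(bursty, [0.1, 0.1])))          # DCM with the same mean, small alpha
```

With these illustrative values, the DCM assigns the bursty document a noticeably higher probability than the multinomial with the same expected word proportions.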
We show experimentally that the DCM
is substantially better than the multinomial
at modeling text data, measured by perplexity.
We also show using three standard document
collections that the DCM leads to better
classification than the multinomial model.
DCM performance is comparable to that obtained
with multiple heuristic changes to the