A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
Jian Zhang, Zoubin Ghahramani and Yiming Yang
In: NIPS, 13-18 Dec 2004, Vancouver, Canada.
In this paper we propose a probabilistic model for online document
clustering. We use a non-parametric Dirichlet process prior to model the
growing number of clusters, and a prior derived from a general English
language model as the base distribution to handle the generation of novel clusters.
Furthermore, cluster uncertainty is modeled with a Bayesian
Dirichlet-multinomial distribution. We use an empirical Bayes method to
estimate hyperparameters based on a historical dataset.
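
As a rough illustration of how these pieces could fit together, the sketch below combines a Dirichlet process prior over cluster assignments (in its Chinese restaurant process form) with a Dirichlet-multinomial likelihood over word counts. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the concentration parameter `alpha`, and the pseudo-count vector `base_pseudo` (standing in for the general English language model prior) are all hypothetical, and the exact formulation in the paper may differ.

```python
import math


def crp_prior(cluster_sizes, alpha):
    """Prior over the next document's cluster under a Dirichlet process.

    Returns one probability per existing cluster (proportional to its
    size), followed by the probability of opening a new cluster
    (proportional to alpha, an assumed concentration parameter).
    """
    total = sum(cluster_sizes) + alpha
    probs = [n / total for n in cluster_sizes]
    probs.append(alpha / total)
    return probs


def dirichlet_multinomial_loglik(doc_counts, pseudo_counts):
    """Log-likelihood of word counts under a Dirichlet-multinomial.

    The multinomial coefficient is omitted because it is the same for
    every cluster and cancels when posteriors are normalized.
    """
    a_sum = sum(pseudo_counts)
    n = sum(doc_counts)
    ll = math.lgamma(a_sum) - math.lgamma(a_sum + n)
    for x, a in zip(doc_counts, pseudo_counts):
        ll += math.lgamma(a + x) - math.lgamma(a)
    return ll


def assignment_posterior(doc_counts, clusters, base_pseudo, alpha):
    """Posterior over existing clusters plus a new cluster (last entry).

    clusters is a list of (n_docs, word_counts) pairs. base_pseudo is a
    hypothetical stand-in for the general-language prior; a new cluster
    uses it alone, an existing cluster adds its accumulated counts.
    """
    prior = crp_prior([n_docs for n_docs, _ in clusters], alpha)
    pseudo = [[b + c for b, c in zip(base_pseudo, counts)]
              for _, counts in clusters]
    pseudo.append(list(base_pseudo))
    log_post = [math.log(p) + dirichlet_multinomial_loglik(doc_counts, a)
                for p, a in zip(prior, pseudo)]
    # Normalize in log space for numerical stability.
    m = max(log_post)
    weights = [math.exp(v - m) for v in log_post]
    z = sum(weights)
    return [w / z for w in weights]


# Toy two-word vocabulary: one existing cluster of 2 documents.
post = assignment_posterior([1, 0], [(2, [4, 0])], [1.0, 1.0], alpha=1.0)
novelty_score = post[-1]  # posterior probability of a new cluster
```

In this sketch the last entry of the posterior can be read as a novelty score: a document is flagged as novel when the probability of it starting a new cluster is high.
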
Our probabilistic model is applied to the novelty detection task in Topic
Detection and Tracking (TDT) and compared with existing approaches in