PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
Jian Zhang, Zoubin Ghahramani and Yiming Yang
In: NIPS, 13-18 Dec 2004, Vancouver, Canada.


In this paper we propose a probabilistic model for online document clustering. We use a non-parametric Dirichlet process prior to model the growing number of clusters, and a prior based on a general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichlet-multinomial distribution. We use an empirical Bayes method to estimate hyperparameters from a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature.
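
The ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a greedy, single-pass assignment in which each incoming document joins the cluster with the highest posterior score, a Chinese-restaurant-process form of the Dirichlet process prior, and a Dirichlet-multinomial marginal likelihood per cluster. The names `log_dirichlet_multinomial`, `cluster_stream`, `base` and `alpha` are hypothetical, and the base pseudo-counts stand in for the general English language model prior. A document is flagged as novel when it opens a new cluster.

```python
import math
from collections import Counter


def log_dirichlet_multinomial(doc_counts, pseudo_counts):
    """Log marginal likelihood of a bag of words under a Dirichlet prior.

    The multinomial coefficient is omitted because it is the same for every
    cluster and so does not affect the comparison. Every word in doc_counts
    must appear in pseudo_counts (a fixed vocabulary is assumed).
    """
    a0 = sum(pseudo_counts.values())
    n = sum(doc_counts.values())
    ll = math.lgamma(a0) - math.lgamma(a0 + n)
    for word, c in doc_counts.items():
        a = pseudo_counts[word]
        ll += math.lgamma(a + c) - math.lgamma(a)
    return ll


def cluster_stream(docs, base, alpha):
    """Greedy online clustering; returns a list of (cluster_index, is_novel).

    docs  : iterable of token lists
    base  : dict word -> pseudo-count (hypothetical stand-in for the
            general-English base distribution)
    alpha : Dirichlet process concentration parameter (assumed given)
    """
    clusters = []  # each entry is [number_of_documents, Counter_of_word_counts]
    results = []
    for doc in docs:
        counts = Counter(doc)
        n_total = sum(size for size, _ in clusters)
        scores = []
        # Existing clusters: prior proportional to cluster size.
        for size, word_counts in clusters:
            posterior = {w: base[w] + word_counts.get(w, 0) for w in base}
            scores.append(math.log(size / (n_total + alpha))
                          + log_dirichlet_multinomial(counts, posterior))
        # New cluster: prior proportional to alpha, likelihood under the base.
        scores.append(math.log(alpha / (n_total + alpha))
                      + log_dirichlet_multinomial(counts, base))
        k = max(range(len(scores)), key=scores.__getitem__)
        is_novel = (k == len(clusters))
        if is_novel:
            clusters.append([0, Counter()])
        clusters[k][0] += 1
        clusters[k][1].update(counts)
        results.append((k, is_novel))
    return results
```

In this sketch `alpha` and `base` are passed in directly; the abstract states that hyperparameters are instead estimated by empirical Bayes from a historical dataset, which is not shown here.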

EPrint Type: Conference or Workshop Item (Poster)
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Information Retrieval & Textual Information Access
ID Code: 771
Deposited By: Zoubin Ghahramani
Deposited On: 30 December 2004