A Nonparametric Bayesian Approach to Modeling Overlapping Clusters
Katherine Heller and Zoubin Ghahramani
In: AISTATS 2007, Puerto Rico (2007).
Although clustering data into mutually exclusive partitions has been an extremely successful approach to unsupervised learning, there are many situations in which a richer model is needed to fully represent the data. This is the case in problems where data points simultaneously belong to multiple, overlapping clusters. For example, a particular gene may have several functions and therefore belong to several distinct clusters of genes, and a biologist may want to discover these through unsupervised modeling of gene expression data. We present a new nonparametric Bayesian method, the Infinite Overlapping Mixture Model (IOMM), for modeling overlapping clusters. The IOMM uses exponential family distributions to model each cluster and forms an overlapping mixture by taking products of such distributions, much like products of experts (Hinton, 2002). The IOMM allows an unbounded number of clusters, and assignments of points to (multiple) clusters are modeled using an Indian Buffet Process (IBP) (Griffiths and Ghahramani, 2006). The IOMM has the desirable properties of being able to focus in on overlapping regions while maintaining the ability to model a potentially infinite number of clusters that may overlap. We derive MCMC inference algorithms for the IOMM and show that these can be used to cluster movies into multiple genres.
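The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not include the MCMC inference. It assumes one-dimensional Gaussian clusters as the exponential family, and the names `sample_ibp` and `product_of_gaussians` are hypothetical. The first function draws a binary point-by-cluster assignment matrix from the standard IBP generative process; the second computes the normalized product of the Gaussians selected by one row of that matrix, using the fact that precisions and precision-weighted means add under a product.

```python
import numpy as np


def sample_ibp(num_points, alpha, rng):
    """Draw a binary assignment matrix Z (points x clusters) from an IBP prior."""
    counts = []  # counts[k] = number of points assigned to cluster k so far
    rows = []
    for i in range(1, num_points + 1):
        # Point i joins each existing cluster k with probability counts[k] / i.
        row = [int(rng.random() < m / i) for m in counts]
        for k, z in enumerate(row):
            counts[k] += z
        # Point i also opens Poisson(alpha / i) brand-new clusters.
        num_new = int(rng.poisson(alpha / i))
        row.extend([1] * num_new)
        counts.extend([1] * num_new)
        rows.append(row)
    # Pad the ragged rows with zeros into a rectangular matrix.
    Z = np.zeros((num_points, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z


def product_of_gaussians(z, means, precisions):
    """Mean and precision of the normalized product of the 1-D Gaussians selected by z."""
    z = np.asarray(z, dtype=float)
    total_precision = np.sum(z * precisions)
    if total_precision == 0:
        raise ValueError("point belongs to no cluster")
    mean = np.sum(z * precisions * means) / total_precision
    return mean, total_precision
```

For instance, `sample_ibp(20, 2.0, np.random.default_rng(0))` returns a matrix with 20 rows and a random number of columns, and a point assigned to two unit-precision clusters centered at 0 and 2 gets a product density centered between them with doubled precision, which is the sense in which a product model concentrates on the overlap region.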