Randomized Algorithms for Fast Bayesian Hierarchical Clustering
Katherine Heller and Zoubin Ghahramani
In: PASCAL Statistics and Optimization of Clustering Workshop, Windsor, UK (2005).
We present two new algorithms for fast Bayesian Hierarchical Clustering on large data sets. Bayesian Hierarchical Clustering (BHC) is a method for agglomerative hierarchical clustering based on evaluating marginal likelihoods of a probabilistic model. BHC has several advantages over traditional distance-based agglomerative clustering algorithms. It defines a probabilistic model of the data and uses Bayesian hypothesis testing to decide which merges are advantageous and to output the recommended depth of the tree. Moreover, the algorithm can be interpreted as a novel fast bottom-up approximate inference method for a Dirichlet process (i.e., countably infinite) mixture model (DPM). While the original BHC algorithm has O(n^2) computational complexity, the two new randomized algorithms are O(n log n) and O(n).
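
The merge decision described above can be illustrated with a minimal sketch. It assumes the usual two-hypothesis setup: under the first hypothesis all data in a candidate merge come from one cluster, and under the second the data keep the structure of the two existing subtrees. The function names, the argument names, and the 0.5 threshold in the usage note are illustrative assumptions and are not given in this abstract. The marginal likelihoods are taken as inputs, since they depend on the chosen probabilistic model.

```python
import math


def log_add(a, b):
    """Return log(exp(a) + exp(b)) without leaving log space."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def merge_posterior(log_p_merged, log_p_left_tree, log_p_right_tree, prior_merged):
    """Posterior probability that a candidate merge forms a single cluster.

    log_p_merged     : log marginal likelihood of all data as one cluster
    log_p_left_tree  : log marginal likelihood of the left subtree's data
    log_p_right_tree : log marginal likelihood of the right subtree's data
    prior_merged     : prior probability of the single-cluster hypothesis
    (All names here are hypothetical, chosen for illustration.)

    Returns (posterior, log_p_tree), where log_p_tree is the log marginal
    likelihood of the merged tree, reusable at the next level up.
    """
    # Hypothesis 1: everything belongs to one cluster.
    log_h1 = math.log(prior_merged) + log_p_merged
    # Hypothesis 2: the data keep the structure of the two subtrees.
    log_h2 = math.log(1.0 - prior_merged) + log_p_left_tree + log_p_right_tree
    log_p_tree = log_add(log_h1, log_h2)
    return math.exp(log_h1 - log_p_tree), log_p_tree


# Toy numbers, chosen only to exercise the arithmetic.
posterior, log_p_tree = merge_posterior(
    log_p_merged=math.log(0.2),
    log_p_left_tree=math.log(0.5),
    log_p_right_tree=math.log(0.2),
    prior_merged=0.5,
)
print(posterior)
```

Under this sketch, a merge would be treated as advantageous when the posterior exceeds an assumed threshold of 0.5, and the tree would be cut where it falls below that value.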