PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Pruning nearest neighbor cluster trees.
Ulrike v. Luxburg and Samory Kpotufe
In: ICML 2011 (2011).

Abstract

Nearest neighbor (k-NN) graphs are widely used in machine learning and data mining applications, and our aim is to better understand what they reveal about the cluster structure of the unknown underlying distribution of points. Moreover, is it possible to identify spurious structures that might arise due to sampling variability? Our first contribution is a statistical analysis that reveals how certain subgraphs of a k-NN graph form a consistent estimator of the cluster tree of the underlying distribution of points. Our second and perhaps most important contribution is the following finite sample guarantee. We carefully work out the tradeoff between aggressive and conservative pruning and are able to guarantee the removal of all spurious cluster structures at all levels of the tree while at the same time guaranteeing the recovery of salient clusters. This is the first such finite sample result in the context of clustering.

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Theory & Algorithms
ID Code: 8633
Deposited By: Ulrike Von Luxburg
Deposited On: 16 February 2012