Model Selection and Stability in k-means Clustering
Ohad Shamir and Naftali Tishby
In: COLT 2008, 9-12 July 2008, Helsinki, Finland.
Clustering stability methods are a family of widely used model selection techniques applied in data clustering. Their unifying theme is that an appropriate model should result in a clustering which is robust with respect to various kinds of perturbations. Despite their relative success, little is known theoretically about why or when they work, or even what kind of assumptions they make in choosing an 'appropriate' model. Moreover, recent theoretical work has shown that they might 'break down' for large enough samples. In this paper, we focus on the behavior of clustering stability using $k$-means clustering. Our main technical result is an exact characterization of the distribution to which suitably scaled measures of instability converge, based on a sample drawn from any distribution in $\mathbb{R}^n$ satisfying mild regularity conditions. From this, we can show that clustering stability does not 'break down' even for arbitrarily large samples, in the $k$-means framework that we study. Moreover, it allows us to identify the factors which influence the behavior of clustering stability for any sample size. This leads to some interesting preliminary observations about what kind of assumptions are made when using these methods. While often reasonable, these assumptions might also lead to unexpected consequences.
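As a rough illustration of the stability idea described in the abstract, the following minimal sketch estimates an instability score for a given $k$ by clustering two disjoint halves of a sample with $k$-means and measuring how often the two resulting clusterings disagree on the full sample. This is only one common variant and is not taken from the paper: the function names (`kmeans`, `nearest`, `disagreement`, `instability`), the split-in-half perturbation, the label-disagreement distance, and all parameter defaults are assumptions made for illustration, and the paper's scaled instability measure may differ.

```python
import itertools
import random


def nearest(p, centers):
    """Index of the centroid closest to p in squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's algorithm (illustrative); returns a list of k centroids."""
    centers = rng.sample(points, k)  # initialise from k distinct sample points
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        new_centers = []
        for g, c in zip(groups, centers):
            if g:
                new_centers.append(tuple(sum(x) / len(g) for x in zip(*g)))
            else:
                new_centers.append(c)  # keep the old centroid of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def disagreement(centers_a, centers_b, eval_points):
    """Fraction of eval_points labelled differently by the two clusterings,
    minimised over all relabellings of the clusters."""
    k = len(centers_a)
    labels_a = [nearest(p, centers_a) for p in eval_points]
    labels_b = [nearest(p, centers_b) for p in eval_points]
    best = 1.0
    for perm in itertools.permutations(range(k)):
        diff = sum(perm[a] != b for a, b in zip(labels_a, labels_b))
        best = min(best, diff / len(eval_points))
    return best


def instability(sample, k, trials=20, seed=0):
    """Average disagreement between k-means solutions fitted on two
    disjoint random halves of the sample (an assumed perturbation scheme)."""
    rng = random.Random(seed)
    half = len(sample) // 2
    total = 0.0
    for _ in range(trials):
        shuffled = sample[:]
        rng.shuffle(shuffled)
        centers_a = kmeans(shuffled[:half], k, rng)
        centers_b = kmeans(shuffled[half:], k, rng)
        total += disagreement(centers_a, centers_b, sample)
    return total / trials
```

Under this sketch, a stability-based model selection procedure would compute `instability(sample, k)` for several candidate values of `k` and prefer a value with low instability. The exhaustive search over label permutations is only practical for small `k`.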
|EPrint Type:||Conference or Workshop Item (Oral)|
|Subjects:||Theory & Algorithms|
|Deposited By:||Ohad Shamir|
|Deposited On:||14 July 2008|