PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

On the performance of clustering in Hilbert spaces
Gérard Biau, Luc Devroye and Gábor Lugosi
IEEE Transactions on Information Theory, Volume 54, pp. 781-790, 2008.


Based on $n$ randomly drawn vectors in a separable Hilbert space, one may construct a $k$-means clustering scheme by minimizing an empirical squared error. We investigate the risk of such a clustering scheme, defined as the expected squared distance of a random vector $X$ from the set of cluster centers. Our main result states that, for an almost surely bounded $X$, the expected excess clustering risk is $O(\sqrt{1/n})$. Since clustering in high-dimensional (or even infinite-dimensional) spaces may lead to severe computational problems, we examine the properties of a dimension reduction strategy for clustering based on Johnson-Lindenstrauss-type random projections. Our results reflect a tradeoff between accuracy and computational complexity when one uses $k$-means clustering after random projection of the data to a low-dimensional space. We argue that random projections work better than other simplistic dimension reduction schemes.
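
The pipeline described in the abstract (random projection to a low-dimensional space, then $k$-means on the projected data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the Gaussian projection matrix, the use of Lloyd's iteration as a heuristic stand-in for the exact empirical squared-error minimizer, the synthetic bounded data, and all function names and parameter values. The risk of each partition is evaluated in the original space as the mean squared distance of each point to the centroid of its cell.

```python
import numpy as np

def random_projection(X, d, rng):
    """Map rows of X to d dimensions with a Gaussian matrix scaled by 1/sqrt(d)."""
    R = rng.normal(size=(X.shape[1], d)) / np.sqrt(d)
    return X @ R

def kmeans(X, k, rng, n_iter=50):
    """Lloyd's heuristic (not an exact minimizer) for the empirical squared error."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # squared distances from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cell is empty
                centers[j] = members.mean(axis=0)
    # final assignment with the last set of centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)

def empirical_risk(X, labels, k):
    """Mean squared distance of each point to the centroid of its cell."""
    total = 0.0
    for j in range(k):
        members = X[labels == j]
        if len(members) > 0:
            total += ((members - members.mean(axis=0)) ** 2).sum()
    return total / len(X)

rng = np.random.default_rng(0)
n, D, d, k = 300, 500, 20, 3  # hypothetical sizes chosen for the demo

# synthetic bounded data: three groups around random means
means = rng.uniform(-1, 1, size=(k, D))
X = means[rng.integers(k, size=n)] + 0.05 * rng.uniform(-1, 1, size=(n, D))

# cluster in the original space, and in the projected space
_, labels_full = kmeans(X, k, rng)
_, labels_proj = kmeans(random_projection(X, d, rng), k, rng)

# compare both partitions by their risk in the ORIGINAL space
risk_full = empirical_risk(X, labels_full, k)
risk_proj = empirical_risk(X, labels_proj, k)
print(f"risk (full space): {risk_full:.4f}, risk (after projection): {risk_proj:.4f}")
```

Varying `d` in such an experiment gives a rough feel for the accuracy versus computation tradeoff: distance computations in the projected space cost time proportional to `d` rather than `D`, while the quality of the resulting partition depends on how well the projection preserves pairwise distances.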

EPrint Type: Article
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Computational, Information-Theoretic Learning with Statistics
Learning/Statistics & Optimisation
Theory & Algorithms
ID Code: 3929
Deposited By: Gábor Lugosi
Deposited On: 25 February 2008