PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Clustering processes
Daniil Ryabko
In: ICML 2010, June 21-24, 2010, Haifa, Israel.


The problem of clustering is considered, for the case when each data point is a sample generated by a stationary ergodic process. We propose a very natural asymptotic notion of consistency, and show that simple consistent algorithms exist, under most general non-parametric assumptions. The notion of consistency is as follows: two samples should be put into the same cluster if and only if they were generated by the same distribution. With this notion of consistency, clustering generalizes such classical statistical problems as homogeneity testing and process classification. We show that, for the case of a known number of clusters, consistency can be achieved under the only assumption that the joint distribution of the data is stationary ergodic (no parametric or Markovian assumptions, no assumptions of independence, neither between nor within the samples). If the number of clusters is unknown, consistency can be achieved under appropriate assumptions on the mixing rates of the processes. In both cases we give examples of simple (at most quadratic in each argument) algorithms which are consistent.
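
The abstract does not spell out the algorithms or the distance between samples they rely on. As a rough illustration only of the known-number-of-clusters setting, the Python sketch below assumes finite-alphabet sequences, compares two samples by a weighted sum of differences between their empirical word frequencies, picks cluster centres by a farthest-point rule, and assigns every sample to its nearest centre. The function names, the truncation parameter max_k, and the weights 2^-k are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
import random


def word_freqs(seq, k):
    """Empirical frequencies of all length-k words (contiguous blocks) in seq."""
    n = len(seq) - k + 1
    if n <= 0:
        return {}
    counts = Counter(tuple(seq[i:i + k]) for i in range(n))
    return {w: c / n for w, c in counts.items()}


def dist(x, y, max_k=3):
    """Assumed distance: weighted L1 gap between word frequencies, k = 1..max_k."""
    total = 0.0
    for k in range(1, max_k + 1):
        fx, fy = word_freqs(x, k), word_freqs(y, k)
        gap = sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0)) for w in set(fx) | set(fy))
        total += 2.0 ** (-k) * gap  # longer words get smaller weight
    return total


def cluster(samples, n_clusters, max_k=3):
    """Farthest-point centres, then nearest-centre assignment (hypothetical sketch)."""
    centres = [0]  # start from the first sample
    while len(centres) < n_clusters:
        # Next centre: the sample farthest from all centres chosen so far.
        nxt = max(
            range(len(samples)),
            key=lambda i: min(dist(samples[i], samples[c], max_k) for c in centres),
        )
        centres.append(nxt)
    # Label each sample with the index of its nearest centre.
    return [
        min(range(n_clusters), key=lambda j: dist(s, samples[centres[j]], max_k))
        for s in samples
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: i.i.d. binary sequences, a special case of stationary ergodic processes.
    draw = lambda p, n: [1 if rng.random() < p else 0 for _ in range(n)]
    data = [draw(0.2, 2000) for _ in range(3)] + [draw(0.8, 2000) for _ in range(3)]
    print(cluster(data, 2))
```

The sketch computes a pairwise distance for each sample and centre, so its cost grows with the number of samples times the number of clusters; it makes no attempt to handle the unknown-number-of-clusters case, which according to the abstract needs assumptions on mixing rates.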

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Computational, Information-Theoretic Learning with Statistics
Theory & Algorithms
ID Code: 7207
Deposited By: Daniil Ryabko
Deposited On: 09 March 2011