## Abstract

The problem of clustering is considered, for the case when each data point is a sample generated by a stationary ergodic process. We propose a very natural asymptotic notion of consistency, and show that simple consistent algorithms exist, under most general non-parametric assumptions. The notion of consistency is as follows: two samples should be put into the same cluster if and only if they were generated by the same distribution. With this notion of consistency, clustering generalizes such classical statistical problems as homogeneity testing and process classification. We show that, for the case of a known number of clusters, consistency can be achieved under the only assumption that the joint distribution of the data is stationary ergodic (no parametric or Markovian assumptions, no assumptions of independence, neither between nor within the samples). If the number of clusters is unknown, consistency can be achieved under appropriate assumptions on the mixing rates of the processes. In both cases we give examples of simple (at most quadratic in each argument) algorithms which are consistent.
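
To make the known-number-of-clusters setting concrete, the following is a minimal sketch and not necessarily the algorithm analyzed in the paper. It assumes finite-alphabet samples, a distance built from empirical word frequencies with weights `2**-m`, and farthest-point selection of cluster centers. The names `word_frequencies`, `empirical_distance`, `cluster_known_k`, and `markov_chain`, as well as the parameter `max_len`, are hypothetical and chosen only for illustration.

```python
from collections import Counter
import random


def word_frequencies(seq, m):
    """Empirical frequencies of the length-m words occurring in seq."""
    total = len(seq) - m + 1
    if total <= 0:
        return {}
    counts = Counter(tuple(seq[i:i + m]) for i in range(total))
    return {w: c / total for w, c in counts.items()}


def empirical_distance(x, y, max_len=3):
    """Assumed distance: weighted L1 gap between word frequencies of x and y."""
    d = 0.0
    for m in range(1, max_len + 1):
        fx, fy = word_frequencies(x, m), word_frequencies(y, m)
        gap = sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0)) for w in set(fx) | set(fy))
        d += 2.0 ** (-m) * gap  # longer words get geometrically smaller weight
    return d


def cluster_known_k(samples, k, max_len=3):
    """Pick k centers by farthest-point selection, then assign to the nearest center."""
    n = len(samples)
    dist = [[empirical_distance(samples[i], samples[j], max_len) for j in range(n)]
            for i in range(n)]
    centers = [0]  # the first sample serves as the first center
    while len(centers) < k:
        # next center: the sample farthest from all centers chosen so far
        nxt = max(range(n), key=lambda i: min(dist[i][c] for c in centers))
        centers.append(nxt)
    return [min(range(k), key=lambda j: dist[i][centers[j]]) for i in range(n)]


def markov_chain(n, p_stay, rng):
    """Binary Markov chain that keeps its current symbol with probability p_stay."""
    state = rng.randrange(2)
    out = []
    for _ in range(n):
        out.append(state)
        if rng.random() > p_stay:
            state = 1 - state
    return out


# Two samples from a "sticky" chain and two from an "alternating" chain.
# Both chains have the same single-symbol frequencies, so they differ only
# in their dependence structure.
rng = random.Random(0)
samples = [markov_chain(2000, 0.9, rng), markov_chain(2000, 0.9, rng),
           markov_chain(2000, 0.1, rng), markov_chain(2000, 0.1, rng)]
labels = cluster_known_k(samples, k=2)
print(labels)
```

In this toy example the samples are dependent sequences, so separating them relies on frequencies of words longer than one symbol. The pairwise distance matrix makes the sketch quadratic in the number of samples.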