Nearest Neighbor Clustering: A Baseline Method for Consistent
Clustering with Arbitrary Objective Functions
Sébastien Bubeck and Ulrike v. Luxburg
Clustering is often formulated as a discrete optimization problem. The objective is to find, among
all partitions of the data set, the best one according to some quality measure. However, in the statistical
setting where we assume that the finite data set has been sampled from some underlying
space, the goal is not to find the best partition of the given sample, but to approximate the true
partition of the underlying space. We argue that the discrete optimization approach usually does
not achieve this goal, and instead can lead to inconsistency. We construct examples which provably
have this behavior. As in the case of supervised learning, the cure is to restrict the size of
the function classes under consideration. For appropriate “small” function classes we can prove
very general consistency theorems for clustering optimization schemes. As one particular algorithm
for clustering with a restricted function space we introduce “nearest neighbor clustering”.
Similar to the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a
general baseline method for minimizing arbitrary clustering objective functions. We prove that it is
statistically consistent for all commonly used clustering objective functions.
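To make the idea of nearest neighbor clustering concrete, the following sketch restricts the candidate partitions to those obtained by labeling a small set of seed points and assigning every data point the label of its nearest seed; the chosen objective function is then minimized over all seed labelings. This is a minimal illustration written for this summary, not the authors' implementation: the brute-force search is only practical for a small number of seeds, and the function names and the example k-means cost are assumptions made here for illustration.

import itertools
import numpy as np

def nearest_neighbor_clustering(X, K, m, objective, seed=None):
    """Illustrative sketch of nearest neighbor clustering (not the paper's code).

    Candidate partitions are restricted to those induced by labeling m seed
    points with K cluster labels and giving every data point the label of its
    nearest seed. The objective is minimized by brute force over all seed
    labelings, which is feasible only for small m and K.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    seeds = rng.choice(n, size=m, replace=False)
    # index of the nearest seed point for every data point
    dists = np.linalg.norm(X[:, None, :] - X[seeds][None, :, :], axis=2)
    nearest_seed = np.argmin(dists, axis=1)

    best_labels, best_value = None, np.inf
    for seed_labels in itertools.product(range(K), repeat=m):
        labels = np.asarray(seed_labels)[nearest_seed]   # extend seed labeling to all points
        if len(np.unique(labels)) < K:                   # skip partitions with empty clusters
            continue
        value = objective(X, labels)                     # user-supplied quality measure
        if value < best_value:
            best_labels, best_value = labels, value
    return best_labels, best_value

def kmeans_cost(X, labels):
    """Example quality measure: within-cluster sum of squared distances to the cluster mean."""
    return sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
               for k in np.unique(labels))

# Example usage: labels, cost = nearest_neighbor_clustering(X, K=3, m=10, objective=kmeans_cost)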