PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

An MDL Framework for Data Clustering
Petri Kontkanen, Petri Myllymäki, Wray Buntine, Jorma Rissanen and Henry Tirri
In: Advances in Minimum Description Length: Theory and Applications (2005) MIT Press.


We regard clustering as a data assignment problem where the goal is to partition the data into several non-hierarchical groups of items. For solving this problem, we suggest an information-theoretic framework based on the minimum description length (MDL) principle. Intuitively, the idea is that we group together those data items that can be compressed well together, so that the total code length over all the data groups is minimized. One can argue that, as efficient compression is possible only when one has discovered underlying regularities that are common to all the members of a group, this approach produces an implicitly defined similarity metric between the data items. Formally, the global code length criterion to be optimized is defined by using the intuitively appealing universal normalized maximum likelihood (NML) code, which has been shown to produce an optimal compression rate in an explicitly defined manner. The number of groups can be assumed to be unknown, and the problem of deciding the optimal number is formalized as part of the same theoretical framework. In the empirical part of the paper, we present results that demonstrate the validity of the suggested clustering framework.
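
To make the idea of scoring a partition by its total code length concrete, the following is a minimal sketch and not the criterion defined in the chapter itself. It assumes a single categorical attribute per item and uses a simplified score: the multinomial normalized maximum likelihood code length of the cluster labels plus the multinomial NML code length of the items within each cluster. The function names (`multinomial_normalizer`, `nml_code_length`, `clustering_code_length`) and the toy data are hypothetical, and the recurrence used for the normalizing sum is a known result for multinomials that the abstract does not state.

```python
import math
from collections import Counter


def multinomial_normalizer(num_values, n):
    """NML normalizing sum C(K, n) for a K-valued multinomial and n samples."""
    if n == 0 or num_values == 1:
        return 1.0
    c_prev = 1.0  # C(1, n)
    # C(2, n) directly from its definition (0.0 ** 0 evaluates to 1.0 in Python).
    c_curr = sum(
        math.comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
        for h in range(n + 1)
    )
    # Assumed recurrence: C(k + 2, n) = C(k + 1, n) + (n / k) * C(k, n).
    for k in range(1, num_values - 1):
        c_prev, c_curr = c_curr, c_curr + (n / k) * c_prev
    return c_curr


def nml_code_length(counts):
    """Code length in bits of a sample with the given value counts."""
    n = sum(counts)
    max_log_lik = sum(h * math.log2(h / n) for h in counts if h > 0)
    return -max_log_lik + math.log2(multinomial_normalizer(len(counts), n))


def clustering_code_length(items, labels, num_values, num_clusters):
    """Simplified score: bits for the labels plus bits for each cluster's items."""
    label_counts = [labels.count(c) for c in range(num_clusters)]
    total = nml_code_length(label_counts)
    for c in range(num_clusters):
        members = Counter(x for x, y in zip(items, labels) if y == c)
        total += nml_code_length([members[v] for v in range(num_values)])
    return total


# Toy data: eight items taking one of two values.
items = [0, 0, 0, 0, 1, 1, 1, 1]
homogeneous = [0, 0, 0, 0, 1, 1, 1, 1]  # each cluster holds a single value
mixed = [0, 1, 0, 1, 0, 1, 0, 1]        # each cluster holds both values

score_homogeneous = clustering_code_length(items, homogeneous, 2, 2)
score_mixed = clustering_code_length(items, mixed, 2, 2)
print(score_homogeneous < score_mixed)
```

Under these assumptions, the partition whose clusters each contain a single value compresses better than the partition that mixes values, which matches the intuition that items are grouped together when they can be compressed well together. The sketch is intended for small samples only; the direct sum for C(2, n) is not numerically robust for large n.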

EPrint Type: Book Section
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Computational, Information-Theoretic Learning with Statistics; Theory & Algorithms
ID Code: 118
Deposited By: Petri Myllymäki
Deposited On: 27 May 2004