PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Assessing data mining results via swap randomization
Aristides Gionis, Heikki Mannila, Taneli Mielikäinen and Panayiotis Tsaparas
In: KDD 2006, 20-23 Aug 2006, Philadelphia, PA, USA.


The problem of assessing the significance of data mining results on high-dimensional 0-1 datasets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done with, e.g., chi-square tests or many other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are more difficult to apply to sets of patterns or other complex results of data mining. In this paper, we consider a simple randomization technique that deals with this shortcoming. The approach consists of producing random datasets that have the same row and column margins as the given dataset, computing the results of interest on the randomized instances, and comparing them against the results on the actual data. This randomization technique can be used to assess the results of many different types of data mining algorithms, such as frequent sets, clustering, and rankings. To generate random datasets with given margins, we use variations of a Markov chain approach that is based on a simple swap operation. We give theoretical results on the efficiency of different randomization methods, and apply the swap randomization method to several well-known datasets. Our results indicate that for some datasets the structure discovered by the data mining algorithms is a random artifact, while for other datasets the discovered structure conveys meaningful information.
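The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it shows only one basic swap chain, whereas the paper considers several variations. The function names (swap_randomize, pair_support, empirical_p_value), the number of attempted swaps, and the add-one form of the empirical p-value are all assumptions made for illustration. A swap takes two 1-entries at (i, j) and (k, l) whose opposite corners (i, l) and (k, j) are 0 and moves the 1s to those corners, which leaves every row and column sum unchanged.

```python
import random


def swap_randomize(matrix, num_attempts, rng=None):
    """Return a copy of a 0-1 matrix perturbed by margin-preserving swaps.

    Assumption: num_attempts is chosen by the caller; this sketch does not
    say how many attempts are needed for the chain to mix.
    """
    if rng is None:
        rng = random.Random()
    data = [row[:] for row in matrix]
    # Positions of all 1-entries, kept in sync with `data`.
    ones = [(i, j) for i, row in enumerate(data)
            for j, value in enumerate(row) if value == 1]
    if len(ones) < 2:
        return data
    for _ in range(num_attempts):
        a = rng.randrange(len(ones))
        b = rng.randrange(len(ones))
        (i, j), (k, l) = ones[a], ones[b]
        # The swap is valid only if the opposite corners are both 0;
        # otherwise the attempt is skipped and the matrix stays as it is.
        if i != k and j != l and data[i][l] == 0 and data[k][j] == 0:
            data[i][j] = data[k][l] = 0
            data[i][l] = data[k][j] = 1
            ones[a] = (i, l)
            ones[b] = (k, j)
    return data


def pair_support(matrix):
    """Hypothetical statistic: number of rows containing both column 0 and column 1."""
    return sum(1 for row in matrix if row[0] == 1 and row[1] == 1)


def empirical_p_value(matrix, statistic, num_samples, num_attempts, seed=0):
    """Fraction of randomized datasets whose statistic is at least the observed one.

    Assumption: the add-one correction counts the original data as one sample.
    """
    rng = random.Random(seed)
    observed = statistic(matrix)
    at_least = sum(
        1 for _ in range(num_samples)
        if statistic(swap_randomize(matrix, num_attempts, rng)) >= observed
    )
    return (at_least + 1) / (num_samples + 1)
```

A call such as empirical_p_value(data, pair_support, 1000, 10 * number_of_ones) would then compare the observed support of the pair against its distribution over datasets with the same margins. Any other statistic computed from the matrix, such as a count of frequent sets or a clustering error, could be passed in place of pair_support.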

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: Theory & Algorithms; Information Retrieval & Textual Information Access
ID Code: 2132
Deposited By: Taneli Mielikäinen
Deposited On: 15 July 2006