PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

An information theoretic approach to finding interesting patterns in databases
Kleanthis Kontonasios
(2009) Masters thesis, University of Bristol.

Abstract

Approximate frequent itemset (AFI) mining in binary databases is a generalisation of the well-known task of finding exact frequent itemsets. Recently, efficient algorithms that deal with large databases have been developed, theoretical results have been established, and properties of the AFI search space have been discovered. However, the output of these algorithms is often a set of itemsets that is huge, hard to understand, highly redundant and possibly trivial. Recent research ([1],[3]) has focused on assessing the interestingness of these results. In this project the method proposed in [1] is extended to include noisy itemsets. That method computes the Maximum Entropy Model for the database and, based on this model, establishes a trade-off between the informativeness and the description length of an exact itemset (the Information Ratio measure). In this project two new variants of the Information Ratio measure for noisy itemsets are proposed. A theoretical analysis of the behaviour of the measures is performed. Results on artificially generated data show that the proposed measures scale well with noise and are statistically significant. In addition, the method using Information Ratio rankings can efficiently recover an interesting tile and discard itemsets produced by chance or noise. Results on “real-world” text datasets show that the output set of itemsets consists of large, non-redundant and easy-to-interpret itemsets, which present the main topics of the mined text.
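
As an illustration of the difference between exact and approximate support mentioned in the abstract, the following is a minimal Python sketch. It assumes a simple row-wise tolerance, in which a row supports an itemset if it misses at most `max_missing` of its items; this is only one possible noise criterion and is not necessarily the definition used in the thesis. The function names and the toy database are hypothetical.

```python
from typing import List, Set


def exact_support(rows: List[Set[str]], itemset: Set[str]) -> int:
    """Count rows that contain every item of the itemset."""
    return sum(1 for row in rows if itemset <= row)


def approximate_support(rows: List[Set[str]], itemset: Set[str],
                        max_missing: int) -> int:
    """Count rows that miss at most `max_missing` items of the itemset.

    Assumption: a row-wise noise tolerance only; other AFI definitions
    also constrain the noise allowed per column.
    """
    return sum(1 for row in rows if len(itemset - row) <= max_missing)


# Toy binary database: each row is the set of items present in it.
rows = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b"},
    {"a", "b", "c", "d"},
]
itemset = {"a", "b", "c"}

print(exact_support(rows, itemset))
print(approximate_support(rows, itemset, max_missing=1))
```

With `max_missing=0` the approximate count reduces to the exact one, which is the sense in which AFI mining generalises exact frequent itemset mining.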

EPrint Type: Thesis (Masters)
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation
ID Code: 5648
Deposited By: Kleanthis-Nikolaos Kontonasios
Deposited On: 08 March 2010