An information theoretic approach to finding interesting patterns in databases
Master's thesis, University of Bristol.
Approximate frequent itemset (AFI) mining in binary databases is a generalisation of the well-known task of finding exact frequent itemsets. Recently, efficient algorithms that handle large databases have been developed, theoretical results have been established, and properties of the AFI search space have been discovered. However, the output of these algorithms is often a set of itemsets that is huge, hard to understand, highly redundant and possibly trivial. Recent research has therefore focused on assessing the interestingness of these results. In this project, a previously proposed method is extended to include noisy itemsets. That method computes the Maximum Entropy Model for the database and, based on this model, establishes a trade-off between the informativeness and the description length of an exact itemset (the Information Ratio measure). Two new variants of the Information Ratio measure for noisy itemsets are proposed here, and a theoretical analysis of their behaviour is performed. Results on artificially generated data show that the proposed measures scale well with noise and are statistically significant. In addition, the method using Information Ratio rankings can efficiently recover an interesting tile and discard itemsets produced by chance or noise. Results on “real-world” text datasets show that the output consists of large, non-redundant and easy-to-interpret itemsets, which present the main topics of the mined text.