Finding interesting itemsets using a probabilistic model for binary databases
A good formalization of the interestingness of a pattern should satisfy two criteria: it should conform well to intuition, and it should be computationally tractable to use. The focus has long been on the latter, with the development of frequent pattern mining methods. However, it is now recognized that measures more appropriate than frequency are required. In this paper we report results in this direction for itemset mining in binary databases. In particular, we introduce a probabilistic model that can be fitted efficiently to any binary database and that has a compact and explicit representation. We then show how this model enables the formalization of an intuitive and tractable interestingness measure for itemsets, relying on concepts from information theory. Our probabilistic model is closely related to the uniform distribution over all databases that can be obtained by means of swap randomization. However, in contrast to the swap randomization model, our model is explicit, which is key to its use for defining practical interestingness measures.