Finding Low-Entropy Sets and Trees from Binary Data
Hannes Heikinheimo, Eino Hinkkanen, Heikki Mannila, Taneli Mielikäinen and Jouni Seppänen
In: KDD 2007, 12-15 Aug 2007, San Jose, CA, USA.
The discovery of subsets with special properties from binary
data has been one of the key themes in pattern discovery.
Pattern classes such as frequent itemsets stress the
co-occurrence of the value 1 in the data. While this choice
makes sense in the context of sparse binary data, it disregards
potentially interesting subsets of attributes that have
some other type of dependency structure.
We consider the problem of finding all subsets of attributes
that have low complexity. The complexity is measured by
either the entropy of the projection of the data on the subset,
or the entropy of the data for the subset when modeled
using a Bayesian tree, with downward- or upward-pointing
edges. We show that the entropy measure on sets has a
monotonicity property, and thus a levelwise approach can
find all low-entropy itemsets. We also show that the tree-based
measures are bounded above by the entropy of the corresponding
itemset, allowing similar algorithms to be used
for finding low-entropy trees. We describe algorithms for
finding all subsets satisfying an entropy condition. We give
an extensive empirical evaluation of the performance of the
methods both on synthetic and on real data. We also discuss
the search for high-entropy subsets and the computation of
the Vapnik-Chervonenkis dimension of the data.
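
The monotonicity property mentioned above (the entropy of a projection cannot decrease when attributes are added) means that every subset of a low-entropy set is itself low-entropy, so candidates can be generated level by level in the style of Apriori. The following is a minimal illustrative sketch of that idea, not the authors' implementation: the function names `entropy` and `low_entropy_sets`, the row-of-tuples data layout, and the use of base-2 logarithms are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(rows, attrs):
    """Empirical entropy (in bits) of the data projected onto `attrs`.

    `rows` is a sequence of equal-length 0/1 tuples and `attrs` is a tuple
    of column indices (an assumed layout for this sketch).
    """
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())


def low_entropy_sets(rows, threshold):
    """Return {attribute set: entropy} for all sets with entropy <= threshold.

    Levelwise search: a set of size k+1 is evaluated only if all of its
    size-k subsets were accepted, which is safe because entropy is
    monotone under adding attributes.
    """
    n_attrs = len(rows[0])
    result = {}
    candidates = [(a,) for a in range(n_attrs)]
    while candidates:
        # Evaluate the current level and keep the sets under the threshold.
        accepted = []
        for attrs in candidates:
            h = entropy(rows, attrs)
            if h <= threshold:
                result[attrs] = h
                accepted.append(attrs)
        # Build the next level by extending accepted sets with a larger index
        # and pruning any candidate that has a rejected subset.
        accepted_set = set(accepted)
        next_level = set()
        for s in accepted:
            for a in range(s[-1] + 1, n_attrs):
                cand = s + (a,)
                if all(sub in accepted_set
                       for sub in combinations(cand, len(s))):
                    next_level.add(cand)
        candidates = sorted(next_level)
    return result
```

For example, calling `low_entropy_sets(rows, 1.5)` on four rows in which the first two columns are identical and the third is independent of them accepts each single attribute and the pair of identical columns, while pairs that include the independent column (entropy of 2 bits) are rejected and the full triple is never evaluated.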