PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Redundancy, deduction schemes, and minimum-size bases for association rules
José Balcázar
(2008) Technical Report.

Abstract

Association rules are among the most widely used data analysis methods in Data Mining. An association rule is a form of partial implication between two sets of binary variables. In the most common approach, association rules are parameterized by a lower bound on their confidence, which is the empirical conditional probability of their consequent given the antecedent, and/or by bounds on other parameters such as "support" or deviation from independence. We study here notions of redundancy among association rules from a fundamental perspective. Several existing notions take the form "any dataset in which this first rule holds must also obey that second rule, therefore the second is redundant"; if we see each transaction in a dataset as an interpretation (or model) in the propositional logical sense, so that datasets correspond to theories, such a notion of redundancy is in fact a form of logical entailment. In many logics, entailment has a precise syntactic counterpart in the form of a deduction calculus. Here we discuss several existing alternative definitions of redundancy and provide new characterizations of them and relationships among them. We show that the main alternatives we discuss actually correspond to just two variants, which differ in their treatment of full-confidence implications. For each of these two notions of redundancy, we provide a sound and complete deduction calculus, and we show how to construct complete bases (that is, axiomatizations) of absolutely minimum size in terms of the number of rules. We also describe some issues arising in the practical application of our proposal, and briefly discuss the relationship of our approach to other existing notions of redundancy. Reduced versions of the results in Sections 3.1, 4.2, 4.3, and 5 have been presented at Discovery Science 2008 [5]; reduced versions of the remaining results (except the unpublished results in Section 4.6) have been presented at ECML PKDD 2008 [4].
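As a reading aid, the two parameters named in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy dataset and the function names `support` and `confidence` are hypothetical and not taken from the report. Support is computed as the fraction of transactions containing an itemset, and confidence as the empirical conditional probability of the consequent given the antecedent.

```python
from fractions import Fraction

# Hypothetical toy dataset (not from the report): each transaction is the
# set of binary variables that are true in it.
DATASET = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]


def support(itemset, dataset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in dataset if itemset <= t)
    return Fraction(hits, len(dataset))


def confidence(antecedent, consequent, dataset):
    """Empirical conditional probability of `consequent` given `antecedent`."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    covered = sum(1 for t in dataset if antecedent <= t)
    if covered == 0:
        raise ValueError("antecedent occurs in no transaction")
    hits = sum(1 for t in dataset if both <= t)
    return Fraction(hits, covered)


print(support({"a", "b"}, DATASET))            # support of the itemset {a, b}
print(confidence({"a"}, {"b", "c"}, DATASET))  # rule a -> bc
print(confidence({"a", "b"}, {"c"}, DATASET))  # rule ab -> c
print(confidence({"a"}, {"b"}, DATASET))       # rule a -> b
```

On this toy data the rule a -> bc has confidence 1/2, while ab -> c and a -> b have confidence 2/3 and 3/4. That the latter two are never lower than the first is not specific to this dataset: the number of transactions containing a and b is at most the number containing a, and every transaction containing a, b, and c also contains a and b. This is one simple instance of the "any dataset in which this first rule holds must also obey that second rule" pattern; it is offered as an example of the pattern only, not as a statement of the report's deduction calculus.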

EPrint Type: Monograph (Technical Report)
Subjects: Theory & Algorithms
ID Code: 4259
Deposited By: José Balcázar
Deposited On: 22 January 2009