PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Detecting Irrelevant Subtrees to Improve Probabilistic Learning from Tree-structured Data
Amaury Habrard, Marc Bernard and Marc Sebban
Fundamenta Informaticae, Volume 66, Number 1-2, pp. 103-130, 2005.


With the large increase in the amount of available structured data (such as XML documents), many algorithms have emerged for dealing with tree-structured data. In this article, we present a probabilistic approach that aims to prune noisy or irrelevant subtrees from a set of trees a priori. The originality of this approach, in comparison with classic data reduction techniques, is that only a part of a tree (i.e., a subtree) can be deleted, rather than the whole tree. Our method is based on confidence intervals over a partition of subtrees, computed according to a given probability distribution. We propose an original approach to assess these intervals on tree-structured data, and we experimentally show its benefit in the presence of noise.

Keywords: data reduction, tree-structured data, noisy data, stochastic tree automata.

EPrint Type: Article
Project Keyword: UNSPECIFIED
Subjects: Computational, Information-Theoretic Learning with Statistics
ID Code: 90
Deposited By: Amaury Habrard
Deposited On: 18 May 2004