PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Models and Techniques in Probabilistic Grammatical Inference: Dealing with Noisy Data and Knowledge Discovery
Amaury Habrard
(2004) PhD thesis, University of Saint-Etienne.

Abstract

Probabilistic grammatical inference is a subfield of machine learning that aims to learn probabilistic finite automata. In this thesis, we focus on two main problems of this research field: the processing of noisy and irrelevant data, and knowledge discovery from tree-structured data. To deal with corrupted data, we take a pragmatic standpoint and present approaches that are directly applicable to real-world problems. On the one hand, we study "filter" methods, which detect and process noisy and irrelevant data before the learning phase. On the other hand, we propose an "embedded" approach, which aims to limit the impact of noisy data during the inference of automata. Our methods can be applied to automata built from either sequences or trees. Our knowledge discovery approach is based on a generalization of stochastic tree automata. These "generalized" automata allow us to extract tree patterns from any stochastic tree automaton. Our method can be applied not only to tree-structured data, but also to relational databases, thanks to a technique that generates trees from such databases. We show an application of our approach to a real medical relational database.

Keywords: probabilistic grammatical inference, stochastic automata, noisy and irrelevant data, stochastic tree automata, multi-relational data mining.

EPrint Type: Thesis (PhD)
Additional Information: The document is written in French
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Learning/Statistics & Optimisation
Natural Language Processing
Theory & Algorithms
ID Code: 324
Deposited By: Amaury Habrard
Deposited On: 10 December 2004