Models and Techniques in Probabilistic Grammatical Inference: Dealing with Noisy Data and Knowledge Discovery

Abstract. Probabilistic grammatical inference is a subfield of machine learning that aims at learning probabilistic finite automata. In this thesis, we focus on two main problems in this research field: the processing of noisy and irrelevant data, and knowledge discovery from tree-structured data. For corrupted data, we adopt a pragmatic standpoint and present approaches that are directly applicable to real-world problems. On the one hand, we focus on "filter" methods, which detect and process noisy and irrelevant data before the learning phase. On the other hand, we propose an "embedded" approach, which aims at limiting the impact of noisy data during the inference of automata. Our methods can be applied to automata built from either sequences or trees. Our knowledge discovery approach is based on a generalization of stochastic tree automata. These "generalized" automata allow us to extract tree patterns from any stochastic tree automaton. Our method can be applied not only to tree-structured data, but also to relational databases, thanks to a technique that generates trees from such structures. We demonstrate an application of our approach to a real medical relational database.

Keywords. Probabilistic grammatical inference, stochastic automata, noisy and irrelevant data, stochastic tree automata, multi-relational data mining.