Using PCA for Probabilistic Grammatical Inference on Trees
François Denis, Raphael Bailly, Edouard Gilbert and Amaury Habrard
In: NIPS 2009 workshop on Grammar Induction, Representation of Language and Language Learning, Whistler, Canada (2009).
We focus on the classical problem in grammatical inference of learning stochastic tree languages from
finite samples of trees independently drawn according to a fixed
distribution. We consider here the class of stochastic tree languages that
can be computed by rational tree series, which can be viewed as a
generalization of probabilistic tree automata. The class of rational stochastic tree
languages has an algebraic characterization: all the residuals of a stochastic
tree language lie in a finite-dimensional vector subspace. We propose a principle based on
Principal Components Analysis to identify this vector subspace.
This approach lets us define a global solution to the problem
instead of building an automaton iteratively, as standard
grammatical inference algorithms do. It thereby addresses the main
drawback of those iterative approaches: their statistical tests
rely on fewer and fewer examples as the structure grows.
We provide an algorithm that computes an estimate of the target
vector subspace and builds a linear representation of a tree series
that gives an estimate of the target distribution. We notably show that in the
case of tree languages, we have to consider the dual vector subspace to
build the representation.
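
As a rough illustration of the subspace-identification idea only, the sketch below estimates a low-dimensional vector subspace from noisy sample vectors using a PCA-style decomposition of the uncentered second-moment matrix. It is not the algorithm of the paper: it works on plain real vectors rather than on residuals of tree series, the target dimension k is assumed to be known, and every name in it (second_moment, top_directions, residual_norm, the toy vectors u1 and u2) is hypothetical. Power iteration with re-orthogonalization stands in for a full eigendecomposition so that only the Python standard library is needed.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def normalize(x):
    norm = math.sqrt(dot(x, x))
    if norm == 0.0:
        raise ValueError("zero vector: k may exceed the rank of the data")
    return [c / norm for c in x]


def second_moment(vectors):
    """Uncentered second-moment matrix (1/n) * sum_i v_i v_i^T."""
    n, d = len(vectors), len(vectors[0])
    m = [[0.0] * d for _ in range(d)]
    for v in vectors:
        for i in range(d):
            for j in range(d):
                m[i][j] += v[i] * v[j] / n
    return m


def project_out(x, basis):
    """Remove from x its components along an orthonormal basis."""
    for b in basis:
        c = dot(b, x)
        x = [xi - c * bi for xi, bi in zip(x, b)]
    return x


def top_directions(m, k, iters=500, seed=0):
    """Orthonormal basis of the (approximate) top-k eigenspace of a
    symmetric positive semi-definite matrix m, found one direction at
    a time by power iteration, re-orthogonalizing at every step."""
    rng = random.Random(seed)
    d = len(m)
    basis = []
    for _ in range(k):
        x = normalize(project_out([rng.uniform(-1, 1) for _ in range(d)], basis))
        for _ in range(iters):
            mx = [dot(row, x) for row in m]
            x = normalize(project_out(mx, basis))
        basis.append(x)
    return basis


def residual_norm(basis, v):
    """Norm of the part of v lying outside span(basis)."""
    r = project_out(list(v), basis)
    return math.sqrt(dot(r, r))


# Toy data (an assumption for illustration): noisy points near a
# 2-dimensional subspace of R^5 spanned by u1 and u2.
rng = random.Random(1)
u1 = [1.0, 0.0, 1.0, 0.0, 0.0]
u2 = [0.0, 1.0, 0.0, 1.0, 0.0]
samples = []
for _ in range(200):
    a, b = rng.uniform(-3, 3), rng.uniform(-1, 1)
    samples.append([a * p + b * q + rng.gauss(0, 1e-3) for p, q in zip(u1, u2)])

basis = top_directions(second_moment(samples), k=2)
```

With this toy data, residual_norm(basis, u1) and residual_norm(basis, u2) should be close to zero, while a vector orthogonal to both generators keeps nearly its full norm. The second-moment matrix is left uncentered because the target here is a vector subspace through the origin rather than an affine one; that choice is an assumption of this sketch.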