## Abstract

We focus on the classical problem in grammatical inference of learning stochastic tree languages from finite samples of trees drawn independently according to a fixed unknown distribution. We consider the class of stochastic tree languages that can be computed by rational tree series, which can be viewed as a strict generalization of probabilistic tree automata. The class of rational stochastic tree languages has an algebraic characterization: all the residuals of a stochastic language lie in a finite-dimensional vector subspace. We propose a principle based on Principal Component Analysis to identify this vector subspace. This approach allows us to define a global solution to the problem instead of building an automaton iteratively, as standard probabilistic grammatical inference algorithms do. It thereby addresses the main drawback of these iterative approaches: they use statistical tests that rely on fewer and fewer examples as the structure grows. We provide an algorithm that computes an estimate of the target vector subspace and builds a linear representation of a tree series that gives an estimate of the target distribution. We notably show that, in the case of tree languages, we have to consider the dual vector subspace to build the representation.
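As a rough illustration of the subspace-identification idea only, the following sketch applies uncentered PCA (via a singular value decomposition) to a matrix whose rows lie near a low-dimensional subspace. It is not the algorithm of the paper: the matrix here is synthetic, the names `estimate_subspace`, `true_basis`, and `noisy` are hypothetical, the target rank is assumed to be known, and the dual-space construction needed for tree languages is not modeled.

```python
import numpy as np

def estimate_subspace(M, rank):
    """Return an orthonormal basis (rank x n_cols) of the row subspace
    that best fits the rows of M in the least-squares sense."""
    # Uncentered PCA: the top right singular vectors of M span that subspace.
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    return vt[:rank]

# Synthetic stand-in for empirical estimates (assumption, not real data):
# 50 row vectors lying in a 2-dimensional subspace of R^6, plus small noise.
rng = np.random.default_rng(0)
true_basis = rng.normal(size=(2, 6))
coeffs = rng.normal(size=(50, 2))
exact = coeffs @ true_basis
noisy = exact + 1e-3 * rng.normal(size=exact.shape)

basis = estimate_subspace(noisy, rank=2)

# Project the noise-free rows onto the estimated subspace and measure
# how much of them falls outside it.
projected = exact @ basis.T @ basis
relative_residual = np.linalg.norm(exact - projected) / np.linalg.norm(exact)
print(f"relative residual: {relative_residual:.2e}")
```

In this toy setting a small residual indicates that the estimated subspace is close to the one that generated the rows, even though it was computed from the noisy matrix in a single global step rather than built up incrementally.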