Un modèle Statistique pour la classification de Documents Structurés
Trang Vu, Ludovic Denoyer and Patrick Gallinari
In: EGC 2003, Lyon, France (2003).
We present a learning model for the categorization of structured documents that takes both structural and textual information into account. We first define a generative model of structured documents using belief networks. We then transform this generative model into a discriminative one using the Fisher kernel. Finally, we describe an instance of the model applied to the categorization of HTML documents. Experiments on a classical corpus show that using structural information allows the model to outperform other classical models.
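To illustrate the step from a generative model to a discriminative one, here is a minimal sketch of a Fisher kernel. It does not reproduce the belief-network model of the paper: as a stand-in it assumes a simple unigram model with parameters `theta`, takes the unconstrained gradient of the log-likelihood as the Fisher score, and approximates the Fisher information matrix by the identity. The names `fisher_score`, `fisher_kernel`, `theta`, `doc_a` and `doc_b` are hypothetical and chosen only for this example.

```python
from collections import Counter

def fisher_score(tokens, theta):
    """Fisher score of a document under a unigram model (an assumed stand-in).

    With log p(d | theta) = sum_w n_w * log(theta_w), the unconstrained
    gradient with respect to theta_w is n_w / theta_w.
    Tokens outside the vocabulary of theta are ignored.
    """
    counts = Counter(tokens)
    return {w: counts.get(w, 0) / p for w, p in theta.items()}

def fisher_kernel(tokens_a, tokens_b, theta):
    """Dot product of two Fisher scores.

    Assumption: the Fisher information matrix is replaced by the identity,
    a common simplification.
    """
    score_a = fisher_score(tokens_a, theta)
    score_b = fisher_score(tokens_b, theta)
    return sum(score_a[w] * score_b[w] for w in theta)

# Toy parameters and documents, invented for illustration only.
theta = {"title": 0.5, "body": 0.25, "link": 0.25}
doc_a = ["title", "body", "body"]
doc_b = ["title", "link"]

k_ab = fisher_kernel(doc_a, doc_b, theta)
print(k_ab)
```

The resulting kernel values can be passed to any kernel-based discriminative classifier. In a structured-document setting the score vector would also contain gradients with respect to the structural parameters of the generative model, which this sketch leaves out.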