PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Probabilistic Model for Structured Document Mapping.
Guillaume Wisniewski, Ludovic Denoyer and Patrick Gallinari
In: MLDM 2007 (2007).


We address the problem of automatically learning to map heterogeneous semi-structured documents onto a mediated target XML schema. We adopt a machine learning approach in which the mapping between input and target documents is learned from a training corpus of documents. We first introduce a general stochastic model of semi-structured document generation and transformation. This model relies on the concept of a meta-document, a latent variable that provides a link between input and target documents. It allows us to learn the correspondences when the input documents are expressed in a large variety of schemas. We then detail an instance of the general model for the particular task of HTML-to-XML conversion. This instance is tested on three different corpora using two different inference methods: a dynamic programming method and an approximate LaSO-based method.

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Computational, Information-Theoretic Learning with Statistics
          Learning/Statistics & Optimisation
          Information Retrieval & Textual Information Access
ID Code: 3660
Deposited By: Ludovic Denoyer
Deposited On: 14 February 2008