Stochastic Models for Document Restructuration
Patrick Gallinari, Guillaume Wisniewski, Francis Maes and Ludovic Denoyer
In: ECML Workshop On Relational Machine Learning, 3-7 Oct 2005, Porto, Portugal.
Document (re)structuration consists of mapping documents that come from different sources, in different formats, onto a predefined semi-structured format. This generic problem arises in application settings such as querying heterogeneous semi-structured databases, peer-to-peer systems, legacy document conversion, and XML information retrieval. In this paper, we define the restructuration problem from a document-centric perspective and identify the main issues raised by this new problem. We then consider two instances of restructuration: structuring flat documents and learning the correspondence between structured formats. We propose stochastic models for these two tasks and describe tests on a large XML document collection.