PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Stochastic Models for Document Restructuration
Patrick Gallinari, Guillaume Wisniewski, Francis Maes and Ludovic Denoyer
In: ECML Workshop On Relational Machine Learning, 3-7 Oct 2005, Porto, Portugal.


Document (re)structuration consists of mapping documents coming from different sources, with different formats, onto a predefined semi-structured format. This generic problem appears in different application settings such as heterogeneous semi-structured database querying, peer-to-peer systems, legacy document conversion, and XML information retrieval. In the paper, we define the restructuration problem from a document-centric perspective and identify the main problems it raises. We then consider two restructuration instances: structuring flat documents and learning the correspondence between structured formats. We propose stochastic models for these two tasks and describe tests on a large XML document collection.

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Learning/Statistics & Optimisation
Natural Language Processing
Information Retrieval & Textual Information Access
ID Code: 183
Deposited By: Ludovic Denoyer
Deposited On: 28 November 2005