PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

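Modèle probabiliste pour l'extraction de structures dans les documents semi-structurés : Application aux documents Web [A probabilistic model for structure extraction from semi-structured documents: application to Web documents]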
Ludovic Denoyer, Patrick Gallinari and Francis Maes
In: CORIA 2006, France (2006).


With content management systems becoming mainstream, the Web has changed dramatically: more and more web pages are now generated from relational databases, and their design reflects the logical structure of the documents. In this work, we show that the layout of a web document carries enough information to recover, in a more machine-friendly format, the kind of data people are already producing. Extracting a semantic structure from the layout of documents faces two main obstacles: structures are heterogeneous (they change with the source producing them) and often remain implicit. We introduce a general stochastic model of semi-structured document generation and transformation, and detail an instance of this model for the particular task of HTML-to-XML conversion.

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Learning/Statistics & Optimisation
Information Retrieval & Textual Information Access
ID Code: 2797
Deposited By: Ludovic Denoyer
Deposited On: 24 March 2009