PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Apprentissage et inférence statistique dans les bases de documents structurés : Application aux corpus de documents textuels ==== Machine learning and structured data : application to categorization, clustering and automatic mapping of XML documents.
Ludovic Denoyer
(2004) PhD thesis, LIP6 - University of Paris 6.

Abstract

With the development … of new data formats like XML and HTML, the domain of Information Retrieval (IR) has evolved considerably. The notion of a “piece of information” has changed: classical IR models must first be adapted to structured data, and the new tasks that emerge from this type of document must also be studied. In our work, we study three main problems: categorization and clustering, which are two classical IR tasks, and document mapping, which is a new problem specific to semi-structured documents. We first propose a general class of probabilistic models for structured data. We describe a model that handles flat documents that are thematically heterogeneous. We then present a generative model, based on the belief network formalism, for tree-structured data (such as XML). It takes both structural and content information into account. We explain how this model can be used for categorization, clustering and automatic document mapping.

EPrint Type: Thesis (PhD)
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Natural Language Processing; Information Retrieval & Textual Information Access
ID Code: 1447
Deposited By: Ludovic Denoyer
Deposited On: 28 November 2005