Apprentissage et inférence statistique dans les bases de
documents structurés :
Application aux corpus de documents textuels
Machine learning and structured data: application to categorization, clustering and
automatic mapping of XML documents.
PhD thesis, LIP6 - University of Paris 6.
With the development … of new data formats like XML and HTML, the domain of
Information Retrieval (IR) has evolved considerably. The notion of a “piece of information” has
changed: classical IR models must first be adapted to structured data, and the new tasks
that arise from this type of document must also be studied.
In our work, we study three main problems: categorization and clustering, which are two classical
IR tasks, and document mapping, a new problem specific to semi-structured data. We first propose
a general class of probabilistic models for structured data. We describe a model that can
handle flat documents that are thematically heterogeneous. We then present a generative
model, based on the belief network formalism, for tree-structured data such as XML. It takes
both structural and content information into account. We explain how this model can be used for
categorization, clustering and automatic document mapping.