|
Apprentissage et inférence statistique dans les bases de
documents structurés :
Application aux corpus de documents textuels
====
Machine learning and structured data: application to categorization, clustering and automatic mapping of XML documents.

Abstract: With the development … of new data formats like XML and HTML, the domain of Information Retrieval (IR) has evolved considerably. The notion of a “piece of information” has changed completely: classical IR models must first be adapted to structured data, and the new tasks that emerge from this type of document must also be studied. In our work, we study three main problems: categorization and clustering, which are two classical IR tasks, and document mapping, which is a new problem specific to semi-structured data. We first propose a general class of probabilistic models for structured data. We describe a model that can handle flat documents that are thematically heterogeneous. We then present a generative model, based on the belief network formalism, for tree-structured data (such as XML). It takes both structural and content information into account. We explain how this model can be used for categorization, clustering and automatic document mapping.
|