PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Mining XML Documents
L. Candillier, Ludovic Denoyer, Patrick Gallinari, M. C. Rousset, A. Termier and A. N. Vercoustre
In: Data Mining Patterns: New Methods and Applications (2007) Idea Group Inc.


XML documents are becoming ubiquitous because their rich and flexible format can be used for a variety of applications. Given the increasing size of XML collections as information sources, mining techniques that traditionally exist for text collections or databases need to be adapted, and new methods need to be invented, to exploit the particular structure of XML documents. Basically, XML documents can be seen as trees, which are well known to be complex structures. This chapter describes various ways of using and simplifying this tree structure to model documents and support efficient mining algorithms. We focus on three mining tasks: classification and clustering, which are standard for text collections, and the discovery of frequent tree structures, which is especially important for heterogeneous collections. This chapter presents some recent approaches and algorithms to support these tasks, together with experimental evaluations on a variety of large XML collections.

EPrint Type:Book Section
Project Keyword:Project Keyword UNSPECIFIED
Subjects:Information Retrieval & Textual Information Access
ID Code:3654
Deposited By:Ludovic Denoyer
Deposited On:14 February 2008