PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

The Wikipedia XML Corpus
Ludovic Denoyer and Patrick Gallinari
SIGIR Forum Volume 40, Number 1, 2006.


Wikipedia is a well-known free-content, multilingual encyclopedia written collaboratively by contributors around the world. Anybody can edit an article using a wiki markup language that offers a simplified alternative to HTML. The encyclopedia is composed of millions of articles in different languages. Content-oriented XML retrieval is an area of Information Retrieval (IR) research that is receiving increasing interest. There is already a very active community in the IR/XML domain, which has started to work on XML search engines and XML textual data. Since 2002, this community has been organized mainly around the INEX initiative (INitiative for the Evaluation of XML Retrieval), which is funded by the DELOS Network of Excellence on Digital Libraries. In this article, we describe a set of XML collections based on Wikipedia. These collections can be used in a large variety of XML IR and Machine Learning tasks, such as ad-hoc retrieval, categorization, clustering, or structure mapping. The corpora are currently used for both INEX 2006 and the XML Document Mining Challenge. The article provides a description of the corpus. The collections are downloadable on the website: –

EPrint Type: Article
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Learning/Statistics & Optimisation
Information Retrieval & Textual Information Access
ID Code: 2796
Deposited By: Ludovic Denoyer
Deposited On: 24 March 2009