PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Contextualized Text Classification Using Wikipedia
Pascal Lehwark, Ulf Brefeld, Mikio Braun and Klaus-Robert Müller
In: KDD 2010: 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, July 25-28, 2010, Washington, DC, USA.


The bag-of-words (BoW) model is possibly the most prominent representation of text, for several reasons: it is easy to use, it yields sparse and compact document representations, and it gives reasonable results in many scenarios. However, bag-of-words inevitably leads to poor representations when a document contains only a few relevant terms. In this paper, we propose a mapping that contextualizes documents onto Wikipedia to identify additional relevant terms that capture the context of the document. Given a document, our approach generates a ranked list of all terms in the knowledge base and adds the top-$k$ terms to the initial BoW representation. Our method is based on co-citation, relies solely on the link structure of Wikipedia, and can be computed very efficiently. We show empirically on Reuters-21578, 20-Newsgroups, and an IMDB movie data set that the contextual representation significantly increases the predictive performance of state-of-the-art text classifiers. Our approach consistently outperforms the bag-of-words baseline.
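
The general idea of ranking knowledge-base terms by co-citation and adding the top-$k$ to a BoW vector can be illustrated with a minimal sketch. This is not the authors' implementation: the toy link table `LINKS`, the function names `cocitation_scores` and `contextualize`, the scoring rule (count how often a candidate is linked from the same page as a document term), and the weight of 1 given to added terms are all assumptions made for illustration. It also assumes document terms have already been mapped to article titles.

```python
from collections import Counter

# Hypothetical toy link structure: page -> set of articles it links to.
LINKS = {
    "Film_review": {"Actor", "Director", "Cinema"},
    "Hollywood": {"Actor", "Director", "Studio"},
    "Biology": {"Cell", "Gene"},
    "Oscar": {"Actor", "Cinema"},
}

def cocitation_scores(seed_terms, links):
    """Score each non-seed article by how often a page links to it
    together with one or more seed terms (a simple co-citation count)."""
    seeds = set(seed_terms)
    scores = Counter()
    for targets in links.values():
        hits = len(seeds & targets)  # seed terms cited by this page
        if hits == 0:
            continue
        for term in targets - seeds:
            scores[term] += hits
    return scores

def contextualize(bow, links, k):
    """Return a copy of the BoW dict with the k best-scoring terms added.
    Added terms get weight 1 (an assumption, not taken from the paper)."""
    scores = cocitation_scores(bow, links)
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    augmented = dict(bow)
    for term in ranked[:k]:
        augmented.setdefault(term, 1)
    return augmented

if __name__ == "__main__":
    print(contextualize({"Actor": 2}, LINKS, k=2))
```

In this toy setting, a document containing only the term "Actor" picks up the two articles most often linked alongside it, while unrelated articles such as "Gene" receive no score at all.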

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: Natural Language Processing; Theory & Algorithms
ID Code: 6451
Deposited By: Mikio Braun
Deposited On: 08 March 2010