Contextualized Text Classification Using Wikipedia
Pascal Lehwark, Ulf Brefeld, Mikio Braun and Klaus-Robert Müller
In: KDD 2010: 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, July 25-28, 2010, Washington, DC, USA.
The bag-of-words (BoW) model is possibly the most prominent
representation of text, for several reasons: it is easy to use,
it yields sparse and compact representations of documents,
and it gives reasonable results in many scenarios. However, bag-of-words
inevitably leads to
poor representations when a document contains only a few relevant
terms. In this paper, we propose to contextualize documents by mapping
them onto Wikipedia to identify additional relevant terms that capture
the context of the document. Given a document, our approach generates a
ranked list of all terms in the knowledge base and adds the
top-$k$ terms to the initial BoW representation.
Our method is based on co-citation, relies solely on the
link structure of Wikipedia, and can be computed very efficiently.
We show empirically on Reuters-21578, 20-Newsgroups, and an IMDB movie data set
that the predictive performance of state-of-the-art text classifiers
increases significantly with the contextual representation. Our approach
consistently outperforms the bag-of-words baseline.
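
The idea of ranking knowledge-base terms by co-citation and adding the top-$k$ to the BoW can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring function, so the sketch assumes the simplest form of co-citation (a candidate term gains one point for every page that links to both the candidate and a document term). The names `cocitation_rank` and `contextualize`, the toy link graph, and the weight of 1 given to added terms are all hypothetical, and only terms with a nonzero score are ranked.

```python
from collections import Counter


def cocitation_rank(seed_terms, outlinks):
    """Rank candidate terms by co-citation with the seed terms.

    `outlinks` maps a page title to the titles it links to. A candidate
    is co-cited with a seed whenever some page links to both.
    Assumption: plain co-citation counts, not the paper's exact score.
    """
    seeds = set(seed_terms)
    scores = Counter()
    for targets in outlinks.values():
        targets = set(targets)
        hits = len(targets & seeds)  # seeds cited by this page
        if hits == 0:
            continue
        for candidate in targets - seeds:
            scores[candidate] += hits
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def contextualize(bow, outlinks, k):
    """Return a copy of `bow` with the top-k co-cited terms added.

    Assumption: each added term receives a count of 1.
    """
    augmented = dict(bow)
    for term, _score in cocitation_rank(bow, outlinks)[:k]:
        augmented[term] = augmented.get(term, 0) + 1
    return augmented


# Toy link structure standing in for Wikipedia (hypothetical data).
toy_outlinks = {
    "Page A": ["football", "goal", "stadium"],
    "Page B": ["football", "stadium", "referee"],
    "Page C": ["goal", "keeper"],
    "Page D": ["opera", "aria"],
}
toy_bow = {"football": 2, "goal": 1}

ranked = cocitation_rank(toy_bow, toy_outlinks)
augmented = contextualize(toy_bow, toy_outlinks, k=2)
print(ranked)
print(augmented)
```

In this toy graph "stadium" ranks first because two pages cite it together with document terms, whereas unrelated terms such as "opera" never enter the ranking. Because the score needs only one pass over the link lists, it is cheap to compute, in line with the efficiency claim above.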