|
Contextualized Text Classification Using Wikipedia AbstractA bag-of-words (BoW) is possibly the most prominent representation of text for multiple reasons: it is easy to use, and it leads to sparse and compact representations of documents, and to reasonable results in many scenarios. However, bag-of-words will inevitably lead to poor representations when there are only a few relevant terms in the document. In this paper, we propose a mapping to contextualize documents onto Wikipedia to identify additional relevant terms, capturing the context of the document. Given a document, our approach generates a ranked list of all terms in the knowledge base and augments the top-$k$ to the initial BoW representation. Our method is based on co-citation, solely relies on the link structure of Wikipedia and can be computed very efficiently. We show empirically on Reuters21578, 20-Newsgroups, and an IMDB movie data set that the predictive performance of state-of-the-art text classifiers is significantly increased for the contextual representation. Our approach consistently outperforms the bag-of-words baseline.
[Edit] |