Part-of-speech enhanced context recognition
Rasmus Elsborg Madsen, Jan Larsen and Lars Kai Hansen
In: IEEE Workshop on Machine Learning for Signal Processing XIV, 29 Sep - 01 Oct 2004, São Luís, Brazil.
Language-independent 'bag-of-words' representations are surprisingly
effective for text classification. In this communication our aim is to
elucidate the synergy between language-independent features and simple
language model features. We consider term tag features estimated by a
so-called part-of-speech tagger.
The feature sets are combined in an early binding design with an
optimized binding coefficient that allows weighting of the relative
variance contributions of the participating feature sets. With the
combined features, documents are classified using a latent semantic
indexing representation and a probabilistic neural network classifier.
Three medium-size data sets are analyzed and we find consistent synergy
between the term and natural language features in all three sets for a
range of training set sizes. The most significant enhancement is found
for small text databases where high recognition rates are possible.
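
The early binding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `combine_features` and `lsi_project`, the parameter `alpha` standing in for the binding coefficient, and the choice to normalise each centred feature block to unit total variance are all assumptions made for illustration. The probabilistic neural network classifier is omitted.

```python
import numpy as np


def combine_features(X_term, X_pos, alpha):
    """Early binding (illustrative): weight and concatenate two feature sets.

    X_term : (documents x terms) matrix, e.g. bag-of-words counts.
    X_pos  : (documents x tags) matrix, e.g. part-of-speech tag counts.
    alpha  : hypothetical binding coefficient in [0, 1]; the fraction of
             total variance assigned to the term features.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    def unit_variance(X):
        # Centre the columns and scale the block to unit Frobenius norm,
        # so that alpha alone sets the relative variance contribution.
        Xc = X - X.mean(axis=0)
        norm = np.linalg.norm(Xc)
        return Xc / norm if norm > 0 else Xc

    A = unit_variance(np.asarray(X_term, dtype=float))
    B = unit_variance(np.asarray(X_pos, dtype=float))
    return np.hstack([np.sqrt(alpha) * A, np.sqrt(1.0 - alpha) * B])


def lsi_project(X, k):
    """Latent semantic indexing via truncated SVD: keep k leading components."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]
```

In a setup like this, `alpha` would typically be chosen by validation on held-out documents, and the rows returned by `lsi_project` would be passed to the classifier.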