PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Part-of-speech enhanced context recognition
Rasmus Elsborg Madsen, Jan Larsen and Lars Kai Hansen
In: IEEE Workshop on Machine Learning for Signal Processing XIV, 29 Sep - 01 Oct 2004, São Luís, Brazil.

Abstract

Language independent 'bag-of-words' representations are surprisingly effective for text classification. In this communication our aim is to elucidate the synergy between language independent features and simple language model features. We consider term tag features estimated by a so-called part-of-speech tagger. The feature sets are combined in an early binding design with an optimized binding coefficient that allows weighting of the relative variance contributions of the participating feature sets. With the combined features documents are classified using a latent semantic indexing representation and a probabilistic neural network classifier. Three medium size data-sets are analyzed and we find consistent synergy between the term and natural language features in all three sets for a range of training set sizes. The most significant enhancement is found for small text databases where high recognition rates are possible.
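
The pipeline outlined in the abstract (early binding of two feature sets, a latent semantic indexing projection, and a probabilistic neural network classifier) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the unit-variance block scaling, the square-root weighting by `alpha`, the kernel width `sigma`, and the toy count matrices are all assumptions made for illustration. The binding coefficient is simply fixed here, whereas the abstract describes it as optimized.

```python
import numpy as np

def fit_block_scaler(X):
    """Return the column means and total standard deviation of one feature block."""
    mean = X.mean(axis=0)
    scale = np.sqrt(((X - mean) ** 2).sum())
    return mean, (scale if scale > 0 else 1.0)

def bind_features(X_term, X_tag, alpha, term_scaler, tag_scaler):
    """Early binding: scale each block to unit total variance (on the training
    data), then weight so the blocks contribute alpha and 1 - alpha of the
    combined variance. This particular normalisation is an assumption."""
    (m_term, s_term), (m_tag, s_tag) = term_scaler, tag_scaler
    return np.hstack([
        np.sqrt(alpha) * (X_term - m_term) / s_term,
        np.sqrt(1.0 - alpha) * (X_tag - m_tag) / s_tag,
    ])

def fit_lsi(X, k):
    """Latent semantic indexing via SVD: keep the k leading right singular vectors."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T  # shape (n_features, k)

def pnn_predict(Z_train, y_train, Z_test, sigma):
    """Probabilistic neural network: average a Gaussian kernel over the training
    points of each class and pick the class with the highest score."""
    classes = np.unique(y_train)
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    scores = np.stack([K[:, y_train == c].mean(axis=1) for c in classes], axis=1)
    return classes[scores.argmax(axis=1)]

# Hypothetical toy data: rows are documents, columns are term counts / tag counts.
X_term = np.array([[3, 0, 1], [4, 0, 0], [0, 3, 1], [0, 4, 0]], dtype=float)
X_tag = np.array([[2, 1], [3, 1], [1, 2], [1, 3]], dtype=float)
y = np.array([0, 0, 1, 1])
X_term_new = np.array([[5, 0, 1], [0, 5, 1]], dtype=float)
X_tag_new = np.array([[3, 1], [1, 3]], dtype=float)

alpha = 0.5  # binding coefficient, fixed here for illustration
term_scaler, tag_scaler = fit_block_scaler(X_term), fit_block_scaler(X_tag)
X_train = bind_features(X_term, X_tag, alpha, term_scaler, tag_scaler)
X_new = bind_features(X_term_new, X_tag_new, alpha, term_scaler, tag_scaler)

V = fit_lsi(X_train, k=2)  # LSI basis estimated on the training documents only
pred = pnn_predict(X_train @ V, y, X_new @ V, sigma=0.2)
```

With this scaling, `alpha = 1` uses only the term features and `alpha = 0` uses only the tag features. In practice one would presumably choose `alpha`, `k`, and `sigma` on held-out data.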

EPrint Type: Conference or Workshop Item (Talk)
Project Keyword: UNSPECIFIED
Subjects: Natural Language Processing; Information Retrieval & Textual Information Access
ID Code: 839
Deposited By: Rasmus Elsborg Madsen
Deposited On: 01 January 2005