Corpus-Based vs. Model-Based Selection of Relevant Features
Cyril Goutte, Pavel Dobrokhotov, Eric Gaussier and Anne-Lise Veuthey
In: CORIA'04, March 10-12, 2004, Toulouse, France.
In this contribution, we review a number of approaches to feature
selection, divided into two broad classes. Some are corpus-based, i.e.
they use only the data to assess the relevance of each feature, and
aim to identify a small subset of relevant features on which to
train categorisation models. Others are model-based, i.e. they assess
the relevance of each feature on the basis of the model used for
categorisation. This second class of measures makes the model's
decisions easier to understand. Furthermore, comparing the two classes
provides insight into whether corpus-based feature extraction is
selective enough and does not overgenerate compared to model-based
selection. Our experimental comparison is mainly based on a collection
of medical abstracts, provided by the Swiss Institute of Bioinformatics.
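
To make the distinction between the two classes concrete, the following is a
minimal sketch, not the measures actually evaluated in the paper. It assumes
information gain as an example of a corpus-based score and the absolute
log-odds of a multinomial naive Bayes model as an example of a model-based
score. The toy corpus, the function names and the smoothing parameter `alpha`
are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Toy labelled corpus (hypothetical data, for illustration only).
# Label 1 = "molecular biology", label 0 = "clinical".
DOCS = [
    ("protein kinase binding site".split(), 1),
    ("protein structure binding domain".split(), 1),
    ("kinase activity protein".split(), 1),
    ("patient study clinical trial".split(), 0),
    ("clinical study of patient outcome".split(), 0),
    ("trial outcome study".split(), 0),
]


def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs):
    """Corpus-based score: uses only term presence and class labels."""
    base = entropy(Counter(y for _, y in docs).values())
    vocab = {w for words, _ in docs for w in words}
    scores = {}
    for w in vocab:
        present = Counter(y for words, y in docs if w in words)
        absent = Counter(y for words, y in docs if w not in words)
        cond = sum(
            sum(part.values()) / len(docs) * entropy(part.values())
            for part in (present, absent)
        )
        scores[w] = base - cond
    return scores


def naive_bayes_log_odds(docs, alpha=1.0):
    """Model-based score: fit a multinomial naive Bayes model with
    Laplace smoothing, then rank terms by |log P(w|1) - log P(w|0)|."""
    vocab = sorted({w for words, _ in docs for w in words})
    counts = {0: Counter(), 1: Counter()}
    for words, y in docs:
        counts[y].update(words)
    totals = {y: sum(c.values()) + alpha * len(vocab) for y, c in counts.items()}
    return {
        w: abs(
            math.log((counts[1][w] + alpha) / totals[1])
            - math.log((counts[0][w] + alpha) / totals[0])
        )
        for w in vocab
    }


if __name__ == "__main__":
    ig = information_gain(DOCS)
    lo = naive_bayes_log_odds(DOCS)
    for name, scores in (("information gain", ig), ("NB log-odds", lo)):
        top = sorted(scores, key=scores.get, reverse=True)[:5]
        print(name, "->", top)
```

Comparing the two rankings on the same corpus is one simple way to see whether
a corpus-based score keeps features that the trained model itself does not
rely on.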