PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Corpus-Based vs. Model-Based Selection of Relevant Features
Cyril Goutte, Pavel Dobrokhotov, Eric Gaussier and Anne-Lise Veuthey
In: CORIA'04, March 10-12, 2004, Toulouse, France.


In this contribution, we review a number of approaches to feature selection, divided into two broad classes. Some are corpus-based, i.e. they use only the data to assess the relevance of each feature, and aim at identifying a small subset of relevant features on which to train categorisation models. Others are model-based, i.e. they assess the relevance of each feature on the basis of the model used for categorisation. This second class of measures makes the model's decisions easier to understand. Furthermore, comparing the two classes provides insight into whether corpus-based feature extraction is selective enough, or whether it overgenerates compared to model-based selection. Our experimental comparison is mainly based on a collection of medical abstracts provided by the Swiss Institute of Bioinformatics.
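
The distinction between the two classes can be illustrated with a minimal sketch. The abstract does not name the specific measures studied, so the choices below are assumptions made purely for illustration: information gain (mutual information between term presence and class) stands in for a corpus-based measure, and the absolute presence log-odds of a smoothed Bernoulli naive Bayes model stands in for a model-based one. The toy corpus, the function names and the smoothing parameter `alpha` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a set of terms plus a binary label.
DOCS = [
    ({"protein", "mutation", "disease"}, 1),
    ({"protein", "mutation", "patient"}, 1),
    ({"mutation", "disease", "study"}, 1),
    ({"protein", "structure", "study"}, 0),
    ({"structure", "patient", "study"}, 0),
    ({"protein", "structure", "binding"}, 0),
]
VOCAB = sorted(set().union(*(terms for terms, _ in DOCS)))


def information_gain(docs, term):
    """Corpus-based score (assumed measure): mutual information, in bits,
    between the presence of `term` and the class label."""
    n = len(docs)
    joint = Counter((term in terms, label) for terms, label in docs)
    present = Counter(term in terms for terms, _ in docs)
    labels = Counter(label for _, label in docs)
    score = 0.0
    for (t, c), count in joint.items():
        p_tc = count / n
        score += p_tc * math.log2(p_tc / ((present[t] / n) * (labels[c] / n)))
    return score


def naive_bayes_weights(docs, vocab, alpha=1.0):
    """Model-based score (assumed model): log P(term | 1) - log P(term | 0)
    with Laplace smoothing. Only the presence term of the Bernoulli model
    is used, which is a simplification."""
    n_pos = sum(1 for _, label in docs if label == 1)
    n_neg = len(docs) - n_pos
    weights = {}
    for term in vocab:
        df_pos = sum(1 for terms, label in docs if label == 1 and term in terms)
        df_neg = sum(1 for terms, label in docs if label == 0 and term in terms)
        p_pos = (df_pos + alpha) / (n_pos + 2 * alpha)
        p_neg = (df_neg + alpha) / (n_neg + 2 * alpha)
        weights[term] = math.log(p_pos) - math.log(p_neg)
    return weights


def top_k(scores, k):
    """Rank terms by decreasing score, breaking ties alphabetically."""
    return [t for t, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]


corpus_scores = {t: information_gain(DOCS, t) for t in VOCAB}
model_weights = naive_bayes_weights(DOCS, VOCAB)
model_scores = {t: abs(w) for t, w in model_weights.items()}

corpus_top = top_k(corpus_scores, 2)
model_top = top_k(model_scores, 2)
```

Comparing `corpus_top` with `model_top` is a small-scale version of the comparison described above: terms ranked highly by the corpus-based score but given little weight by the model would suggest that the corpus-based selection overgenerates.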

EPrint Type: Conference or Workshop Item (Paper)
Subjects: Learning/Statistics & Optimisation; Natural Language Processing; Information Retrieval & Textual Information Access
ID Code: 548
Deposited By: Cyril Goutte
Deposited On: 25 December 2004