PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Learning to order terms: supervised interestingness measures in terminology extraction
Jérome Azé, Mathieu Roche, Yves Kodratoff and Michele Sebag
International Journal of Computational Intelligence 2004.


Term Extraction, a key data preparation step in Text Mining, extracts the terms, i.e. relevant collocations of words, attached to specific concepts (e.g. genetic-algorithms and decision-trees are terms associated with the concept “Machine Learning”). In this paper, the task of extracting interesting collocations is achieved through a supervised learning algorithm that exploits a few collocations manually labelled as interesting or not interesting. From these examples, the Roger algorithm learns a numerical function that induces a ranking on the collocations. This ranking is optimized using genetic algorithms to maximize the Area Under the ROC Curve, which summarizes the trade-off between the true positive and false positive rates. The approach represents each word collocation as the vector of values of the standard statistical interestingness measures computed for that collocation. As this representation is general (across corpora and natural languages), generality tests were performed by applying the ranking function learned from an English corpus in Biology to a French corpus of Curricula Vitae, and vice versa, showing good robustness of the approach compared to the state-of-the-art Support Vector Machine (SVM).
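
The idea summarized in the abstract, learning a scoring function whose induced ranking maximizes the area under the ROC curve, can be illustrated with a minimal sketch. This is not the Roger algorithm itself: the abstract does not specify the form of the learned function or the evolutionary operators, so the linear scoring function, the simple (1+λ) evolution strategy, and all names and parameters below (`auc`, `score`, `evolve_ranker`, `sigma`, `offspring`) are assumptions made for illustration. Each collocation is assumed to be given as a vector of interestingness-measure values with a 0/1 label.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve, computed as the Wilcoxon-Mann-Whitney
    statistic: the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def score(weights, features):
    """Assumed linear scoring function over the interestingness measures."""
    return sum(w * x for w, x in zip(weights, features))


def evolve_ranker(X, y, generations=200, offspring=10, sigma=0.3, seed=0):
    """Hypothetical (1+lambda) evolution strategy: mutate the parent weight
    vector with Gaussian noise and keep the best child if its AUC on the
    labelled examples is at least as good as the parent's."""
    rng = random.Random(seed)
    dim = len(X[0])
    parent = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    parent_auc = auc([score(parent, x) for x in X], y)
    for _ in range(generations):
        children = [
            [w + rng.gauss(0.0, sigma) for w in parent]
            for _ in range(offspring)
        ]
        scored = [(auc([score(c, x) for x in X], y), c) for c in children]
        top_auc, top = max(scored, key=lambda t: t[0])
        if top_auc >= parent_auc:
            parent, parent_auc = top, top_auc
    return parent, parent_auc
```

Calling `evolve_ranker(X, y)` on a small labelled set returns a weight vector and its training AUC. Because the feature vector consists only of statistical measures, the same weights could then be applied to collocations from another corpus by sorting them on `score(weights, features)`, which is the kind of transfer test the abstract describes.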

EPrint Type: Article
Additional Information: Text-mining, Terminology Extraction, Evolutionary algorithm, ROC Curve.
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Learning/Statistics & Optimisation
Natural Language Processing
ID Code: 657
Deposited By: Jérome Azé
Deposited On: 29 December 2004