Learning Interestingness Measures in Terminology Extraction: A ROC-based Approach
In Text Mining, a key phase of data preparation is the extraction of terms, i.e. collocations of words attached to specific concepts (e.g. Philosophy-Dissertation). In this paper, Term Extraction is formalized as a supervised learning task: a ranking hypothesis is learned from a set of terms labeled as relevant or irrelevant by the expert. This task is tackled using the evolutionary algorithm ROGER, which optimizes the area under the ROC curve attached to a ranking hypothesis. Empirical validation on two real-world applications demonstrates outstanding improvements over state-of-the-art interestingness measures in Term Extraction. The approach is found to be robust across domains (Molecular Biology, Curriculum Vitae) and languages (English, French).
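To make the optimization criterion concrete, the following minimal sketch computes the area under the ROC curve of a ranking hypothesis as the fraction of (relevant, irrelevant) term pairs it orders correctly (the Wilcoxon-Mann-Whitney statistic), and improves a linear scoring function with a simple (1+1) evolution strategy. This is an illustration only, not the ROGER algorithm itself: the linear form of the hypothesis, the function names (`auc`, `linear_scores`, `evolve_weights`), the mutation settings, and the toy feature values are all assumptions made for the example.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via the Wilcoxon-Mann-Whitney statistic:
    the fraction of (relevant, irrelevant) pairs ranked correctly,
    with ties counted as half a correct pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one relevant and one irrelevant term")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def linear_scores(weights, features):
    """Score each term as a weighted sum of its features (assumed hypothesis form)."""
    return [sum(w * x for w, x in zip(weights, row)) for row in features]


def evolve_weights(features, labels, generations=200, sigma=0.3, seed=0):
    """(1+1) evolution strategy, a simple stand-in for an evolutionary optimizer:
    mutate the weights with Gaussian noise and keep the child if its AUC
    is at least as good as the parent's."""
    rng = random.Random(seed)
    dim = len(features[0])
    best = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    best_auc = auc(linear_scores(best, features), labels)
    for _ in range(generations):
        child = [w + rng.gauss(0.0, sigma) for w in best]
        child_auc = auc(linear_scores(child, features), labels)
        if child_auc >= best_auc:
            best, best_auc = child, child_auc
    return best, best_auc


# Hypothetical toy data: each row holds two made-up statistical features
# of a candidate term; label 1 = relevant, 0 = irrelevant (expert judgment).
FEATURES = [[5.0, 0.9], [4.0, 0.8], [3.0, 0.2], [1.0, 0.7], [2.0, 0.1], [0.5, 0.3]]
LABELS = [1, 1, 1, 0, 0, 0]
```

Calling `evolve_weights(FEATURES, LABELS)` returns a weight vector together with its AUC on the labeled terms. Because the AUC depends only on the ordering of the scores, any monotone rescaling of the learned hypothesis yields the same value, which is why it suits a ranking task.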