PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Intégration de la construction de la terminologie de domaines spécialisés dans un processus global de fouille de textes [Integrating the construction of specialized-domain terminology into a global text-mining process]
Mathieu Roche
(2004) PhD thesis, Université Paris-Sud.

Abstract

Extracting information from specialized texts requires a complete text-mining process, and one step of that process is term detection. Terms are defined as groups of words that form a linguistic instance of a user-defined concept. For example, the term "data mining" evokes the concept of "computational technique". Terminology acquisition starts by extracting groups of words that instantiate simple syntactic patterns such as Noun-Noun or Adjective-Noun. A distinctive feature of our algorithm is that it works iteratively to build complex terms. For example, if the Noun-Noun term "data mining" is found at the first iteration, the term "data-mining application" can be obtained at the next one. Moreover, EXIT (Iterative EXtraction of the Terminology) places the expert at the center of the terminology extraction process, and the expert can intervene throughout. Beyond its iterative design, the system exposes many parameters. One of them selects among various statistical criteria used to rank the terms by their relevance to the task at hand. Our approach was validated on four corpora that differ in language, size, and field of specialty. Lastly, a method based on supervised machine learning is proposed to improve the quality of the resulting terminology.
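
The iterative idea in the abstract can be illustrated with a minimal sketch. This is not the EXIT implementation: the function names, the toy part-of-speech tags, the use of raw frequency as the ranking criterion, and the automatic acceptance of the top-ranked candidate (standing in for the expert's validation) are all assumptions made for illustration.

```python
from collections import Counter

# Syntactic patterns named in the abstract; the tag names are assumed.
PATTERNS = {("NOUN", "NOUN"), ("ADJ", "NOUN")}

def count_candidates(sentences):
    """Count adjacent word pairs whose tags match one of the patterns."""
    counts = Counter()
    for sent in sentences:
        for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
            if (t1, t2) in PATTERNS:
                counts[(w1, w2)] += 1
    return counts

def merge(sent, accepted):
    """Rewrite accepted pairs as one hyphenated token tagged NOUN."""
    out, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent):
            (w1, t1), (w2, t2) = sent[i], sent[i + 1]
            if (w1, w2) in accepted and (t1, t2) in PATTERNS:
                out.append((w1 + "-" + w2, "NOUN"))
                i += 2
                continue
        out.append(sent[i])
        i += 1
    return out

def extract_terms(sentences, top_k=1, max_iter=5):
    """Iteratively extract terms, merging accepted ones before the next pass.

    Raw frequency is an assumed ranking criterion, and keeping the top_k
    candidates is a stand-in for validation by an expert.
    """
    terms = []
    for _ in range(max_iter):
        ranked = count_candidates(sentences).most_common(top_k)
        if not ranked:
            break
        accepted = {pair for pair, _ in ranked}
        terms.extend(" ".join(pair) for pair, _ in ranked)
        sentences = [merge(s, accepted) for s in sentences]
    return terms

# Toy corpus, already tagged by hand (hypothetical data).
SENTENCES = [
    [("data", "NOUN"), ("mining", "NOUN"), ("application", "NOUN")],
    [("data", "NOUN"), ("mining", "NOUN"), ("is", "VERB"), ("useful", "ADJ")],
    [("a", "DET"), ("data", "NOUN"), ("mining", "NOUN"), ("application", "NOUN")],
]

print(extract_terms(SENTENCES))
```

On this toy corpus the first pass accepts "data mining", and because the merged token is tagged as a noun, the second pass can match it against the Noun-Noun pattern again and yield "data-mining application". Swapping the frequency count for another statistical measure would only change `count_candidates`.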

EPrint Type:Thesis (PhD)
Subjects:Learning/Statistics & Optimisation; Natural Language Processing
ID Code:1809
Deposited By:Mathieu Roche
Deposited On:28 November 2005