PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Apprentissage actif pour l'annotation de documents (Active learning for document annotation)
Loïc Lecerf and Boris Chidlovskii
In: CORIA, 4ème Conférence en Recherche d'Information et Applications, 28-30 March 2007, Saint-Etienne, France.

Abstract

In the framework of the LegDoc project at Xerox Research Centre Europe, we are developing components for the semantic annotation of semi-structured documents. While certain semantic entities have regular forms and can be extracted easily, more complex and heterogeneous collections favor the deployment of machine learning methods. Moreover, real-world cases pose the technical challenge that training sets are unavailable for specific annotation tasks. As manual annotation is costly and error-prone, our approach consists of applying active learning methods in order to considerably reduce the size of the corpus required to learn accurate models. In this paper, we present our recent progress in semantic document annotation. In particular, we explain how active learning principles are adapted to the interactive semantic annotation of layout-oriented documents. We deploy a maximum entropy classifier for probabilistic reasoning and three uncertainty metrics for the efficient application of active learning to large collections. We present the Active Learning Document Annotation Interface (ALDAI) prototype and describe its functionality and implementation choices. The prototype offers a WYSIWYG interface and a high-level language for feature definitions, and it integrates an active learning component aimed at helping users during the annotation process. We also report evaluation results from testing the active learning techniques on one public (UCI) and one internal document collection.
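
The abstract does not name the three uncertainty metrics, so the following is only a minimal sketch of uncertainty-based selection under assumptions. It uses three commonly cited scores (least confidence, margin, and entropy), which may differ from the ones in the paper. All function names are hypothetical, and the class probabilities are assumed to come from any probabilistic classifier, such as a maximum entropy model.

```python
import math

# Each score takes one element's class-probability distribution and
# returns a value where HIGHER means MORE uncertain.
# These three metrics are assumptions, not taken from the paper.

def least_confidence(probs):
    # Uncertainty as one minus the probability of the top class.
    return 1.0 - max(probs)

def margin_score(probs):
    # Small gap between the two best classes means high uncertainty,
    # so the gap is inverted to keep "higher = more uncertain".
    top = sorted(probs, reverse=True)
    return 1.0 - (top[0] - top[1])

def entropy_score(probs):
    # Shannon entropy of the predicted distribution (natural log).
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_most_uncertain(prob_rows, metric, k):
    # Rank unlabeled elements by uncertainty and return the indices
    # of the k elements to propose to the annotator next.
    ranked = sorted(range(len(prob_rows)),
                    key=lambda i: metric(prob_rows[i]),
                    reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    # Hypothetical predictions for three unlabeled document elements.
    predictions = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
    print(select_most_uncertain(predictions, entropy_score, 2))
```

In an interactive setting, the selected elements would be shown to the user for annotation, the classifier retrained on the enlarged labeled set, and the loop repeated until accuracy is judged sufficient.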

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Natural Language Processing; Information Retrieval & Textual Information Access
ID Code: 3040
Deposited By: Boris Chidlovskii
Deposited On: 16 September 2007