PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Sample Selection for Statistical Parsers: Cognitively Driven Algorithms and Evaluation Measures
Roi Reichart and Ari Rappoport
In: CoNLL 2009 (2009).


Creating large amounts of manually annotated training data for statistical parsers imposes heavy cognitive load on the human annotator and is thus costly and error-prone. It is hence of high importance to decrease the human efforts involved in creating training data without harming parser performance. For constituency parsers, these efforts are traditionally evaluated using the total number of constituents (TC) measure, assuming uniform cost for each annotated item. In this paper, we introduce novel measures that quantify aspects of the cognitive efforts of the human annotator that are not reflected by the TC measure, and show that they are well established in the psycholinguistic literature. We present a novel parameter-based sample selection approach for creating good samples in terms of these measures. We describe methods for global optimisation of lexical parameters of the sample based on a novel optimisation problem, the constrained multiset multicover problem, and for cluster-based sampling according to syntactic parameters. Our methods outperform previously suggested methods in terms of the new measures, while maintaining similar TC performance.

EPrint Type: Conference or Workshop Item (Paper)
Additional Information: This paper won the Best Paper Award at CoNLL 2009.
Project Keyword: UNSPECIFIED
Subjects: Computational, Information-Theoretic Learning with Statistics
          User Modelling for Computer Human Interaction
          Learning/Statistics & Optimisation
          Natural Language Processing
ID Code: 5568
Deposited By: Ari Rappoport
Deposited On: 04 March 2010