PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Using Domain Similarity for Performance Estimation
Vincent Van Asch and Walter Daelemans
Proceedings of the ACL 2010 Workshop on Domain Adaptation for Natural Language Processing (DANLP), pp. 31-36, 2010. ISBN 978-1-932432-80-0


Many natural language processing (NLP) tools exhibit a decrease in performance when they are applied to data that is linguistically different from the corpus used during development. This makes it hard to develop NLP tools for domains for which annotated corpora are not available. This paper explores a number of metrics that attempt to predict the cross-domain performance of an NLP tool through statistical inference. We apply different similarity metrics to compare different domains and investigate the correlation between similarity and the accuracy loss of an NLP tool. We find that the correlation between the performance of the tool and the similarity metric is linear and that the latter can therefore be used to predict the performance of an NLP tool on out-of-domain data. The approach also provides a way to quantify the difference between domains.
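
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the similarity metrics used, so the sketch assumes Jensen-Shannon divergence over unigram token distributions as one plausible choice. All function names (unigram_distribution, jensen_shannon_divergence, fit_line, predict_accuracy) are hypothetical. The sketch measures how far a test domain lies from the training domain, then fits a straight line that maps divergence to observed accuracy so that accuracy on an unseen domain can be estimated.

```python
import math
from collections import Counter


def unigram_distribution(tokens):
    """Relative frequency of each token in a corpus given as a list of strings."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions.

    Assumed metric, chosen for illustration only. The value is 0 for
    identical distributions and 1 when the vocabularies do not overlap.
    """
    def kl_to_mixture(a, b):
        # KL(a || m), where m is the midpoint distribution of a and b.
        total = 0.0
        for tok, pa in a.items():
            m = 0.5 * (pa + b.get(tok, 0.0))
            total += pa * math.log2(pa / m)
        return total

    return 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; cannot fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_accuracy(divergence, slope, intercept):
    """Estimate accuracy on a new domain from its divergence to the training domain."""
    return slope * divergence + intercept
```

In use, one would compute the divergence between the training corpus and each domain for which annotated test data exists, fit the line on those (divergence, accuracy) pairs, and then call predict_accuracy with the divergence of a domain that has no annotations.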

EPrint Type: Article
Project Keyword: UNSPECIFIED
Subjects: Natural Language Processing
ID Code: 7014
Deposited By: Vincent Van Asch
Deposited On: 08 October 2010