PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Extending the tool, or how to annotate historical language varieties
Cristina Sánchez-Marco, Gemma Boleda and Lluís Padró
In: Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (2011) Association for Computational Linguistics, pp. 1-9. ISBN 9781937284046


We present a simple, general method for adapting an existing NLP tool so that it can handle historical varieties of languages. The approach consists of expanding the tool's dictionary with old word variants and retraining the tagger on a small training corpus. We implement this approach for Old Spanish. A thorough evaluation of the extended tool shows that the method achieves almost state-of-the-art performance, adequate for quantitative studies in the humanities: 94.5% accuracy for the main part of speech and 92.6% for lemma. To our knowledge, this is the first time such a strategy has been adopted to annotate historical language varieties, and we believe it could also be applied to other non-standard varieties of languages.
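
As a rough illustration of the dictionary-expansion step only, the sketch below adds generated old-spelling variants to a toy lexicon. Everything in it is an assumption made for the example: the two spelling rules, the lexicon entries, the tag names, and the function name `expand_lexicon` are hypothetical and are not taken from the paper, and retraining the tagger is not covered.

```python
from typing import Dict, List, Tuple

# Hypothetical spelling rules (modern substring -> older spelling).
# Illustrative only; a real rule set would be larger and context-sensitive.
VARIANT_RULES: List[Tuple[str, str]] = [("z", "ç"), ("j", "x")]

Entry = Tuple[str, str]  # (lemma, part-of-speech tag)


def expand_lexicon(lexicon: Dict[str, Entry]) -> Dict[str, Entry]:
    """Return a copy of the lexicon with generated old-spelling variants added."""
    expanded = dict(lexicon)
    for form, entry in lexicon.items():
        for modern, old in VARIANT_RULES:
            if modern in form:
                variant = form.replace(modern, old)
                # Keep any entry that already exists under this spelling.
                expanded.setdefault(variant, entry)
    return expanded


# Toy modern lexicon (assumed entries, for demonstration).
modern_lexicon = {"cabeza": ("cabeza", "NOUN"), "dijo": ("decir", "VERB")}
expanded = expand_lexicon(modern_lexicon)
```

Each generated variant inherits the lemma and tag of its modern form, so an old spelling such as "dixo" would be analysed like "dijo". Blanket substring replacement over-generates on real vocabulary, which is one reason this should be read as a sketch rather than a usable rule set.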

EPrint Type: Book Section
Project Keyword: UNSPECIFIED
Subjects: Natural Language Processing
ID Code: 8889
Deposited By: Gemma Boleda
Deposited On: 21 February 2012