Weigh your words: Memory-based lemma-retrieval for Middle Dutch literary texts
However vibrant the field of computational linguistics may be for modern languages, researchers have only recently taken an interest in language technology for historical languages. The lack of annotated training data and the huge spelling variation so typical of this material make it difficult to 'dive' into. Moreover, rather basic tasks such as lemmatization (considered 'solved' for many modern languages) remain problematic, because existing technologies are hard to transfer to older stages of a language. Medieval Dutch (ca. 1200-1500), the focus language of this contribution, is indeed characterized by a huge amount of variation in spelling and spacing, a phenomenon due to a variety of factors, including the absence of a written standard language, the phonological nature of spelling, allography, and clitics. In this paper we focus on retrieving lemma candidates ('nominees') for tokens by means of a lazy machine learner, trained on the literary part of the corpus-Gysseling (ca. 600K running tokens). When testing such a classifier, one is confronted with a large number of 'unknown tokens' in the test set (on average 1 out of 20 tokens is unknown). These 'unknowns' fall into two major categories: tokens whose lemmas (i.e. class labels) were observed during training, and tokens whose lemmas were not previously encountered (and which therefore cannot be classified, because the class label is unknown to the classifier). Regarding the first category, we will demonstrate that a memory-based token classifier outperforms established string metrics (such as the Levenshtein distance or the Dice coefficient) in both efficiency and effectiveness. This lightweight classification technique is based on similarity metrics that operate on the alternation of consonantal and vocalic clusters in tokens. Regarding the second category (unknown tokens with unknown class labels), it appears that many of these tokens are actually proper names.
To remedy this lack of prior knowledge, we have turned to an important resource: the REMLT ("Repertory of proper names in Middle Dutch Literary Texts"). This repertory, which contains a large collection of proper names together with many of their spelling variants in medieval Dutch literature, was converted from raw WordPerfect files into structured XML. We will demonstrate the (unsurprisingly positive) effects of including this gazetteer-like information in our training material.
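
As an illustration of the cluster-based retrieval described above, the following is a minimal sketch rather than the classifier used in this work. It assumes that a token is split into alternating consonantal and vocalic clusters, that "y" counts as a vowel, and that similarity is a simple positional overlap count between cluster sequences. The function names (`cv_clusters`, `overlap`, `nominees`) and the toy memory of token-lemma pairs are hypothetical and chosen only for the example.

```python
import re

VOWELS = "aeiouy"  # assumption: 'y' is treated as a vowel

def cv_clusters(token):
    """Split a token into alternating consonantal and vocalic clusters.

    An empty onset is prepended to vowel-initial tokens so that even
    positions always hold consonantal clusters.
    """
    clusters = re.findall(r"[aeiouy]+|[^aeiouy]+", token.lower())
    if clusters and clusters[0][0] in VOWELS:
        clusters.insert(0, "")
    return clusters

def overlap(a, b):
    """Count the positions at which two cluster sequences agree."""
    return sum(x == y for x, y in zip(a, b))

def nominees(token, memory, k=3):
    """Return the distinct lemmas of the k most similar stored tokens."""
    query = cv_clusters(token)
    ranked = sorted(
        memory,
        key=lambda pair: overlap(query, cv_clusters(pair[0])),
        reverse=True,
    )
    lemmas = []
    for _form, lemma in ranked[:k]:
        if lemma not in lemmas:
            lemmas.append(lemma)
    return lemmas

# Toy memory of (token, lemma) pairs; illustrative only, not corpus data.
MEMORY = [
    ("coninc", "koning"),
    ("coninck", "koning"),
    ("ridder", "ridder"),
    ("riddere", "ridder"),
    ("vrouwe", "vrouw"),
]

print(cv_clusters("coninc"))
print(nominees("conync", MEMORY))
```

Because the comparison works on a handful of clusters rather than on every character pair, each lookup is cheap, which is one plausible reading of the efficiency advantage over edit-distance computations. A gazetteer of proper names could be added to such a memory simply as further (token, lemma) pairs.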