A RULE BASED APPROACH TO WORD LEMMATIZATION
Joel Plisson, Nada Lavrac and Dunja Mladenić
In: SiKDD 2004 at multiconference IS-2004, 12-15 October, Ljubljana, Slovenia.
Lemmatization is the process of finding the normalized form of a word. It can be viewed as finding the transformation that must be applied to a word to obtain its normalized form. The approach presented in this paper focuses on word endings: which suffix should be removed and/or added to obtain the normalized form. The paper compares the results of two word lemmatization algorithms, one based on if-then rules and the other based on ripple down rules induction. It presents the problem of lemmatizing words from Slovene free text and explains why the Ripple Down Rules (RDR) approach is well suited to the task. When learning from a corpus of lemmatized Slovene words, the RDR approach yields easy-to-understand rules with improved classification accuracy compared to the rule learning results achieved in previous work.
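
The idea of suffix transformations organized as ripple down rules can be illustrated with a minimal sketch. It is not the implementation from the paper: the class name RDRNode, its fields, and the toy English rules are assumptions chosen only for readability (the paper itself works with Slovene words). Each rule fires when a word ends with a given suffix, and more specific exception rules override the more general rule they are attached to.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RDRNode:
    """One ripple down rule (hypothetical structure, for illustration only)."""
    suffix: str   # condition: the rule applies if the word ends with this
    strip: str    # ending to remove (assumed to be a suffix of `suffix`)
    add: str      # ending to append after stripping
    exceptions: List["RDRNode"] = field(default_factory=list)

    def lemmatize(self, word: str) -> Optional[str]:
        # The rule does not apply: let the caller fall back to its own conclusion.
        if not word.endswith(self.suffix):
            return None
        # Exceptions are more specific, so the first one that applies wins.
        for exc in self.exceptions:
            result = exc.lemmatize(word)
            if result is not None:
                return result
        # No exception applied: use this rule's own transformation.
        return word[: len(word) - len(self.strip)] + self.add


# Toy English rules (assumed, not taken from the paper).
root = RDRNode(suffix="", strip="", add="", exceptions=[   # default: keep word
    RDRNode(suffix="s", strip="s", add="", exceptions=[    # cats -> cat
        RDRNode(suffix="ies", strip="ies", add="y"),       # cities -> city
        RDRNode(suffix="ss", strip="", add=""),            # glass -> glass
    ]),
])

for w in ["cats", "cities", "glass", "walk"]:
    print(w, "->", root.lemmatize(w))
```

Because every exception is attached to the rule it refines, a new special case can be added locally without rewriting the more general rules, which is one reason such rule sets tend to stay readable.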