PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

A RULE BASED APPROACH TO WORD LEMMATIZATION
Joel Plisson, Nada Lavrac and Dunja Mladenić
In: SiKDD 2004 at multiconference IS-2004, 12-15 October, Ljubljana, Slovenia.

Abstract

Lemmatization is the process of finding the normalized form of a word. It can be viewed as finding the transformation that, applied to a word, yields its normalized form. The approach presented in this paper focuses on word endings: which suffix should be removed and/or added to obtain the normalized form. The paper compares two word lemmatization algorithms, one based on if-then rules and the other on Ripple Down Rules (RDR) induction. It presents the problem of lemmatizing words from Slovene free text and explains why the RDR approach is well suited to the task. When learning from a corpus of lemmatized Slovene words, the RDR approach produces easy-to-understand rules with higher classification accuracy than the rule learning results achieved in previous work.
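
The suffix-based view of lemmatization and the ripple down rule structure described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the names `apply_transformation`, `Rule`, `classify` and `lemmatize` are hypothetical, and the toy rules are hand-written English examples rather than rules induced from the Slovene corpus. Each rule tests a word ending and proposes a (remove, add) suffix pair; an exception branch refines a rule that fired, and an else branch is tried when a rule did not fire, so the last rule satisfied along the path decides the result.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def apply_transformation(word: str, remove: str, add: str) -> str:
    """Strip the suffix `remove` from `word`, then append `add`."""
    if not word.endswith(remove):
        raise ValueError(f"{word!r} does not end with {remove!r}")
    return word[: len(word) - len(remove)] + add


@dataclass
class Rule:
    """One node of a ripple down rule tree (hypothetical structure)."""
    suffix: str                          # condition: the word ends with this
    remove: str                          # suffix to strip if the rule wins
    add: str                             # suffix to append if the rule wins
    except_rule: Optional["Rule"] = None  # refinement tried when the rule fires
    else_rule: Optional["Rule"] = None    # alternative tried when it does not


def classify(rule: Optional[Rule], word: str) -> Optional[Tuple[str, str]]:
    """Return the (remove, add) pair of the last rule satisfied on the path."""
    result = None
    while rule is not None:
        if word.endswith(rule.suffix):
            result = (rule.remove, rule.add)
            rule = rule.except_rule
        else:
            rule = rule.else_rule
    return result


def lemmatize(root: Rule, word: str) -> str:
    """Apply the transformation chosen by the rule tree, if any."""
    transformation = classify(root, word)
    if transformation is None:
        return word
    return apply_transformation(word, *transformation)


# Toy English rules, assumed for illustration only:
# default: leave the word unchanged
#   except if it ends in "s": drop the "s"
#     except if it ends in "ies": replace "ies" with "y"
#     else if it ends in "ss": leave the word unchanged
TOY_RULES = Rule(
    suffix="", remove="", add="",
    except_rule=Rule(
        suffix="s", remove="s", add="",
        except_rule=Rule(
            suffix="ies", remove="ies", add="y",
            else_rule=Rule(suffix="ss", remove="", add=""),
        ),
    ),
)

if __name__ == "__main__":
    for w in ["cats", "ponies", "glass", "walk"]:
        print(w, "->", lemmatize(TOY_RULES, w))
```

Because exceptions are attached locally to the rule they refine, a tree of this kind can be read top-down as a general rule followed by progressively more specific corrections, which is one plausible reading of why such rules are described as easy to understand.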

EPrint Type: Conference or Workshop Item (Paper)
Subjects: Natural Language Processing; Information Retrieval & Textual Information Access
ID Code: 715
Deposited By: Dunja Mladenić
Deposited On: 30 December 2004