A lemmatization web service based on machine learning techniques
Lemmatization is the process of finding the normalized form (lemma) of a word from the surface word-forms that appear in running text. It is a useful pre-processing step for many language engineering tasks and is especially important for languages with rich inflectional morphology. This paper presents two approaches to automated word lemmatization, both of which use machine learning techniques to learn models for a particular language from pre-annotated data. One approach is based on Ripple Down Rules and the other on First-Order Decision Lists as learned by the CLog system. We have tested the two approaches on the Slovene language and set up a publicly accessible Web service for lemmatization using the generated models.
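
To make the Ripple Down Rules idea concrete, the following is a minimal sketch of how a rule tree with exceptions could map a word-form to its lemma. It is an illustration only, not the system described in this paper: the `Rule` class, the `lemmatize` function, and the hand-written English suffix rules are all assumptions chosen for readability, whereas the models discussed here are learned automatically from annotated Slovene data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Rule:
    """One node of a (hypothetical) ripple-down rule tree."""
    suffix: str       # condition: the word must end with this suffix
    replacement: str  # conclusion: the suffix is replaced by this string
    exceptions: List["Rule"] = field(default_factory=list)

    def apply(self, word: str) -> Optional[str]:
        # The rule does not fire if its condition is not met.
        if not word.endswith(self.suffix):
            return None
        # A more specific exception that fires overrides this rule.
        for exception in self.exceptions:
            result = exception.apply(word)
            if result is not None:
                return result
        # No exception fired: use this rule's own conclusion.
        stem = word[: len(word) - len(self.suffix)]
        return stem + self.replacement


# Illustrative, hand-written rules (assumed, not learned from data).
# The root rule always fires and leaves the word unchanged.
ROOT = Rule("", "", exceptions=[
    Rule("s", "", exceptions=[
        Rule("ies", "y"),   # cities -> city
        Rule("ss", "ss"),   # glass  -> glass
    ]),
    Rule("ing", ""),        # walking -> walk
])


def lemmatize(word: str) -> str:
    """Return the lemma proposed by the rule tree for a word-form."""
    result = ROOT.apply(word)
    return word if result is None else result


if __name__ == "__main__":
    for form in ["cats", "cities", "glass", "walking", "dog"]:
        print(form, "->", lemmatize(form))
```

The nesting is the point of the sketch: a general rule states the default behaviour, and each exception refines it for a narrower class of word-forms without rewriting the rule it overrides.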