PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Unsupervised Morpheme Analysis Evaluation by IR experiments -- Morpho Challenge 2008
Mikko Kurimo and Ville Turunen
In: Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark (2008).


This paper presents the evaluation and results of Competition 2 (information retrieval experiments) in Morpho Challenge 2008. Competition 1 (a comparison to a linguistic gold standard) is described in a companion paper. The goal of Morpho Challenge 2008 was to find and evaluate unsupervised machine learning algorithms that provide morpheme analyses for words in different languages. Morpheme analysis can be important in several applications where a large vocabulary is needed. Especially in morphologically complex languages, such as Finnish, Turkish and Arabic, agglutination, inflection, and compounding easily produce millions of different word forms, which is clearly too many for building an effective vocabulary and training probabilistic models of the relations between words. The benefits of successful morpheme analysis can be seen, for example, in speech recognition, information retrieval, and machine translation. In Morpho Challenge 2008 the morpheme analyses submitted by the Challenge participants were evaluated by performing information retrieval experiments, in which the words in the documents and queries were replaced by their proposed morpheme representations and the search was based on morphemes instead of words. The results indicate that morpheme analysis has a significant effect on IR performance in all tested languages (Finnish, English and German). The best unsupervised and language-independent morpheme analysis methods can also rival the best language-dependent word normalization methods. The Morpho Challenge was part of the EU Network of Excellence PASCAL Challenge Program and was organized in collaboration with CLEF.

EPrint Type:Conference or Workshop Item (Paper)
Subjects:Computational, Information-Theoretic Learning with Statistics; Natural Language Processing; Theory & Algorithms; Information Retrieval & Textual Information Access
ID Code:4304
Deposited By:Mikko Kurimo
Deposited On:13 March 2009