PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Morfessor and Hutmegs: unsupervised morpheme segmentation for highly-inflecting and compounding languages
Mathias Creutz, Krista Lagus, Krister Lindén and Sami Virpioja
In: Second Baltic Conference on HUMAN LANGUAGE TECHNOLOGIES, 4-6 Apr 2005, Tallinn, Estonia.

Abstract

In this work, we announce the Morfessor 1.0 software package, a program that takes a corpus of raw text as input and produces a segmentation of the word forms observed in the text. The resulting segmentation often resembles a linguistic morpheme segmentation. In addition, we briefly describe the Hutmegs package, which is also publicly available for research purposes. Hutmegs contains semi-automatically produced gold-standard (that is, correct) morpheme segmentations for a large number of Finnish and English word forms. An easy way for readers to familiarize themselves with our work is to try the demonstration program on our Internet site, which shows how Morfessor segments words that the user types in.
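
To make the data involved concrete, the following is a minimal Python sketch of the kind of input and output the abstract describes: word forms collected from raw text, and segmentations represented as lists of morphs. It is not the Morfessor algorithm and does not use the Morfessor or Hutmegs file formats. The function names, the example words, and the hand-written segmentations are all hypothetical, and scoring by boundary precision and recall is an assumption about how a gold standard such as Hutmegs might be used, not something the abstract states.

```python
import re
from collections import Counter


def word_forms(raw_text):
    """Count the distinct word forms in raw text (lowercased, letters only)."""
    return Counter(re.findall(r"[^\W\d_]+", raw_text.lower()))


def boundaries(morphs):
    """Return the set of split positions implied by a list of morphs."""
    positions, offset = set(), 0
    for morph in morphs[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_precision_recall(proposed, gold):
    """Compare proposed split positions with gold-standard ones (assumed metric)."""
    p, g = boundaries(proposed), boundaries(gold)
    hits = len(p & g)
    precision = hits / len(p) if p else 1.0
    recall = hits / len(g) if g else 1.0
    return precision, recall


corpus = "The walkers walked; the talkers talked."
forms = word_forms(corpus)

# Hypothetical segmentations, written by hand for illustration only.
proposed = {"walkers": ["walk", "ers"], "walked": ["walk", "ed"]}
gold = {"walkers": ["walk", "er", "s"], "walked": ["walk", "ed"]}

for word in proposed:
    print(word, forms[word], boundary_precision_recall(proposed[word], gold[word]))
```

Representing a segmentation as a list of morphs keeps the original word form recoverable by concatenation, and reduces a comparison against a gold standard to a comparison of split positions.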

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: Computational, Information-Theoretic Learning with Statistics; Natural Language Processing
ID Code: 1841
Deposited By: Mathias Creutz
Deposited On: 29 November 2005