PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Inducing the Morphological Lexicon of a Natural Language from Unannotated Text
Mathias Creutz and Krista Lagus
In: International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning, 15-17 Jun 2005, Espoo, Finland.


This work presents an algorithm for the unsupervised learning, or induction, of a simple morphology of a natural language. The algorithm uses a probabilistic maximum a posteriori (MAP) model that builds hierarchical representations for a set of morphs, which are morpheme-like units discovered from unannotated text corpora. The induced morph lexicon stores parameters related to both the "meaning" and the "form" of the morphs it contains, and these parameters affect the role of the morphs in words. The model is applied to the task of unsupervised morpheme segmentation of Finnish and English words. Very good results are obtained for Finnish, and almost as good results for English.
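
The segmentation task named in the abstract can be illustrated with a minimal sketch. This is not the authors' model: it assumes a fixed, hand-written morph lexicon with invented probabilities and simply picks the highest-probability split of a word by dynamic programming, whereas the paper induces the lexicon itself from unannotated text and also models morph "meaning" and "form". The names `TOY_LEXICON` and `segment` are hypothetical.

```python
import math

# Hypothetical toy lexicon: morph -> probability. The values are invented
# for illustration only and are not taken from the paper.
TOY_LEXICON = {
    "talo": 0.2,  # Finnish stem, "house"
    "auto": 0.2,  # Finnish stem, "car"
    "ssa": 0.2,   # inessive case ending
    "i": 0.1,     # plural marker
    "t": 0.1,     # nominative plural ending
    "n": 0.2,     # genitive ending
}


def segment(word, lexicon):
    """Return the most probable split of `word` into lexicon morphs.

    Returns None when the word cannot be covered by lexicon morphs.
    """
    n = len(word)
    # best[i] holds (log-probability, morph list) for the best split of word[:i]
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            morph = word[start:end]
            if best[start] is None or morph not in lexicon:
                continue
            score = best[start][0] + math.log(lexicon[morph])
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [morph])
    return None if best[n] is None else best[n][1]


print(segment("taloissa", TOY_LEXICON))
print(segment("autot", TOY_LEXICON))
```

In this toy setting the lexicon is given in advance; the harder problem the paper addresses is discovering such a lexicon, together with its parameters, without any annotated data.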

EPrint Type: Conference or Workshop Item (Talk)
Additional Information: Pp. 106-113.
Project Keyword: UNSPECIFIED
Subjects: Natural Language Processing
ID Code: 1676
Deposited By: Krista Lagus
Deposited On: 28 November 2005