Choosing an optimal architecture for segmentation and POS-tagging of modern Hebrew
Roy Bar Haim, Khalil Sima'an and Yoad Winter
In: ACL 2005 Workshop on Computational Approaches to Semitic Languages, 29 June 2005, Ann Arbor, Michigan, USA.
A major architectural decision in designing a disambiguation model for segmentation and Part-of-Speech (POS) tagging in Semitic languages concerns the choice of the input-output terminal symbols over which the probability distributions are defined. In this paper we develop a segmenter and a tagger for Hebrew based on Hidden Markov Models (HMMs). We start out from a morphological analyzer and a very small morphologically annotated corpus. We show that a model whose terminal symbols are word segments (i.e., morphemes) is advantageous over a word-level model for the task of POS tagging. However, for segmentation alone, the morpheme-level model has no significant advantage over the word-level model. Error analysis shows that neither model is adequate for resolving a common type of segmentation ambiguity in Hebrew: whether or not a word in a written text is prefixed by a definiteness marker. Hence, we propose a morpheme-level model in which the definiteness morpheme is treated as a possible feature of morpheme terminals. This model exhibits the best overall performance, both in POS tagging and in segmentation. Despite the small size of the annotated corpus available for Hebrew, the results achieved using our best model are on par with recent results on Modern Standard Arabic.
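To make the three choices of terminal symbols concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a toy transliterated word "bbit" with two candidate analyses (with and without an unwritten definiteness marker); the transliteration, the tag names (IN, NN, DEF), and all function names are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Morpheme = Tuple[str, str]  # (segment, POS tag); tag names are hypothetical

DEF_SEGMENT = "h"  # assumed transliteration of the definiteness marker


def word_level(word: str, analysis: List[Morpheme]) -> Tuple[str, str]:
    """Word-level terminal: the whole surface word with one composite tag."""
    return (word, "+".join(tag for _, tag in analysis))


def morpheme_level(analysis: List[Morpheme]) -> List[Morpheme]:
    """Morpheme-level terminals: one (segment, tag) pair per morpheme."""
    return list(analysis)


def definiteness_as_feature(analysis: List[Morpheme]) -> List[Morpheme]:
    """Morpheme-level terminals where the definiteness marker is not a
    separate terminal but a feature on the tag of the following morpheme."""
    terminals: List[Morpheme] = []
    pending_def = False
    for segment, tag in analysis:
        if segment == DEF_SEGMENT and tag == "DEF":
            pending_def = True  # remember the marker, emit nothing yet
            continue
        terminals.append((segment, tag + "+DEF" if pending_def else tag))
        pending_def = False
    return terminals


# Two candidate analyses of the same written word (toy data).
definite = [("b", "IN"), ("h", "DEF"), ("bit", "NN")]  # "in the house"
indefinite = [("b", "IN"), ("bit", "NN")]              # "in a house"

for analysis in (definite, indefinite):
    print(word_level("bbit", analysis))
    print(morpheme_level(analysis))
    print(definiteness_as_feature(analysis))
```

Under the third representation both analyses yield the same segment sequence and differ only in the tag of the final morpheme, so in this sketch the definiteness decision becomes part of tagging rather than of segmentation.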