Morphology-Aware Statistical Machine Translation Based on Morphs Induced in an Unsupervised Manner
Sami Virpioja, Jaakko J. Väyrynen, Mathias Creutz and Markus Sadeniemi
In: Machine Translation Summit XI, 10 - 14 Sep 2007, Copenhagen, Denmark.
In this paper, we apply a method of unsupervised morphology learning to a state-of-the-art phrase-based statistical machine translation
(SMT) system. In SMT, words are traditionally used as the smallest units of translation. Such a system generalizes poorly to word forms
that do not occur in the training data. This is particularly problematic for languages that are highly compounding, highly inflecting, or
both. An alternative is to use sub-word units, such as morphemes. We use the Morfessor algorithm to find statistical morpheme-like
units (called morphs) that can be used to reduce the size of the lexicon and improve the ability to generalize. Translation and
language models are trained directly on morphs instead of words. The approach is tested on three Nordic languages (Danish, Finnish,
and Swedish) that are included in the Europarl corpus, which consists of the proceedings of the European Parliament. In our
experiments, the morph model did not obtain higher BLEU scores than the standard word-based approach. Nonetheless, the
proposed morph-based solution has clear benefits: morphologically well-motivated structures (phrases) are learned, and the proportion
of words left untranslated is reduced.
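
As a rough illustration of why morph-level units can cover word forms that a word-level lexicon misses, the sketch below rewrites words as morph tokens and checks an unseen word form against both vocabularies. It is a toy under stated assumptions, not the system described in the paper: the segmentation table is hand-written (in the paper, segmentations come from Morfessor), and the "+" continuation marker and the function names are hypothetical conventions chosen here so that words can be rebuilt after translation.

```python
# Toy sketch: word-level vs. morph-level vocabulary coverage.
# ASSUMPTION: segmentations are hand-written for illustration; the paper
# obtains them with Morfessor. The "+" marker is also an assumed convention.

TOY_SEGMENTATIONS = {
    "talossa": ["talo", "ssa"],           # "in the house"
    "talossani": ["talo", "ssa", "ni"],   # "in my house"
    "autossa": ["auto", "ssa"],           # "in the car"
    "autossani": ["auto", "ssa", "ni"],   # "in my car"
}

MARKER = "+"  # marks a morph that is continued by the next token


def words_to_morphs(sentence, segmentations):
    """Replace each word by its morphs; non-final morphs get the marker."""
    tokens = []
    for word in sentence.split():
        morphs = segmentations.get(word, [word])  # unknown words stay whole
        tokens.extend(m + MARKER for m in morphs[:-1])
        tokens.append(morphs[-1])
    return tokens


def morphs_to_words(tokens):
    """Join marked morph tokens back into whitespace-separated words."""
    words, current = [], ""
    for tok in tokens:
        if tok.endswith(MARKER):
            current += tok[: -len(MARKER)]
        else:
            words.append(current + tok)
            current = ""
    if current:  # dangling marked morph at the end of the sequence
        words.append(current)
    return " ".join(words)


# "Training" data contains three word forms; "autossani" is unseen.
train_words = ["talossa", "talossani", "autossa"]
word_vocab = set(train_words)
morph_vocab = {
    tok for w in train_words for tok in words_to_morphs(w, TOY_SEGMENTATIONS)
}

unseen = "autossani"
unseen_tokens = words_to_morphs(unseen, TOY_SEGMENTATIONS)
covered_as_word = unseen in word_vocab
covered_as_morphs = all(tok in morph_vocab for tok in unseen_tokens)
print(unseen_tokens, covered_as_word, covered_as_morphs)
```

In this toy data the unseen word form is missing from the word vocabulary, while every one of its morph tokens already occurs in the training data, which is the kind of generalization the morph-based models aim for. The marker makes the segmentation reversible, so morph-level output can be joined back into words before evaluation.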