PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Productive Generation of Compound Words in Statistical Machine Translation
Sara Stymne and Nicola Cancedda
In: EMNLP 2011 Sixth Workshop on Statistical Machine Translation, 30-31 July 2011, Edinburgh, UK.


In many languages, compounding is highly productive. A common way to reduce data sparsity is to split compounds in the training data. Once compounds are split, however, the system risks translating their components into non-consecutive positions or in the wrong order, and a post-processing step of compound merging is required to reconstruct compound words in the output. We present a method for increasing the chances that components that should be merged are translated into contiguous positions and in the right order. We also propose new heuristic methods for merging components that outperform all known methods, and a learning-based method that has accuracy similar to that of the heuristic method, is better at producing novel compounds, and can operate with no background linguistic resources.
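
The split-then-merge pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' method: the marker symbol `@@`, the two-part dictionary-based splitter, and the function names `split_compound` and `merge_compounds` are all assumptions made for illustration. The sketch only shows why merging depends on the components being output contiguously and in the right order.

```python
MARKER = "@@"  # hypothetical symbol attached to a non-final compound part


def split_compound(word, vocabulary):
    """Split `word` into two known parts, marking the first part.

    Toy splitter (an assumption): tries every split point that leaves
    at least three characters on each side and accepts the first one
    where both halves are in `vocabulary`.
    """
    for i in range(3, len(word) - 2):
        head, tail = word[:i], word[i:]
        if head in vocabulary and tail in vocabulary:
            return [head + MARKER, tail]
    return [word]


def merge_compounds(tokens):
    """Join each marked token with the token that directly follows it."""
    merged, prefix = [], ""
    for token in tokens:
        if token.endswith(MARKER):
            # Accumulate marked parts until an unmarked token arrives.
            prefix += token[: -len(MARKER)]
        else:
            merged.append(prefix + token)
            prefix = ""
    if prefix:
        # A marked part at the end of the sentence has nothing to merge with.
        merged.append(prefix)
    return merged


vocabulary = {"fot", "boll"}
parts = split_compound("fotboll", vocabulary)
sentence = merge_compounds(["jag", "gillar"] + parts)
```

With the Swedish compound "fotboll" ("football"), splitting yields two tokens and merging restores the original word only if the translation system keeps the two parts adjacent and in order. If another token is emitted between them, this simple merger attaches the marked part to the wrong word, which is the failure mode the abstract sets out to reduce.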

EPrint Type: Conference or Workshop Item (Poster)
Subjects: Natural Language Processing
ID Code: 8875
Deposited By: Nicola Cancedda
Deposited On: 21 February 2012