Productive Generation of Compound Words in Statistical Machine Translation
Sara Stymne and Nicola Cancedda
In: EMNLP 2011 Sixth Workshop on Statistical Machine Translation, 30-31 July 2011, Edinburgh, UK.
In many languages the use of compound words is very productive. A common practice for reducing sparsity is to split compounds in the training data. When this is done, the system risks translating components into non-consecutive positions, or in the wrong order. Furthermore, a post-processing step of compound merging is required to reconstruct compound words in the output. We present a method for increasing the chances that components that should be merged are translated into contiguous positions and in the right order. We also propose new heuristic methods for merging components that outperform all known methods, and a learning-based method that has accuracy similar to that of the heuristic method, is better at producing novel compounds, and can operate with no background linguistic resources.
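
To make the merging step concrete, the following is a minimal illustrative sketch, not the method proposed in the paper. It assumes a simple marking scheme in which every non-final compound component in the system output carries a trailing marker; the marker string `@@` and the function name `merge_marked` are hypothetical choices made for this example only.

```python
# Illustrative sketch only: a naive marker-based merger, assuming that
# non-final compound components end in a (hypothetical) marker "@@".
MARK = "@@"


def merge_marked(tokens):
    """Join each marked component with the token that follows it."""
    merged = []
    pending = ""  # components collected so far for the current compound
    for tok in tokens:
        if tok.endswith(MARK):
            # Non-final component: strip the marker and keep collecting.
            pending += tok[: -len(MARK)]
        else:
            # Final component or ordinary word: emit with any collected prefix.
            merged.append(pending + tok)
            pending = ""
    if pending:
        # Dangling component at the end of the sentence: emit it on its own.
        merged.append(pending)
    return merged


print(" ".join(merge_marked(["ein", "verkehrs@@", "unfall"])))
```

A merger this simple joins a marked component with whatever token follows it, so it fails in exactly the cases the abstract describes: when components are translated into non-consecutive positions or in the wrong order.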