PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Soundex-based translation correction in Urdu-English cross-language information retrieval
Manaal Faruqui, Prasenjit Majumder and Sebastian Pado
In: Fifth International Workshop On Cross Lingual Information Access, Chiang Mai, Thailand (2011).


Cross-language information retrieval is difficult for languages such as Urdu that have few processing tools or resources. Google Translate provides an easy way of translating content words, but due to lexicon limitations, named entities (NEs) are transliterated letter by letter. The resulting NE errors (zynydyny zdn for Zinedine Zidane) hurt retrieval. We propose to replace English non-words in the translation output. First, we determine phonetically similar English words with the Soundex algorithm. Then, we choose among them by a modified Levenshtein distance that models correct transliteration patterns. This strategy yields an improvement of about 4 percentage points in MAP (from 41.2 to 45.1; monolingual: 51.4) on the FIRE-2010 dataset.
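
The two-step correction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Soundex variant is the standard American one, and the edit-distance weights (a cost of 0.5 for inserting a vowel or for replacing "y" with a vowel) are assumptions chosen only to mimic the idea that Urdu transliterations tend to drop or blur vowels. The names weighted_edit_distance and correct, and the toy lexicon, are likewise hypothetical.

```python
# Sketch: Soundex candidate lookup followed by a weighted edit distance.
# All weights and helper names are illustrative assumptions.

VOWELS = set("aeiou")

_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Standard American Soundex: first letter plus three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # 'h' and 'w' do not separate equal codes
            prev = code
    return (result + "000")[:4]


def weighted_edit_distance(source, target, cheap=0.5):
    """Levenshtein distance from a transliterated token to a candidate.

    Assumed weights: inserting a vowel of the candidate, or replacing
    'y' with a vowel, costs `cheap`; every other edit costs 1.
    """
    prev = [0.0]
    for t in target:
        prev.append(prev[-1] + (cheap if t in VOWELS else 1.0))
    for s in source:
        cur = [prev[0] + 1.0]  # delete s
        for j, t in enumerate(target, start=1):
            if s == t:
                sub = 0.0
            elif s == "y" and t in VOWELS:
                sub = cheap
            else:
                sub = 1.0
            ins = cheap if t in VOWELS else 1.0
            cur.append(min(prev[j] + 1.0,      # delete s
                           cur[j - 1] + ins,   # insert t
                           prev[j - 1] + sub)) # substitute s with t
        prev = cur
    return prev[-1]


def correct(token, lexicon):
    """Replace a non-word by the closest lexicon word with the same Soundex code."""
    token = token.lower()
    if token in lexicon:
        return token
    code = soundex(token)
    candidates = [w for w in lexicon if soundex(w) == code]
    if not candidates:
        return token  # nothing phonetically similar; leave unchanged
    return min(candidates, key=lambda w: weighted_edit_distance(token, w))


if __name__ == "__main__":
    toy_lexicon = ["zinedine", "zidane", "sedan"]
    print([correct(t, toy_lexicon) for t in "zynydyny zdn".split()])
```

With the toy lexicon, both tokens of the abstract's example share a Soundex code with exactly one entry (Z535 for zynydyny and zinedine, Z350 for zdn and zidane), so the distance only acts as a tie-breaker once a realistic lexicon yields several candidates per code.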

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: Natural Language Processing
ID Code: 8486
Deposited By: Sebastian Pado
Deposited On: 16 February 2012