PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Unlimited vocabulary speech recognition for agglutinative languages
Mikko Kurimo, Antti Puurula, Ebru Arisoy, Vesa Siivola, Teemu Hirsimäki, Janne Pylkkönen, Tanel Alumäe and Murat Saraclar
In: Human Language Technology, Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2006, June 5-7, 2006, New York, USA.



It is practically impossible to build a word-based lexicon for speech recognition in agglutinative languages that covers all the relevant words. The problem is that words are generally built by concatenating several prefixes and suffixes to the word roots. Together with compounding and inflection, this leads to millions of different but still frequent word forms. Because of inflection, ambiguity and other phenomena, it is also not trivial to split the words automatically into meaningful parts. Rule-based morphological analyzers can perform this splitting, but because they rely on handcrafted rules, they also suffer from an out-of-vocabulary problem. In this paper we apply a recently proposed, fully automatic and fairly language- and vocabulary-independent method for building subword lexica to three different agglutinative languages. We demonstrate the language portability by building a successful large-vocabulary speech recognizer for each language, and we show superior recognition performance compared to the corresponding word-based reference systems.
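The coverage argument in the abstract can be illustrated with a toy sketch. This is not the segmentation method used in the paper: the lexica below are hypothetical, hand-picked Finnish examples, and `segment` is an assumed helper that simply checks whether a word can be assembled from lexicon units. It only shows why a small subword lexicon can cover word forms that a word lexicon of similar size misses.

```python
def segment(word, units):
    """Return a split of `word` into the fewest lexicon units, or None if impossible."""
    # best[i] holds the shortest known segmentation of word[:i], or None.
    best = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in units:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(word)]


# Hypothetical toy lexica (assumptions for illustration only).
word_lexicon = {"talo", "auto", "talossa", "autosta"}
subword_lexicon = {"talo", "auto", "ssa", "sta", "i"}

# Inflected forms of "talo" (house) and "auto" (car).
test_words = ["talossa", "autossa", "talosta", "autosta", "taloissa", "autoissa"]

# A word-based lexicon covers a form only if that exact form was listed.
word_oov = [w for w in test_words if w not in word_lexicon]

# A subword lexicon covers a form if it can be built from known units.
subword_oov = [w for w in test_words if segment(w, subword_lexicon) is None]

print("word-based OOV:", word_oov)
print("subword-based OOV:", subword_oov)
```

In this sketch the word lexicon misses every form it has not seen verbatim, while five subword units cover all six test forms. A real system would learn the units from data and would still fail on words containing unseen stems.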

EPrint Type: Conference or Workshop Item (Oral)
Project Keyword: Unspecified
Subjects: User Modelling for Computer Human Interaction; Natural Language Processing
ID Code: 2196
Deposited By: Mikko Kurimo
Deposited On: 18 September 2006
