Analysis of Morph-Based Speech Recognition and the Modeling of Out-of-Vocabulary Words Across Languages
Mathias Creutz, Teemu Hirsimäki, Mikko Kurimo, Antti Puurula, Janne Pylkkönen, Vesa Siivola, Matti Varjokallio, Ebru Arisoy, Murat Saraclar and Andreas Stolcke
In: Human Language Technologies / The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2007), 23-25 Apr 2007, Rochester, NY, USA.
We analyze subword-based language models (LMs) in
large-vocabulary continuous speech recognition across four
"morphologically rich" languages: Finnish, Estonian,
Turkish, and Egyptian Colloquial Arabic. Estimating n-gram LMs over sequences of morphs instead of words yields better vocabulary coverage and reduced data sparsity.
Standard word LMs suffer from high out-of-vocabulary (OOV) rates, whereas the morph LMs can recognize previously unseen word forms by
concatenating morphs. We show that the morph LMs generally
outperform the word LMs and that they perform
fairly well on OOVs without compromising the accuracy obtained for in-vocabulary words.
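
The coverage argument above can be illustrated with a minimal sketch. It assumes a hand-made toy segmentation of four Finnish word forms; the `SEGMENTATION` table, the helper names, and the tiny train/test lists are hypothetical and are not the segmentation method or data used in the paper. The sketch only compares OOV rates at the word level and the morph level; it does not train an n-gram LM.

```python
def oov_rate(tokens, vocabulary):
    """Return the fraction of tokens that are missing from the vocabulary."""
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in vocabulary)
    return unseen / len(tokens)


# Hypothetical, hand-made segmentations (illustrative only; not the output
# of any real morphological segmenter).
SEGMENTATION = {
    "talossa": ["talo", "ssa"],  # "in the house"
    "talon": ["talo", "n"],      # "of the house"
    "autossa": ["auto", "ssa"],  # "in the car"
    "auton": ["auto", "n"],      # "of the car"
}


def to_morphs(words):
    """Split each word into morphs; words without an entry stay whole."""
    morphs = []
    for word in words:
        morphs.extend(SEGMENTATION.get(word, [word]))
    return morphs


# Toy data: "autossa" never occurs in training as a whole word,
# but both of its morphs do.
train_words = ["talossa", "auton", "talon"]
test_words = ["autossa", "talon"]

word_vocab = set(train_words)
morph_vocab = set(to_morphs(train_words))

word_oov = oov_rate(test_words, word_vocab)
morph_oov = oov_rate(to_morphs(test_words), morph_vocab)

print(f"word-level OOV rate:  {word_oov:.2f}")
print(f"morph-level OOV rate: {morph_oov:.2f}")
```

In this toy setting the unseen word form is covered once the vocabulary consists of morphs, because it can be built by concatenating units that were seen in training. Real systems additionally need a way to mark word boundaries so that recognized morph sequences can be joined back into words.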