Analysis of Morph-Based Speech Recognition and the Modeling of Out-of-Vocabulary Words Across Languages
Mathias Creutz, Teemu Hirsimäki, Mikko Kurimo, Antti Puurula, Janne Pylkkönen, Vesa Siivola, Matti Varjokallio, Ebru Arisoy, Murat Saraclar and Andreas Stolcke
In: NAACL-HLT 2007, 22-27 April 2007, Rochester, NY, USA.
We analyze subword-based language
models (LMs) in large-vocabulary
continuous speech recognition across
four “morphologically rich” languages:
Finnish, Estonian, Turkish, and Egyptian
Colloquial Arabic. By estimating n-gram
LMs over sequences of morphs instead
of words, we obtain better vocabulary
coverage and reduced data sparsity.
Standard word LMs suffer from high
out-of-vocabulary (OOV) rates, whereas
the morph LMs can recognize previously
unseen word forms by concatenating
morphs. We show that the morph LMs
generally outperform the word LMs and
that they perform fairly well on OOVs
without compromising the accuracy
obtained for in-vocabulary words.
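
The contrast between a word vocabulary and a morph vocabulary can be illustrated with a minimal sketch. It is not the authors' system: the word list, the morph inventory, and the helper names `oov_rate` and `can_compose` are all hypothetical, and the morphs are hand-picked for illustration rather than learned from data. The sketch only shows why unseen word forms count as OOV for a word LM while remaining reachable as concatenations of known morphs.

```python
def oov_rate(test_tokens, vocabulary):
    """Fraction of test tokens that are missing from the vocabulary."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for tok in test_tokens if tok not in vocabulary)
    return missing / len(test_tokens)


def can_compose(word, morphs):
    """True if `word` can be written as a concatenation of items in `morphs`."""
    # reachable[i] is True when the prefix word[:i] is a concatenation of morphs.
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            if reachable[start] and word[start:end] in morphs:
                reachable[end] = True
                break
    return reachable[len(word)]


# Toy, made-up data (Finnish-like forms); assumptions, not taken from the paper.
word_vocab = {"talo", "talossa", "auto"}
morph_vocab = {"talo", "auto", "ssa", "sta"}
test_words = ["talossa", "autossa", "talosta", "autosta"]

# Share of test words the word vocabulary cannot cover.
word_oov = oov_rate(test_words, word_vocab)

# Share of test words that cannot be built from the morph inventory.
morph_uncovered = sum(
    1 for w in test_words if not can_compose(w, morph_vocab)
) / len(test_words)

print(f"word-level OOV rate:    {word_oov:.2f}")
print(f"morph-level uncovered:  {morph_uncovered:.2f}")
```

In this toy setting, three of the four test forms are missing from the word vocabulary, yet all four can be assembled from the four morphs. A real system would also need a segmentation method and an n-gram model over the morph sequences, which the sketch leaves out.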