PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Morph-Based Speech Recognition and Modeling of Out-of-Vocabulary Words Across Languages
Mathias Creutz, Teemu Hirsimäki, Mikko Kurimo, Antti Puurula, Janne Pylkkönen, Vesa Siivola, Matti Varjokallio, Ebru Arisoy, Murat Saraclar and Andreas Stolcke
ACM Transactions on Speech and Language Processing (TSLP) Volume 5, Number 1, 2007. ISSN 1550-4875


We explore the use of morph-based language models in large-vocabulary continuous speech recognition systems across four so-called "morphologically rich" languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. The morphs are subword units discovered in an unsupervised, data-driven way using the Morfessor algorithm. By estimating n-gram language models over sequences of morphs instead of words, the quality of the language model is improved through better vocabulary coverage and reduced data sparsity. Standard word models suffer from high out-of-vocabulary (OOV) rates, whereas the morph models can recognize previously unseen word forms by concatenating morphs. It is shown that the morph models do perform fairly well on OOVs without compromising the recognition accuracy on in-vocabulary words. The Arabic experiment constitutes the only exception, since here the standard word model outperforms the morph model. Differences in the data sets and the amount of data are discussed as a plausible explanation.
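The coverage argument in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the segmentations below are hand-written toy examples for a few Finnish word forms (they are not Morfessor output), and the names `SEGMENTATIONS`, `word_oov_rate` and `morph_oov_rate` are hypothetical. The sketch only compares how many test words are out of vocabulary for a word lexicon versus a morph lexicon built from the same training words.

```python
from typing import Dict, List, Set

# Hypothetical, hand-written segmentations for illustration only;
# a real system would obtain these from an unsupervised segmenter.
SEGMENTATIONS: Dict[str, List[str]] = {
    "talo": ["talo"],
    "talossa": ["talo", "ssa"],
    "taloissa": ["talo", "i", "ssa"],
    "auto": ["auto"],
    "autossa": ["auto", "ssa"],
    "autoissa": ["auto", "i", "ssa"],
}


def word_oov_rate(test_words: List[str], word_vocab: Set[str]) -> float:
    """Fraction of test words that are absent from the word vocabulary."""
    missing = [w for w in test_words if w not in word_vocab]
    return len(missing) / len(test_words)


def morph_oov_rate(test_words: List[str], morph_vocab: Set[str]) -> float:
    """Fraction of test words with at least one morph absent from the morph vocabulary."""
    missing = [
        w for w in test_words
        if any(m not in morph_vocab for m in SEGMENTATIONS[w])
    ]
    return len(missing) / len(test_words)


# Toy training and test data (assumed, not taken from the paper).
train_words = ["talo", "talossa", "auto", "autoissa"]
test_words = ["autossa", "taloissa", "talo", "auto"]

word_vocab = set(train_words)
morph_vocab = {m for w in train_words for m in SEGMENTATIONS[w]}

word_oov = word_oov_rate(test_words, word_vocab)
morph_oov = morph_oov_rate(test_words, morph_vocab)
print(f"word-level OOV rate:  {word_oov:.2f}")
print(f"morph-level OOV rate: {morph_oov:.2f}")
```

In this toy setup the unseen forms "autossa" and "taloissa" are missing from the word vocabulary, yet each can be assembled from morphs seen in training, which is the effect the abstract describes for previously unseen word forms.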

EPrint Type: Article
Project Keyword: UNSPECIFIED
Subjects: Natural Language Processing
ID Code: 3513
Deposited By: Mathias Creutz
Deposited On: 11 February 2008