Decoding speech in the presence of other sources
J.P. Barker, M.P. Cooke and D.P.W. Ellis
The statistical theory of speech recognition introduced several decades ago has brought about low word error rates for clean speech. However, it has been less successful in noisy conditions. Since extraneous acoustic sources are present in virtually all everyday speech communication conditions, the failure of the speech recognition model to take noise into account is perhaps the most serious obstacle to the application of ASR technology.
Approaches to noise-robust speech recognition have traditionally taken one of two forms. One set of techniques attempts to estimate the noise and remove its effects from the target speech. While noise estimation can work in low-to-moderate levels of slowly-varying noise, it fails completely in louder or more variable conditions. A second approach utilises noise models and attempts to decode speech taking into account their presence. Again, model-based techniques can work for simple noises, but they are computationally complex under realistic conditions and require models for all sources present in the signal.
In this paper, we propose a statistical theory of speech recognition in the presence of other acoustic sources. Unlike earlier model-based approaches, our framework makes no assumptions about the noise background, although it can exploit such information if it is available. It does not require models for background sources, nor an estimate of their number. The new approach extends statistical ASR by introducing a segregation model in addition to the conventional acoustic and language models. While the conventional statistical ASR problem is to find the most likely sequence of speech models which generated a given observation sequence, the new approach additionally determines the most likely set of signal fragments which make up the speech signal. Although the framework is completely general, we provide one interpretation of the segregation model based on missing-data theory. We derive an efficient HMM decoder which searches both across subword states and across alternative segregations of the signal between target and interference. We call this modified system the speech fragment decoder.
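As a rough sketch of the contrast drawn above, the two search problems can be written side by side. The symbols here are illustrative assumptions and are not taken from the abstract: $W$ is a word (speech model) sequence, $X$ a clean-speech observation sequence, $Y$ the observed mixture, and $S$ a segregation hypothesis that labels each signal fragment as target speech or interference.

```latex
% Conventional statistical ASR: search over word sequences only.
\[
  \hat{W} \;=\; \arg\max_{W} P(W \mid X)
          \;=\; \arg\max_{W} P(X \mid W)\, P(W)
\]

% Fragment-based decoding (notation assumed): joint search over
% word sequences W and segregation hypotheses S of the mixture Y.
\[
  (\hat{W}, \hat{S}) \;=\; \arg\max_{W,\,S} P(W, S \mid Y)
\]
```

Under the assumption that each fragment receives a binary speech/background label, a signal with $N$ fragments admits $2^{N}$ candidate segregations, which suggests why a decoder that searches over states and segregations together is needed in place of exhaustive enumeration.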
The value of the speech fragment decoder approach has been verified through experiments on small-vocabulary tasks in high-noise conditions. For instance, in a noise-corrupted connected digit task, the new approach decreases the word error rate in the condition of factory noise at 5 dB SNR from over 59% for a standard ASR system to less than 22%.
|Deposited By:||Jon Barker|
|Deposited On:||29 December 2005|