## Abstract

Automatic speech recognition (ASR) is the process of converting a speech signal into a sequence of words by means of an algorithm implemented as a computer program. While ASR does work today and is commercially available, it is extremely sensitive to noise, talker variations, and environments. The current state-of-the-art automatic speech recognizers are based on generative models that capture some temporal dependencies, such as hidden Markov models (HMMs). While HMMs have been immensely important in the development of large-scale speech processing applications, and in particular speech recognition, their performance is far from that of a human listener. In this work we present a different approach to speech recognition, based not on the HMM but on recent advances in large margin and kernel methods. Despite their popularity, HMM-based approaches have several known drawbacks: convergence of the training algorithm (EM) to a local maximum, conditional independence of observations given the state sequence, and the fact that the likelihood is dominated by the observation probabilities, often leaving the transition probabilities unused~\cite{Ostendorf96,Young96}. However, the most acute weakness of HMMs for the speech recognition task is that they do not aim at minimizing the word error rate. Segment models were proposed by Ostendorf and her colleagues~\cite{Ostendorf96,OstendorfDiKi96} to address some of the shortcomings of HMMs. In summary, a segment model can be thought of as a higher-dimensional version of an HMM, where Markov states generate random sequences rather than a single random vector observation. The basic segment model includes an explicit segment-level duration distribution and a family of length-dependent joint distributions. The model proposed in this work generalizes the segment model approach, so it addresses the same shortcomings of the HMM already addressed by segment models.
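To make the conditional-independence drawback concrete, the following is a minimal sketch of HMM likelihood computation (the forward algorithm) on a toy discrete HMM; the matrices `A`, `B`, and `pi` are made-up numbers for illustration, not parameters from this work. Note how each emission term depends only on the current state, never on previous observations.

```python
import numpy as np

# Toy HMM with 2 hidden states and 3 discrete observation symbols.
# All numbers are hypothetical, chosen only so rows sum to one.
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])        # state transition probabilities
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])   # emission probabilities per state
pi = np.array([0.6, 0.4])         # initial state distribution

def forward_likelihood(obs):
    """Likelihood p(obs) via the forward algorithm.

    The emission factor B[:, o] depends only on the current hidden
    state: this is exactly the conditional-independence assumption
    of observations given the state sequence.
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()
```

As a sanity check, summing the likelihood over every possible observation sequence of a fixed length gives one, since emissions and transitions are normalized.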
In addition, our model addresses two important limitations of HMMs and segment models as learning algorithms: the direct minimization of the word error rate, and the convergence of the training algorithm to a global minimum rather than a local one. Moreover, our model is trained with a large margin algorithm, which was found to be robust to noise. Last, our model can easily be transformed into a non-linear model.
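The contrast with EM training can be illustrated on the simplest large-margin objective: the regularized hinge loss is convex, so subgradient descent converges to its global minimum rather than a local one. The toy data, step sizes, and regularization constant below are hypothetical, for illustration only, and the binary linear classifier is a simplification of the structured models used for speech.

```python
import numpy as np

# Toy linearly separable data (hypothetical numbers).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
LAM = 0.01  # regularization strength (illustrative choice)

def hinge_loss(w):
    """Regularized hinge loss: a convex function of w, so any
    local minimum found by (sub)gradient descent is global."""
    margins = 1 - y * (X @ w)
    return np.maximum(0.0, margins).mean() + LAM * (w @ w)

# Subgradient descent with a decaying step size.
w = np.zeros(2)
for t in range(1, 501):
    active = (y * (X @ w)) < 1                  # margin violators
    grad = -(y[active, None] * X[active]).sum(axis=0) / len(X) + 2 * LAM * w
    w -= (0.1 / np.sqrt(t)) * grad
```

After training, every example is classified with a positive margin, and the loss is below its value at the zero vector.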