Large Margin Algorithms for Discriminative Continuous Speech Recognition
PhD thesis, The Hebrew University.
Automatic speech recognition (ASR) has long been considered a dream. Although ASR works today and is commercially available, it is extremely sensitive to noise, talker variation, and changes in environment. Current state-of-the-art automatic speech recognizers are based on generative models that capture some temporal dependencies, such as hidden Markov models (HMMs). While HMMs have been immensely important in the development of large-scale speech processing applications, and of speech recognition in particular, their performance is far from that of a human listener. HMMs have several drawbacks, both as models of the speech signal and as learning algorithms. This dissertation develops fundamental algorithms for continuous speech recognition that are not based on HMMs. These algorithms build on the latest advances in large margin and kernel methods, and they aim at minimizing the error induced by the speech recognition problem.
The introduction reviews the current state of HMM-based automatic speech recognition and its limitations. We also present the advantages of large margin and kernel methods and give a short outline of the thesis.
In Chapter 2 we present large-margin algorithms for the task of hierarchical phoneme classification. Phonetic theory embeds the set of phonemes of Western languages in a phonetic hierarchy in which the phonemes constitute the leaves of the tree, while broad phonetic groups (such as vowels and consonants) correspond to internal vertices. Motivated by this phonetic structure, we propose a hierarchical model that incorporates the notion of similarity between phonemes and between phonetic groups. As in large margin methods, we associate a vector in a high dimensional space with each phoneme or phoneme group in the hierarchy. We call this vector the prototype of the phoneme or the phoneme group, and classify feature vectors according to their similarity to the various prototypes. We relax the requirements of correct classification to large margin constraints and attempt to find prototypes that comply with these constraints. In the spirit of Bayesian methods, we impose similarity requirements between the prototypes corresponding to adjacent phonemes in the hierarchy. The result is an algorithmic solution that may tolerate minor mistakes, such as predicting a sibling of the correct phoneme, but avoids gross errors, such as predicting a vertex in a completely different part of the tree. The hierarchical phoneme classifier is an important tool in the subsequent tasks of speech-to-phoneme alignment and keyword spotting.
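To make the distinction between minor and gross mistakes concrete, the following minimal sketch measures the number of edges on the tree path between two phonemes. It is only an illustration: the toy hierarchy, the phoneme labels, and the function names `path_to_root` and `tree_distance` are assumptions introduced here, not the hierarchy or the error measure used in the thesis.

```python
# Hypothetical toy hierarchy (child -> parent); illustrative only.
PARENT = {
    "vowels": "root",
    "consonants": "root",
    "iy": "vowels",
    "ih": "vowels",
    "s": "consonants",
    "z": "consonants",
}


def path_to_root(node):
    """Return the vertices from `node` up to and including the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(u, v):
    """Number of edges on the unique tree path between u and v."""
    up_u = path_to_root(u)
    up_v = path_to_root(v)
    # Steps needed to reach each ancestor of u (u itself has 0 steps).
    steps_from_u = {node: i for i, node in enumerate(up_u)}
    # The first ancestor of v that is also an ancestor of u is the
    # lowest common ancestor; add the steps taken on both sides.
    for j, node in enumerate(up_v):
        if node in steps_from_u:
            return steps_from_u[node] + j
    raise ValueError("vertices are not in the same tree")


if __name__ == "__main__":
    print(tree_distance("iy", "ih"))  # sibling confusion
    print(tree_distance("iy", "s"))   # confusion across broad groups
```

Under this assumed measure, confusing two siblings costs less than confusing a vowel with a consonant, which is the kind of graded penalty the paragraph above describes.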
In Chapter 3 we address the speech-to-phoneme alignment problem, namely the proper positioning of a sequence of phonemes in relation to a corresponding continuous speech signal (this problem is also referred to as ``forced alignment''). Speech-to-phoneme alignment is an important tool for labeling speech datasets and for training speech recognition systems. Conceptually, alignment is a fundamental problem in speech recognition: any speech recognizer can theoretically be built from a speech-to-phoneme alignment algorithm, simply by evaluating all possible alignments of all possible phoneme sequences and choosing the phoneme sequence that attains the best confidence. The alignment function we devise is based on mapping the speech signal and its phoneme representation, along with the target alignment, into an abstract vector space. Building on techniques used for learning SVMs, our alignment function reduces to a classifier in this vector space, which is aimed at separating correct alignments from incorrect ones. We describe a simple online algorithm for learning the alignment function and discuss its formal properties. We show that the large margin speech-to-phoneme alignment algorithm outperforms the standard HMM method.
In Chapter 4 we present a discriminative algorithm for phoneme sequence recognition, which aims at minimizing the Levenshtein distance (edit distance) between the phoneme sequence predicted by the model and the correct one.
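For readers unfamiliar with the Levenshtein distance, the sketch below computes it with the standard dynamic program over two phoneme sequences. It illustrates the quantity being minimized, not the learning algorithm of Chapter 4; the function name `levenshtein` and the example phoneme labels are assumptions made for illustration.

```python
def levenshtein(predicted, target):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `predicted` into `target`.

    Both arguments are sequences (e.g. lists of phoneme labels).
    """
    # prev[j] is the distance between the first i-1 predicted symbols
    # and the first j target symbols; row 0 is all insertions.
    prev = list(range(len(target) + 1))
    for i, p in enumerate(predicted, start=1):
        curr = [i]  # distance to the empty target prefix: i deletions
        for j, t in enumerate(target, start=1):
            substitution = prev[j - 1] + (0 if p == t else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # One substitution ("ae" -> "ah") and one insertion ("s").
    print(levenshtein(["k", "ae", "t"], ["k", "ah", "t", "s"]))
```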
In Chapter 5 we present an algorithm for finding a word in a continuous spoken utterance. The algorithm builds on the algorithms of the previous chapters, and keyword spotting is the first task to demonstrate the advantages of discriminative speech recognition. The performance of a keyword spotting system is often measured by the area under the Receiver Operating Characteristic (ROC) curve, and our discriminative keyword spotter aims at maximizing it. Moreover, our algorithm directly solves the keyword spotting problem (rather than using a large vocabulary speech recognizer) and does not estimate any garbage or background model. We show that the discriminative keyword spotter outperforms the standard HMM method.
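The area under the ROC curve can be computed as the fraction of (positive, negative) pairs that are ranked correctly, which is the Wilcoxon-Mann-Whitney statistic. The sketch below computes it from confidence scores; it is an illustration of the evaluation measure only, and the function name `auc`, the example scores, and the choice to count ties as half a correct pair are assumptions made here.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve from raw confidence scores.

    pos_scores: scores for utterances that contain the keyword.
    neg_scores: scores for utterances that do not contain it.
    Equals the fraction of (positive, negative) pairs in which the
    positive utterance scores higher; ties count as half (assumption).
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Hypothetical scores: three of the four pairs are ranked correctly.
    print(auc([0.9, 0.8], [0.1, 0.85]))
```

Viewed this way, maximizing the area under the ROC curve amounts to asking that utterances containing the keyword receive higher confidence than utterances that do not.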
We conclude the thesis with a discussion, in which we present an extension of our work to full-blown large vocabulary speech recognition and language modeling.