A Large Margin Algorithm for Speech and Audio Segmentation

Abstract

We describe and analyze a discriminative algorithm for learning to segment an audio signal given a sequence of events that tags the signal. We demonstrate the applicability of our method through the tasks of speech phoneme segmentation and music-to-score alignment. In the former task, the events that tag the speech signal are phonemes and in the latter task, the events are musical notes. Our goal is to learn a segmentation function whose input is an audio signal along with its accompanying event sequence and its output is a timing sequence representing the actual start time of each event in the audio signal. Generalizing the notion of separation with a margin used in support vector machines (SVM) for binary classification, we cast the learning task as the problem of finding a direction vector in an abstract vector-space. To do so, we devise a mapping of the input signal and the event sequence along with any possible timing sequence into an abstract vector-space. Thus, each possible timing sequence corresponds to a vector in the vector-space, and the predicted timing sequence is the one whose projection onto a direction vector in this vector-space is maximal. We set the direction vector to be the solution of a minimization problem with a large set of constraints. Each constraint enforces a gap between the projection of the correct target timing sequence and the projection of an alternative, incorrect, timing sequence onto the direction vector. Despite the large number of constraints, we provide a simple iterative algorithm for efficiently learning the direction vector and analyze the formal properties of the resulting learning algorithm. We experiment with our learning algorithm in applications of phonetic segmentation and music-to-score alignment by comparing its performance to the results obtained by a generative hidden Markov model (HMM) for segmentation.
Our experiments indicate that the discriminative algorithm significantly outperforms the commonly used HMM-based approach.
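
The prediction rule and the margin constraints described above can be illustrated with a minimal Python sketch. It is not the implementation from the paper. The feature map `feature_map`, the toy one-dimensional `signal`, the weight vector `w`, and the `gap` value are all assumptions made up for illustration, and candidate timing sequences are enumerated exhaustively, which is only feasible for very small inputs.

```python
from itertools import combinations


def feature_map(signal, events, timing):
    """Hypothetical feature map phi(signal, events, timing); not the paper's.

    Feature 0: sum of absolute signal changes at each event boundary.
    Feature 1: negated within-segment absolute deviation from the segment mean.
    The event labels are unused in this toy map.
    """
    bounds = list(timing) + [len(signal)]
    jump = sum(abs(signal[t] - signal[t - 1]) for t in timing if t > 0)
    spread = 0.0
    for start, end in zip(bounds[:-1], bounds[1:]):
        segment = signal[start:end]
        mean = sum(segment) / len(segment)
        spread += sum(abs(v - mean) for v in segment)
    return [jump, -spread]


def score(w, phi):
    """Projection of the feature vector phi onto the direction vector w."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def candidate_timings(n_frames, n_events):
    """All strictly increasing start-time sequences, first event at frame 0."""
    for rest in combinations(range(1, n_frames), n_events - 1):
        yield (0,) + rest


def predict(w, signal, events):
    """Return the timing sequence whose projection onto w is maximal."""
    return max(
        candidate_timings(len(signal), len(events)),
        key=lambda y: score(w, feature_map(signal, events, y)),
    )


def violated_constraints(w, signal, events, target, gap=1.0):
    """List incorrect timings that score within `gap` of the target timing."""
    target_score = score(w, feature_map(signal, events, target))
    return [
        y
        for y in candidate_timings(len(signal), len(events))
        if y != target
        and target_score - score(w, feature_map(signal, events, y)) < gap
    ]


# Toy data (assumed): three flat regions starting at frames 0, 3 and 6.
signal = [0.0, 0.1, 0.0, 1.0, 1.1, 1.0, 3.0, 3.1]
events = ["a", "b", "c"]
target = (0, 3, 6)
w = [1.0, 1.0]

predicted = predict(w, signal, events)
print(predicted)
print(violated_constraints(w, signal, events, target))
```

With this hand-picked `w`, every incorrect timing sequence is separated from the target by more than the chosen gap, so no constraint is violated. With an all-zero `w`, every alternative violates its constraint, which is the situation a learning procedure would have to correct.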