Fisher Linear Semi-discriminant Analysis for Speaker Diarization
Given an audio signal with an unknown number of speakers, speaker diarization aims to automatically answer the question ``who spoke when''. Crucial to the success of diarization is the distance metric between speech segments, which depends on the choice of feature space: distances should be low between segments of the same speaker and high between segments of different speakers. Starting from an MFCC-based feature space, an algorithm is proposed that finds a Fisher near-optimal linear discriminant subspace, adapted to the particular speakers present in the audio signal. The proposed approach relies on a semi-supervised version of Fisher Linear Discriminant analysis (\fld), leveraging the sequential structure of the audio signal as a substitute for the unknown speaker labels. The resulting algorithm is completely unsupervised, so no speaker labels are needed, either for the signal itself or for an independent training set. Eigenvalue perturbation theory is applied to provide optimality bounds with respect to \fld, showing the effectiveness of the approach under the assumption that speakers do not significantly modify the characteristics of their voice. A complete diarization system is then proposed, using fuzzy clustering, a non-parametric K-nearest neighbors classifier and a Hidden Markov Model. Experimental results show a major improvement in speaker diarization accuracy when using the subspace found by the proposed approach, compared with using the initial MFCC feature space or subspaces found by competing approaches.
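
To make the idea of replacing speaker labels with sequential structure concrete, the following is a minimal sketch and not the algorithm of this paper. It assumes that each short, fixed-length run of consecutive frames is spoken by a single speaker, so that the run index can stand in for the unknown label when estimating the within-class scatter. The function name `semi_discriminant_subspace` and the parameters `segment_len`, `n_dims` and `reg` are hypothetical, introduced only for this illustration.

```python
import numpy as np


def semi_discriminant_subspace(X, segment_len=50, n_dims=2, reg=1e-6):
    """Illustrative Fisher-style subspace from temporally ordered frames.

    X is an (n_frames, n_features) array of features in temporal order.
    ASSUMPTION (not from the paper): every block of `segment_len`
    consecutive frames belongs to one speaker and acts as a pseudo-class.
    """
    n_frames, n_feat = X.shape
    n_seg = n_frames // segment_len
    X = X[: n_seg * segment_len]  # drop the incomplete trailing block

    # Within-class scatter: deviations of frames from their block mean.
    blocks = X.reshape(n_seg, segment_len, n_feat)
    within = (blocks - blocks.mean(axis=1, keepdims=True)).reshape(-1, n_feat)
    S_w = within.T @ within

    # Total scatter, and between-class scatter as the remainder.
    total = X - X.mean(axis=0)
    S_t = total.T @ total
    S_b = S_t - S_w

    # Small ridge term so that S_w is safely positive definite.
    S_w = S_w + reg * np.trace(S_w) / n_feat * np.eye(n_feat)

    # Solve S_b w = lambda S_w w by whitening with the Cholesky factor of S_w.
    L = np.linalg.cholesky(S_w)
    A = np.linalg.solve(L, S_b)        # L^{-1} S_b
    M = np.linalg.solve(L, A.T).T      # L^{-1} S_b L^{-T}
    M = (M + M.T) / 2.0                # remove numerical asymmetry
    evals, V = np.linalg.eigh(M)       # eigenvalues in ascending order

    # Keep the directions with the largest Fisher ratios.
    order = np.argsort(evals)[::-1][:n_dims]
    W = np.linalg.solve(L.T, V[:, order])  # map back: w = L^{-T} v
    return W, evals[order]
```

Projecting the frames with `X @ W` gives features in which, under the stated assumption, frames from the same block lie close together relative to the overall spread. Blocks that straddle a speaker change violate the assumption and contaminate the within-class scatter estimate.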