Audio Visual Speech Enhancement
This thesis presents a novel approach to speech enhancement by exploiting the bimodality of speech production and the correlation that exists between audio and visual speech information. An analysis into the correlation of a range of audio and visual features reveals significant correlation to exist between visual speech features and audio filterbank features. The amount of correlation was also found to be greater when the correlation is analysed with individual phonemes rather than across all phonemes. This led to building a Gaussian Mixture Model (GMM) that is capable of estimating filterbank features from visual features. Phoneme-specific GMMs gave lower filterbank estimation errors and a phoneme transcription is decoded using audio-visual Hidden Markov Model (HMM). Clean filterbank estimates along with mean noise estimates were then utilised to construct visually-derived Wiener filters that are able to enhance noisy speech. The mean noise estimates were computed from non-speech periods, identified by an audio-visual speech activity detection system proposed in this work. Subjective and objective speech quality evaluation was carried out and the visually-derived Wiener filtering was shown to be a powerful speech enhancement method.