Vision for Multimodal Conversational Interfaces
In: Machine Learning Meets the User Interface, 12 December 2003, Whistler, Canada.
A classic goal for human-computer interfaces is the conversational computer, which can interact freely with users through natural dialog. Advances in speech and language processing have made systems for single-user conversation with a close-talking microphone almost commonplace, but when multiple speakers and noisy conditions are encountered, more information is needed. Visual cues can make conversational interfaces feasible in these environments, providing robust cues for speech and critical information about turn-taking, intent, and physical reference. In this talk I'll review recent research in our lab on computer vision techniques that can provide these cues. I'll describe work in progress on estimating pose-invariant mouth features for visual speechreading; head tracking for inferring turn-taking cues, agreement gestures, and conversational intent; and, finally, body pose estimation for resolving object deixis. All of our systems are designed for untethered interaction with multiple users in noisy environments and to work in real time.