Structure inference for Bayesian multisensory scene understanding
We investigate a solution to the problem of multi-sensor perception and tracking by formulating it in the framework of Bayesian model selection. Humans robustly associate multi-sensory data as appropriate, but previous theoretical work has focused largely on purely integrative cases, leaving segregation unaccounted for and unexploited by machine perception systems. We illustrate a unifying, Bayesian solution to multisensor perception and tracking which accounts for both integration and segregation by explicit probabilistic reasoning about data association in a temporal context. Explicit inference of multisensory data association may also be of intrinsic interest for higher level understanding of multisensory data. We illustrate this using a probabilistic model of audio-visual data in which unsupervised learning and inference provide automatic audio-visual detection and tracking of two human subjects, speech segmentation, and association of each conversational segment with the speaking person.