In: Pattern Recognition and Machine Learning in Computer Vision Workshop, 3-5 May 2004, Grenoble, France.
A simple and efficient way to model many kinds of image and video data is to decompose them into a set of 2-dimensional objects arranged in layers. Each object is characterized by its shape and appearance (as with a "sprite" in computer graphics). Following earlier work on layer decompositions in computer vision (e.g. Wang and Adelson, 1994), Frey and Jojic (1999) formulated the sprite-learning problem as transformation-invariant clustering using mixture models and EM. This was later extended (Jojic and Frey, 2001) to learning multiple sprites/objects from a video sequence. Building knowledge about the allowable transformations into the clustering algorithm is an important example of how a machine learning algorithm (here, clustering) needs to be tailored to the computer vision domain. Frey and Jojic's approach to learning multiple sprites uses variational inference simultaneously on all sprites; we also discuss recent work by Williams and Titsias (2004), who describe a greedy sequential algorithm for this task.
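To make the idea of transformation-invariant clustering concrete, the following is a minimal sketch of EM for a mixture model in which each data point has two latent variables: a cluster index and a transformation. It is not the model of Frey and Jojic; it rests on several simplifying assumptions chosen for illustration: the data are 1-D signals rather than images, the allowable transformations are cyclic shifts with a uniform prior, and the noise is isotropic Gaussian with a fixed, hand-set standard deviation. The function name `tic_em` and its parameters are hypothetical.

```python
import numpy as np

def tic_em(X, K, n_iter=30, sigma=0.5, seed=0):
    """Illustrative EM for a mixture of K prototypes with a latent cyclic shift.

    Assumptions (not from the source): 1-D signals, cyclic shifts as the only
    transformations, uniform prior over shifts, fixed isotropic noise sigma.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialise the prototypes from K distinct data points.
    mu = X[rng.choice(N, K, replace=False)].copy()
    pi = np.full(K, 1.0 / K)
    # unshifted[n, t] is x_n with a shift of t undone. Because a cyclic shift
    # is a permutation, ||x - roll(mu, t)||^2 == ||roll(x, -t) - mu||^2.
    unshifted = np.stack([np.roll(X, -t, axis=1) for t in range(D)], axis=1)
    for _ in range(n_iter):
        # E-step: joint posterior over (cluster k, shift t) for each point.
        diff = unshifted[:, None, :, :] - mu[None, :, None, :]  # (N, K, T, D)
        log_r = np.log(pi)[None, :, None] - 0.5 * (diff ** 2).sum(-1) / sigma ** 2
        log_r -= log_r.max(axis=(1, 2), keepdims=True)  # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=(1, 2), keepdims=True)
        # M-step: average the un-shifted data, weighted by responsibilities.
        Nk = r.sum(axis=(0, 2)) + 1e-12  # small floor avoids division by zero
        mu = np.einsum('nkt,ntd->kd', r, unshifted) / Nk[:, None]
        pi = Nk / Nk.sum()
    return mu, pi, r
```

Calling `tic_em(X, K=2)` on an `(N, D)` array returns the learned prototypes, the mixing proportions, and the responsibilities of shape `(N, K, D)`. The key design point is that the transformation is marginalised inside the E-step, so differently shifted copies of the same pattern can contribute to a single prototype rather than being spread over several clusters. Like any EM procedure, this sketch can converge to a poor local optimum depending on the initialisation.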