Nonparametric mixed membership models using the IBP compound Dirichlet process
The central assumption of mixture modelling, namely that each data point belongs to exactly one of a finite or countable number of distributions, is often overly restrictive. In many real-life datasets, individual data points exhibit features associated with multiple clusters: a movie may contain elements of both romance and comedy, or an individual member of a population may exhibit traits from multiple subpopulations. Mixed membership models are a hierarchical variant of mixture models used for modelling grouped data, where each individual data point consists of a collection of observations. Rather than being assigned to a single component, each data point is associated with a distribution over components, allowing us to capture more complicated relationships between data points than is possible with a simple mixture model. One example of a dataset where a mixed membership assumption is appropriate is a corpus of text documents: each document is a data point and consists of a collection of words. In such an application, each component of the mixture model is a distribution over words, each document is associated with a distribution over these components, and each word is associated with a single component. This framework is often referred to as ‘topic modelling’, since we typically find that the posterior components (called ‘topics’) reflect the semantic themes of the documents.
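The three-level structure described above for a text corpus (components as distributions over words, a per-document distribution over components, and a single component per word) can be illustrated with a minimal generative sketch. This is a finite, parametric toy and not the nonparametric model named in the title: the symmetric Dirichlet priors, the fixed number of topics, and all names (`sample_corpus`, `n_topics`, `vocab_size`, and so on) are assumptions made purely for illustration.

```python
import numpy as np

def sample_corpus(n_docs=5, n_words=20, n_topics=3, vocab_size=10, seed=0):
    """Sample a toy corpus from a finite mixed membership model.

    Assumption: symmetric Dirichlet(1) priors on both levels, chosen only
    to make the sketch concrete.
    """
    rng = np.random.default_rng(seed)

    # Each component ("topic") is a distribution over the vocabulary.
    topics = rng.dirichlet(np.ones(vocab_size), size=n_topics)

    # Each document (data point) has its own distribution over components.
    doc_topic = rng.dirichlet(np.ones(n_topics), size=n_docs)

    assignments = np.empty((n_docs, n_words), dtype=int)
    words = np.empty((n_docs, n_words), dtype=int)
    for d in range(n_docs):
        # Each word is associated with exactly one component, drawn from
        # the document's distribution over components.
        assignments[d] = rng.choice(n_topics, size=n_words, p=doc_topic[d])
        for i, k in enumerate(assignments[d]):
            # The word itself is drawn from the assigned component.
            words[d, i] = rng.choice(vocab_size, p=topics[k])

    return topics, doc_topic, assignments, words


topics, doc_topic, assignments, words = sample_corpus()
print(doc_topic.round(2))   # one row of component weights per document
print(assignments[0])       # component index of each word in document 0
```

In a simple mixture model, every row of `doc_topic` would put all of its mass on a single component; the mixed membership assumption relaxes this by letting each row spread its mass over several components.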