The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling
Sinead Williamson, Chong Wang, Katherine Heller and David Blei
In: ICML 2010, Haifa, Israel (2010).
The hierarchical Dirichlet process (HDP) is
a Bayesian nonparametric mixed membership
model: each data point is modeled as a collection of components in different proportions.
Though powerful, the HDP assumes
that the probability of a component being exhibited by a data point is positively correlated with
its proportion within that data point. This assumption
can be undesirable. For example, in
topic modeling, a topic (component) might be
rare throughout the corpus but dominant within
those documents (data points) where it occurs.
We develop the IBP compound Dirichlet process
(ICD), a Bayesian nonparametric prior that decouples across-data prevalence and within-data
proportion in a mixed membership model. The
ICD combines properties from the HDP and the
Indian buffet process (IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a subset of the shared mixture components
to each data point. This subset, the data point’s
“focus”, is determined independently of the
amount that each of its components contributes.
We develop an ICD mixture model for text, the
focused topic model (FTM), and show superior
performance over the HDP-based topic model.
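
The decoupling described above can be illustrated with a minimal generative sketch. The abstract does not give the construction, so everything specific here is an assumption made for illustration: the finite truncation to `num_topics` components, the stick-breaking form of the IBP weights, the Gamma-distributed relative masses `phi`, and the hypothetical function name `sample_icd_proportions` with its parameters `alpha` and `gamma`. The point is only that the binary focus (which components a data point uses) is drawn separately from the proportions within that focus.

```python
import numpy as np

def sample_icd_proportions(num_docs, num_topics, alpha=3.0, gamma=1.0, seed=0):
    """Truncated, illustrative sampler: focus and proportions drawn separately."""
    rng = np.random.default_rng(seed)

    # Assumed stick-breaking IBP weights: pi_k = prod_{j<=k} mu_j, mu_j ~ Beta(alpha, 1).
    # pi_k is the across-data prevalence of component k.
    mu = rng.beta(alpha, 1.0, size=num_topics)
    pi = np.cumprod(mu)

    # Assumed relative masses phi_k ~ Gamma(gamma, 1), drawn independently of pi.
    # These govern within-data proportions.
    phi = rng.gamma(gamma, 1.0, size=num_topics)

    # Binary focus matrix: component k is switched on in a data point with prob. pi_k.
    focus = rng.random((num_docs, num_topics)) < pi

    # Proportions: a Dirichlet restricted to the components in each focus.
    theta = np.zeros((num_docs, num_topics))
    for m in range(num_docs):
        active = np.flatnonzero(focus[m])
        if active.size == 0:
            continue  # empty focus: leave the row at zero in this sketch
        theta[m, active] = rng.dirichlet(phi[active])
    return pi, phi, focus, theta

pi, phi, focus, theta = sample_icd_proportions(num_docs=50, num_topics=20)
```

Under these assumptions, a component with small `pi[k]` appears in few data points, yet a large `phi[k]` can still make it dominant wherever it does appear, which is the rare-but-dominant behavior the abstract motivates.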