Document Clustering in a Learned Concept Space
PhD thesis, University Pierre and Marie Curie.
Document clustering is a fundamental technique for unsupervised learning from unstructured textual data, and it can make various information retrieval (IR) tasks considerably more efficient. Clustering results are used not only as basic information about the structure of a collection, but also as a preliminary step for other IR applications. Probabilistic models, in turn, provide a useful framework for data analysis in unsupervised learning. They can be used as dimensionality reduction techniques that give a compact representation of a collection, or as clustering techniques. Among these models, topic models in particular have developed rapidly and become popular tools.
In this thesis, we aim to develop effective clustering techniques that find meaningful reduced spaces in which document clustering can be performed more efficiently than in the initial bag-of-words space. To this end, we develop four clustering approaches for text collections using probabilistic models, and more precisely topic models. In particular, we integrate the dimensionality reduction induced by latent variables, which together form a concept space, and perform clustering in that space. Our experimental results confirm that these approaches are successful in terms of clustering accuracy on several data collections.
This thesis is structured in two parts. The first part presents the state of the art in clustering and probabilistic models, and the second part presents our contributions. We first develop a two-stage clustering method that makes use of a concept space. Inspired by its success, we then develop three clustering approaches based on probabilistic latent semantic analysis (PLSA). The Ext-PLSA model extends the previous approach by combining the two stages into a single process. The CS-PLSA algorithm allows effective model selection for clustering. Finally, Voted-PLSA provides a successful multi-view clustering procedure for a multilingual collection.
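
The idea of clustering in a concept space rather than in the bag-of-words space can be illustrated with a minimal sketch. It is not the method developed in the thesis. It assumes that the concept space is given by the PLSA document-topic probabilities P(z|d), fitted with plain EM, and that the second stage is ordinary k-means with a farthest-first initialisation. The function names `plsa` and `kmeans`, the toy corpus, and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit PLSA by EM; counts[d][w] is n(d, w). Returns (P(z|d), P(w|z))."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalise(row):
        total = sum(row)
        return [x / total for x in row]

    # Random (assumed) initialisation of both parameter tables.
    p_z_d = [normalise([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalise([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        # Small floor keeps every row normalisable.
        new_z_d = [[1e-12] * n_topics for _ in range(n_docs)]
        new_w_z = [[1e-12] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
                post = normalise([p_z_d[d][z] * p_w_z[z][w] + 1e-300
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    new_z_d[d][z] += counts[d][w] * post[z]
                    new_w_z[z][w] += counts[d][w] * post[z]
        # M-step: renormalise the expected counts.
        p_z_d = [normalise(row) for row in new_z_d]
        p_w_z = [normalise(row) for row in new_w_z]
    return p_z_d, p_w_z


def kmeans(points, k, n_iter=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points,
                       key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Toy corpus (hypothetical): two sports and two finance documents.
docs = [
    "match goal team goal",
    "team match player goal",
    "bank market stock bank",
    "stock market bank price",
]
vocab = sorted({w for doc in docs for w in doc.split()})
counts = [[doc.split().count(w) for w in vocab] for doc in docs]

# Stage 1: map each document to the concept space P(z | d).
doc_topics, _ = plsa(counts, n_topics=2)
# Stage 2: cluster the documents in that reduced space.
labels = kmeans(doc_topics, k=2)
```

Each document is represented here by a two-dimensional vector instead of one dimension per vocabulary word, which is the sense in which clustering takes place in a reduced space. On a corpus this small one would expect the sports and finance documents to fall into separate clusters, but EM only reaches a local optimum, so the outcome depends on the initialisation.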