PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Unsupervised Learning with Term Clustering for Thematic Segmentation of Texts
Marc Caillet, Jean-François Pessiot, Massih Amini and Patrick Gallinari
In: RIAO 2004, 26-28 Apr 2004, Avignon, France.


In this paper we introduce a machine learning approach to automatic text segmentation. Our text segmenter clusters text segments that contain similar concepts. It first discovers the different concepts present in a text, each concept being defined as a set of representative terms. The text is then partitioned into coherent paragraphs using a clustering technique based on the Classification Maximum Likelihood approach. We evaluate the effectiveness of this technique on sets of concatenated paragraphs from two collections, the 7sectors and 20 Newsgroups corpora, and compare it to a baseline text segmentation technique proposed by Salton et al.
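
The two-stage idea in the abstract (group terms into concepts, then group paragraphs by the concepts they contain) can be illustrated with a small sketch. This is not the authors' implementation: the smoothed multinomial mixture, the whitespace tokenisation, the contiguous-block initialisation, and the names cem_multinomial, segment, n_concepts and n_themes are all assumptions made for illustration. The only element taken from the abstract is the use of a classification (hard-assignment) likelihood criterion for clustering.

```python
import math
from collections import Counter


def cem_multinomial(vectors, k, max_iter=50):
    """Hard-assignment (classification) EM for a smoothed multinomial mixture.

    Assumption: a multinomial model with Laplace smoothing; the paper's
    actual model may differ.
    """
    n, d = len(vectors), len(vectors[0])
    # Deterministic start: contiguous blocks of roughly equal size.
    labels = [i * k // n for i in range(n)]
    for _ in range(max_iter):
        # M-step: smoothed cluster priors and per-cluster feature distributions.
        sizes = Counter(labels)
        log_prior = [math.log((sizes[c] + 1) / (n + k)) for c in range(k)]
        counts = [[1.0] * d for _ in range(k)]
        for vec, c in zip(vectors, labels):
            for j, x in enumerate(vec):
                counts[c][j] += x
        log_theta = [[math.log(x / sum(row)) for x in row] for row in counts]
        # Classification step: each vector goes to its most likely cluster.
        new_labels = []
        for vec in vectors:
            scores = [
                log_prior[c] + sum(x * log_theta[c][j] for j, x in enumerate(vec))
                for c in range(k)
            ]
            new_labels.append(max(range(k), key=scores.__getitem__))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def segment(paragraphs, n_concepts, n_themes):
    """Hypothetical pipeline: term clustering first, then paragraph clustering."""
    docs = [p.lower().split() for p in paragraphs]
    vocab = sorted({t for doc in docs for t in doc})
    # Stage 1: describe each term by its count in every paragraph, cluster terms.
    term_vectors = [[doc.count(t) for doc in docs] for t in vocab]
    concept_of = dict(zip(vocab, cem_multinomial(term_vectors, n_concepts)))
    # Stage 2: describe each paragraph by its counts per concept, cluster paragraphs.
    para_vectors = []
    for doc in docs:
        vec = [0] * n_concepts
        for t in doc:
            vec[concept_of[t]] += 1
        para_vectors.append(vec)
    return cem_multinomial(para_vectors, n_themes)


if __name__ == "__main__":
    toy = [
        "stocks fell as markets closed",
        "markets and stocks rallied",
        "the team won the match",
        "the match ended and the team left",
    ]
    print(segment(toy, n_concepts=2, n_themes=2))
```

Paragraphs that receive the same label are treated as belonging to the same theme, so a segment boundary would fall wherever consecutive labels differ. Hard-assignment clustering of this kind depends on its starting partition, so a practical version would likely try several initialisations.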

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: Information Retrieval & Textual Information Access
ID Code: 2644
Deposited By: Massih Amini
Deposited On: 22 November 2006