PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Unsupervised Learning with Term Clustering for Thematic Segmentation of Texts.
Marc Caillet, Jean-François Pessiot, Massih Amini and Patrick Gallinari
In: RIAO 2004, 26-28 April 2004, Toulouse, France.

Abstract

In this paper we introduce a machine learning approach to automatic text segmentation. Our text segmenter clusters text segments that contain similar concepts. It first discovers the different concepts present in a text, each concept being defined as a set of representative terms. The text is then partitioned into coherent paragraphs using a clustering technique based on the Classification Maximum Likelihood approach. We evaluate the effectiveness of this technique on sets of concatenated paragraphs from two collections, the 7sectors and the 20 Newsgroups corpora, and compare it to a baseline text segmentation technique proposed by Salton et al.
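
The clustering step mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify the model, so the code below is not the authors' implementation: it assumes a generic classification EM loop (hard assignments instead of soft posteriors) over a multinomial mixture, and it assumes each paragraph has already been reduced to a vector of counts, one per concept. The function name `cem_cluster`, the `smoothing` parameter, the block initialisation and the toy data are all hypothetical choices made for illustration.

```python
import math


def cem_cluster(docs, k, n_iter=20, smoothing=1.0):
    """Hard-assignment (classification) EM for a multinomial mixture.

    docs: list of equal-length count vectors (assumed: one count per concept).
    Returns one cluster label per vector.
    """
    n, v = len(docs), len(docs[0])
    # Assumed initialisation: split the sequence into k contiguous blocks.
    labels = [i * k // n for i in range(n)]
    for _ in range(n_iter):
        # M-step: smoothed priors and per-cluster distributions over concepts.
        log_prior, log_theta = [], []
        for c in range(k):
            members = [docs[i] for i in range(n) if labels[i] == c]
            log_prior.append(
                math.log((len(members) + smoothing) / (n + k * smoothing))
            )
            totals = [sum(d[j] for d in members) + smoothing for j in range(v)]
            z = sum(totals)
            log_theta.append([math.log(t / z) for t in totals])
        # E-step and C-step: give each vector to its most probable cluster.
        new_labels = []
        for d in docs:
            scores = [
                log_prior[c] + sum(x * lt for x, lt in zip(d, log_theta[c]))
                for c in range(k)
            ]
            new_labels.append(max(range(k), key=scores.__getitem__))
        if new_labels == labels:  # the partition no longer changes
            break
        labels = new_labels
    return labels


# Toy data (made up): paragraphs alternate between two topics.
paragraphs = [
    [3, 2, 0, 0],
    [0, 1, 3, 2],
    [2, 3, 1, 0],
    [0, 0, 2, 3],
    [4, 2, 0, 0],
    [0, 0, 3, 3],
]
print(cem_cluster(paragraphs, k=2))
```

On this toy input the paragraphs dominated by the first two concepts share one label and the remaining paragraphs share the other. A hard-assignment loop like this can stall in a poor partition when the initial clusters are symmetric, so the initialisation matters in practice.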

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Information Retrieval & Textual Information Access
ID Code: 432
Deposited By: Massih Amini
Deposited On: 22 December 2004