PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Highlighting Latent Structures in Text
Michèle Jardino
In: Learning Methods for Text Understanding and Mining, 26 - 29 January 2004, Grenoble, France.

Abstract

We have developed an original learning method for extracting latent structures from raw texts. The induced structure is a data-driven tree, which may be unbalanced. It is obtained from successive partitions of the texts into clusters, with the number of classes increasing from 2 to K; each quasi-optimal partition is computed with an adaptation of k-means clustering. The paths followed by the texts through the successive partitions form the edges of a directed graph whose nodes are the clusters. Studying these paths shows that some clusters remain identical across successive partitions, so a tree can be extracted from the graph by merging nodes and clipping edges. The method is illustrated on a corpus of 1,100 tourist information leaflets.
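The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the author's implementation: the abstract does not specify the adaptation of k-means, so plain Euclidean k-means with a deterministic farthest-point initialisation is assumed here, the texts are assumed to be already represented as numeric vectors (a toy 2-D data set stands in for the leaflets), and all function names are hypothetical. The sketch builds the partitions for k = 2..K, counts how many texts follow each edge between clusters of consecutive partitions, and flags clusters that stay identical from one partition to the next. The final merging and clipping step that yields the tree is not shown.

```python
from collections import Counter

def dist2(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50):
    """Plain k-means (an assumption; the paper uses its own adaptation)."""
    # Deterministic farthest-point initialisation.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def successive_partitions(points, max_k):
    """One partition per number of classes, from 2 to max_k."""
    return {k: kmeans(points, k) for k in range(2, max_k + 1)}

def path_edges(partitions):
    """Count the texts moving from cluster a at level k to cluster b at level k+1."""
    edges = Counter()
    ks = sorted(partitions)
    for k, k_next in zip(ks, ks[1:]):
        for a, b in zip(partitions[k], partitions[k_next]):
            edges[((k, a), (k_next, b))] += 1
    return edges

def stable_clusters(partitions, edges):
    """Pairs of nodes whose member sets are identical in consecutive partitions."""
    sizes = {(k, c): n
             for k, labels in partitions.items()
             for c, n in Counter(labels).items()}
    return [(src, dst) for (src, dst), n in edges.items()
            if n == sizes[src] == sizes[dst]]

# Toy stand-in for text vectors: three well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0),
          (5, 5), (5, 6), (6, 5),
          (20, 20), (20, 21), (21, 20)]

partitions = successive_partitions(points, max_k=3)
edges = path_edges(partitions)
stable = stable_clusters(partitions, edges)

for (src, dst), n in sorted(edges.items()):
    print(src, "->", dst, "texts:", n)
print("unchanged clusters:", stable)
```

In this toy run, a cluster that appears in `stable` corresponds to a pair of graph nodes that could be merged into a single tree node, while the remaining edges show where a cluster splits when the number of classes grows.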

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: Theory & Algorithms; Information Retrieval & Textual Information Access
ID Code: 26
Deposited By: Steve Gunn
Deposited On: 09 May 2004