Highlighting Latent Structures in Text
In: Learning Methods for Text Understanding and Mining, 26 - 29 January 2004, Grenoble, France.
We have developed an original learning method to extract latent structures from raw texts. The induced structure is a data-driven tree, which may be unbalanced. It is obtained from successive partitions of the texts into clusters, with the number of classes increasing from 2 to K; each quasi-optimal partition is computed with an adaptation of k-means clustering. The paths of the texts through the successive partitions form the edges of a directed graph whose nodes are the clusters. Studying these paths shows that some clusters remain identical across successive partitions, so that a tree can be extracted from the graph by merging nodes and clipping edges. The method is illustrated on a corpus of 1,100 tourist information leaflets.
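
The graph construction and tree extraction summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the partitions are taken as given label lists (the adapted k-means step is not reproduced), a cluster is identified by its set of text indices so that identical clusters merge into one node, and "clipping" is assumed to mean keeping only the heaviest incoming edge of each node. The function names and the toy data are hypothetical.

```python
from collections import Counter

def clusters_of(labels):
    """Group text indices by cluster label; each cluster is a frozenset."""
    groups = {}
    for text_id, label in enumerate(labels):
        groups.setdefault(label, set()).add(text_id)
    return {label: frozenset(members) for label, members in groups.items()}

def build_path_graph(partitions):
    """Return weighted directed edges {(source, target): number of texts}.

    Nodes are frozensets of text indices, so a cluster that reappears
    unchanged in a later partition maps to the same node (node merging).
    """
    edges = Counter()
    for coarse, fine in zip(partitions, partitions[1:]):
        coarse_sets, fine_sets = clusters_of(coarse), clusters_of(fine)
        for a, b in zip(coarse, fine):
            src, dst = coarse_sets[a], fine_sets[b]
            if src != dst:  # identical clusters are one node: no self-loop
                edges[(src, dst)] += 1
    return edges

def clip_to_tree(edges):
    """Assumed clipping rule: keep each node's heaviest incoming edge.

    Returns a {child: parent} mapping; top-level clusters have no entry.
    """
    best = {}
    for (src, dst), weight in edges.items():
        if dst not in best or weight > best[dst][1]:
            best[dst] = (src, weight)
    return {dst: src for dst, (src, _) in best.items()}

# Toy data (hypothetical): 6 texts partitioned into 2, 3, then 4 classes.
partitions = [
    [0, 0, 0, 0, 1, 1],  # k = 2
    [0, 0, 1, 1, 2, 2],  # k = 3
    [0, 0, 1, 2, 3, 3],  # k = 4
]
tree = clip_to_tree(build_path_graph(partitions))
for child, parent in sorted(tree.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(parent), "->", sorted(child))
```

In this toy case the cluster {4, 5} stays identical in all three partitions and therefore appears as a single leaf directly under the top level, while {0, 1, 2, 3} keeps splitting, which gives an unbalanced tree. With real, non-nested partitions the clipping rule would need more care than this sketch takes.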