A Methodology for Topographic Clustering of Structured Text Documents
Marie-Jeanne Lesot, Delphine Dard and Florence d'Alché-Buc
In: Learning Methods for Text Understanding and Mining, 26 - 29 January 2004, Grenoble, France.
Sets of texts are structured through a more or less refined hierarchy of sections, subsections and paragraphs. This structure carries information that should be exploited when handling such data, in particular to enrich the comparison of texts as a complement to the vector description of their contents. We propose a kernel-based methodology that follows this principle for a topographic clustering task, and define a hierarchical kernel that compares paragraphs using the available hierarchical decomposition and, in particular, the provided titles.
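
The abstract does not give the form of the hierarchical kernel. As a purely illustrative sketch, and not the authors' definition, one way to combine content and title information is a non-negative weighted sum of cosine similarities: one on the paragraph text and one per level of enclosing titles. Such a sum of kernels is itself a valid (positive semi-definite) kernel. The names `bow`, `cosine`, `hierarchical_kernel`, the weights and the fixed `n_levels` are all assumptions made for this example.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hierarchical_kernel(p, q, n_levels=2, w_content=0.5, w_titles=0.5):
    """Hypothetical kernel between two paragraphs.

    Each paragraph is a dict with:
      'text'   : the paragraph content
      'titles' : titles of the enclosing units, top level first
    Missing title levels are padded with empty strings (similarity 0).
    """
    k_content = cosine(bow(p["text"]), bow(q["text"]))

    def titles_of(par):
        padded = list(par["titles"][:n_levels])
        return padded + [""] * (n_levels - len(padded))

    # Average the title similarity over a fixed number of hierarchy levels.
    k_titles = sum(
        cosine(bow(tp), bow(tq)) for tp, tq in zip(titles_of(p), titles_of(q))
    ) / n_levels

    return w_content * k_content + w_titles * k_titles


# Example paragraphs: a and b share text; a and c share text but not titles.
a = {"text": "kernel methods for text clustering",
     "titles": ["Machine Learning", "Kernels"]}
b = {"text": "kernel methods for text clustering",
     "titles": ["Machine Learning", "Kernels"]}
c = {"text": "kernel methods for text clustering",
     "titles": ["Cooking", "Soups"]}

k_ab = hierarchical_kernel(a, b)
k_ac = hierarchical_kernel(a, c)
```

In this sketch, two paragraphs with identical wording are judged less similar when they sit under unrelated titles, which is the kind of effect the abstract attributes to exploiting the document hierarchy. The resulting kernel matrix could then be passed to any kernel-based clustering method.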