Learning Topic Hierarchies and Thematic Annotations from Document Collections
Hermine Njike-Fotzo and Patrick Gallinari
In: Learning Methods for Text Understanding and Mining, 26 - 29 January 2004, Grenoble, France.
We study how to automatically structure a document collection by deriving a concept hierarchy from it and then automatically generating a document hierarchy from that concept hierarchy. The concept hierarchy relies on the discovery of “specialization/generalization” relations between the concepts that appear in the documents of a corpus. Concepts are automatically identified from the set of documents. The proposed method can also create “specialization/generalization” links between documents and document parts, making it a technique for the automatic creation of specific typed links between information parts. Such typed links have been advocated by different authors as a means of structuring and navigating collections. The method also associates with each document a set of themes representative of the main subjects treated in the document. The method is fully automatic, the hierarchies are extracted directly from the corpus, and it could be used for any document collection. It could also serve as a basis for a manual organization.