PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Learning Topic Hierarchies and Thematic Annotations from Document Collections
Hermine Njike-Fotzo and Patrick Gallinari
In: Learning Methods for Text Understanding and Mining Workshop, 26-29 Jan 2004, Grenoble, France.

Abstract

Large textual and multimedia databases are now widely available, but their exploitation is restricted by the lack of meta-information about their structure and semantics. Many such collections, like those gathered by most search engines, are loosely structured. Some have been manually structured, at the cost of considerable effort. This is the case for hierarchies like those of internet portals (Yahoo, Open Directory, LookSmart, etc.) or of large collections like MEDLINE: documents are gathered into topics, which are themselves organized into a hierarchy going from the most general to the most specific [7]. Hypertext multimedia products are another example of structured collections: documents are usually grouped into different topics and subtopics, with links between the different entities. Generally speaking, structuring a collection makes it easier to navigate it, to access parts of its information, and to maintain and enrich it. Manual structuring requires a large amount of qualified human resources and can be performed only in the context of large collaborative projects, such as medical classification systems, or for specific commercial products. To help this process, it would be useful to rely on automatic or semi-automatic tools for structuring document collections. In this context, we study how to automatically structure collections by deriving concept hierarchies from a document collection, and how to automatically generate a document hierarchy from them. The concept hierarchy relies on the discovery of “specialization/generalization” relations between the concepts that appear in the documents of a corpus. Concepts are automatically identified from the set of documents. The proposed method can also create “specialization/generalization” links between documents and document parts. It is a technique for the automatic creation of specific typed links between information parts.
Such typed links have been advocated by different authors as a means of structuring and navigating collections. The method also associates with each document a set of themes representative of the main subjects treated in the document. The method is fully automatic, extracts the hierarchies directly from the corpus, and could be applied to any document collection. It could also serve as a basis for a manual organization. The paper is organized as follows. In section 2 we introduce previous related work. In section 3 we describe our algorithm for the automatic generation of typed “specialization/generalization” relations between concepts and documents, and of the corresponding hierarchies. In section 4 we propose numerical criteria for measuring the relevance of our method. Section 5 describes experiments performed on parts of the LookSmart and New Scientist hierarchies.

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Information Retrieval & Textual Information Access
ID Code: 563
Deposited By: Hermine Njike-Fotzo
Deposited On: 26 December 2004