PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Mapping Documents onto Web Page Ontology
Dunja Mladenić and Marko Grobelnik
In: Web Mining: From Web to Semantic Web. Lecture Notes in Computer Science, Vol. 3209 (Lecture Notes in Artificial Intelligence, Vol. 3209). (2004) Springer-Verlag Heidelberg, Heidelberg, pp. 77-96. ISBN 3-540-23258-3

Abstract

The paper describes an approach to automatically mapping Web pages onto an ontology using document classification based on the Yahoo! ontology of Web pages. Techniques developed for learning on text data are used here on the hierarchical classification structure (ontology of Web documents). The high number of features is reduced by taking into account the hierarchical structure and using feature subset selection developed for the Naive Bayesian classifier. We focus on data sets with many features that also have a highly unbalanced class distribution. Documents are represented as word-vectors that include word sequences of up to five consecutive words. Based on the hierarchical structure, the problem is divided into subproblems, each representing one of the categories included in the Yahoo! hierarchy. The resulting model is a set of independent classifiers, each used to predict the probability that a new document is a member of the corresponding category, represented as a node in the hierarchy. Our example problem is automatic document categorization, where we want to identify documents relevant to the selected category. Usually, only about 1%-10% of examples belong to the selected category. Experimental evaluation on real-world data shows that the proposed approach gives good results. Our experimental comparison of eleven feature scoring measures shows that considering data and algorithm characteristics significantly improves the performance.
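
The representation and model structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function and class names (ngram_features, BinaryNaiveBayes, train_hierarchy), the whitespace tokenization, and the Laplace smoothing are all assumptions made for illustration, and the feature subset selection step is omitted. The sketch builds word-vectors from word sequences of up to five consecutive words and trains one independent Naive Bayes classifier per category.

```python
import math
from collections import Counter


def ngram_features(text, max_n=5):
    """Count word sequences of 1..max_n consecutive words (hypothetical helper)."""
    words = text.lower().split()  # assumption: simple whitespace tokenization
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            feats[" ".join(words[i:i + n])] += 1
    return feats


class BinaryNaiveBayes:
    """Multinomial Naive Bayes for one category vs. the rest (illustrative only)."""

    def fit(self, docs, labels):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for feats, y in zip(docs, labels):
            self.counts[y].update(feats)
            self.n_docs[y] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        self.totals = {c: sum(self.counts[c].values()) for c in (True, False)}
        return self

    def predict_proba(self, feats):
        """Estimated probability that the document belongs to the category."""
        total_docs = self.n_docs[True] + self.n_docs[False]
        log_score = {}
        for c in (True, False):
            # Smoothed class prior; matters when only 1%-10% of examples are positive.
            score = math.log((self.n_docs[c] + 1) / (total_docs + 2))
            denom = self.totals[c] + len(self.vocab)  # Laplace smoothing (assumption)
            for f, k in feats.items():
                if f in self.vocab:  # features unseen in training are ignored
                    score += k * math.log((self.counts[c][f] + 1) / denom)
            log_score[c] = score
        # Normalize in log space to avoid underflow.
        m = max(log_score.values())
        exp = {c: math.exp(s - m) for c, s in log_score.items()}
        return exp[True] / (exp[True] + exp[False])


def train_hierarchy(docs, doc_categories, categories):
    """Train one independent classifier per category node."""
    feats = [ngram_features(d) for d in docs]
    return {
        cat: BinaryNaiveBayes().fit(feats, [cat in cs for cs in doc_categories])
        for cat in categories
    }


# Toy usage with made-up documents and category names.
docs = ["machine learning text", "football match score", "learning from text data"]
doc_categories = [{"Science"}, {"Sports"}, {"Science"}]
models = train_hierarchy(docs, doc_categories, ["Science", "Sports"])
query = ngram_features("text learning")
scores = {cat: model.predict_proba(query) for cat, model in models.items()}
```

Each entry of scores is an independent estimate, so the values need not sum to one across categories; a document can be judged relevant to several nodes of the hierarchy or to none.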

EPrint Type: Book Section
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Information Retrieval & Textual Information Access
ID Code: 840
Deposited By: Marko Grobelnik
Deposited On: 01 January 2005