PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Scalable corpus annotation by graph construction and label propagation
Thomas Lansdall-Welfare, Ilias Flaounas and Nello Cristianini
In: 1st International Conference on Pattern Recognition Applications and Methods, 6-8 Feb 2012, Vilamoura, Algarve, Portugal.

Abstract

The efficient annotation of documents in vast corpora calls for scalable methods of text classification. Representing documents as graph vertices, rather than as vectors in a bag-of-words space, allows the necessary information to be pre-computed and stored. It also fundamentally changes the problem definition, from a content-based to a relation-based classification problem. Efficiently creating a graph in which nearby documents are likely to have the same annotation is the central task of this paper. We compare the effectiveness of various approaches to graph construction by building graphs of 800,000 vertices based on the Reuters corpus, showing that relation-based classification is competitive with Support Vector Machines, which can be considered state of the art. We further show that combining our relation-based approach with Support Vector Machines improves on either method used alone.
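
The general idea of building a document graph and propagating labels over it can be illustrated with a minimal sketch. This is not the construction or propagation scheme evaluated in the paper: it assumes a brute-force k-nearest-neighbour graph over cosine similarity of term counts and a simple weighted-vote propagation with fixed seed labels. The function names (cosine, knn_graph, propagate), the toy documents and the labels are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_graph(docs, k):
    """Link each document to its k most similar documents.

    Brute force, O(n^2) comparisons: an assumption made for brevity,
    not a scalable construction.
    """
    vecs = [Counter(d.lower().split()) for d in docs]
    graph = {}
    for i, v in enumerate(vecs):
        sims = [(cosine(v, w), j) for j, w in enumerate(vecs) if j != i]
        sims.sort(reverse=True)
        # Keep only neighbours that share at least one term.
        graph[i] = [(j, s) for s, j in sims[:k] if s > 0]
    return graph


def propagate(graph, seeds, iterations=10):
    """Spread seed labels along weighted edges; seed labels stay fixed."""
    labels = dict(seeds)
    for _ in range(iterations):
        updated = dict(labels)
        for node, nbrs in graph.items():
            if node in seeds:
                continue
            votes = Counter()
            for j, weight in nbrs:
                if j in labels:
                    votes[labels[j]] += weight
            if votes:
                updated[node] = votes.most_common(1)[0][0]
        if updated == labels:
            break  # no label changed in this pass
        labels = updated
    return labels


# Toy corpus (hypothetical); documents 0 and 2 carry known annotations.
docs = [
    "oil prices rise as crude supply falls",
    "crude oil supply falls again",
    "bank raises interest rates",
    "interest rates rise at central bank",
    "oil supply concerns push crude prices",
]
graph = knn_graph(docs, k=2)
labels = propagate(graph, seeds={0: "energy", 2: "finance"})
print(labels)
```

Once the graph is stored, annotating a document only requires looking at its neighbours' labels rather than its content, which is the shift from content-based to relation-based classification that the abstract describes.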

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: Theory & Algorithms; Information Retrieval & Textual Information Access
ID Code: 8588
Deposited By: Thomas Lansdall-Welfare
Deposited On: 13 February 2012