PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Authorship attribution of e-mail as a multi-class task
Kim Luyckx
In: PAN at CLEF 2011, 21-22 Sep 2011, Amsterdam, The Netherlands.


In this paper, we describe a multi-class text categorization approach to authorship attribution and test it on collections of e-mails. The PAN 2011 competition data consists of e-mails of variable length written by a range of candidate authors, some of whom are represented by substantially more or longer e-mails than others. Rather than constructing a separate classifier for each author to discriminate that author from all the others (i.e. binary classification), we adopt a multi-class scheme in which all authorship classes are learned simultaneously. We explore the effect of the choice of feature types and of the C parameter in the SVMmulticlass learning algorithm. Variable-length lexical features showed promising results. Nevertheless, our authorship attribution approach ranked only mid-field among the competing systems, on both the SMALL and the LARGE test sets.
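As a rough illustration of how variable-length lexical features could feed a single multi-class learner, the sketch below counts word n-grams of every length from 1 to a maximum and writes each e-mail as one sparse line labelled with an integer author class. This is not the paper's implementation: reading "variable-length lexical features" as word n-grams is an assumption, the function names (`word_ngrams`, `build_vocabulary`, `to_sparse_line`) and the toy e-mails are hypothetical, and the `<class> <index>:<value>` layout assumes the SVM-light style input that SVMmulticlass is understood to read.

```python
from collections import Counter


def word_ngrams(text, max_n=3):
    """Count word n-grams of every length from 1 to max_n (assumed feature type)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def build_vocabulary(documents, max_n=3):
    """Map each n-gram seen in the training texts to a 1-based feature index."""
    vocab = {}
    for text in documents:
        for gram in word_ngrams(text, max_n):
            vocab.setdefault(gram, len(vocab) + 1)
    return vocab


def to_sparse_line(label, text, vocab, max_n=3):
    """Format one e-mail as '<class> <index>:<count> ...' with ascending indices."""
    counts = word_ngrams(text, max_n)
    pairs = sorted((vocab[g], c) for g, c in counts.items() if g in vocab)
    return " ".join([str(label)] + [f"{i}:{c}" for i, c in pairs])


# Hypothetical toy data: every author is one integer class in a single training file,
# in contrast to building one binary (author vs. rest) file per author.
train = [
    (1, "see you at the meeting"),
    (2, "please see the attached file"),
]
vocab = build_vocabulary([text for _, text in train], max_n=2)
lines = [to_sparse_line(label, text, vocab, max_n=2) for label, text in train]

for line in lines:
    print(line)
```

A file of such lines could then be given to a multi-class learner, with the C parameter (the trade-off between training error and margin) varied across runs to compare results, in the spirit of the parameter exploration described above.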

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Natural Language Processing; Information Retrieval & Textual Information Access
ID Code: 8687
Deposited By: Kim Luyckx
Deposited On: 19 February 2012