Authorship attribution of e-mail as a multi-class task
In: PAN at CLEF 2011, 21-22 Sep 2011, Amsterdam, The Netherlands.
In this paper, we describe a multi-class text categorization approach to authorship attribution and test it on collections of e-mails. The PAN 2011 competition data consists of e-mails of variable length written by a range of candidate authors, some of whom are represented by significantly longer or more numerous e-mails than others. Rather than constructing a separate classifier for each author to discriminate that author from all the others (i.e. binary classification), we adopt a multi-class scheme in which all authorship classes are learned simultaneously. We explore the effect of the choice of feature types and of the C parameter in the SVMmulticlass learning algorithm. Variable-length lexical features showed promising results; nevertheless, our authorship attribution approach only reached a mid-table position among the competitors, on both the SMALL and the LARGE test sets.
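As an illustration of the multi-class setup, the sketch below turns a handful of e-mails into one training file in which every author is an integer class label, so that all classes can be learned together. It is a minimal sketch under stated assumptions, not the feature pipeline used in the paper: "variable-length lexical features" is interpreted here as word n-grams of length 1 to `max_n` with relative-frequency values, and the function names `word_ngrams` and `to_svmlight` are hypothetical. The output follows the SVM-light style line format (`<class> <feature>:<value> ...`, feature ids in increasing order) that SVMmulticlass reads.

```python
from collections import Counter


def word_ngrams(text, max_n=3):
    """Count word n-grams of length 1..max_n (lowercased, whitespace-tokenised).

    Assumption: this stands in for 'variable-length lexical features'.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def to_svmlight(docs, max_n=3):
    """Convert (author, text) pairs into SVM-light style training lines.

    Each author gets one integer class id (1..k), so a single multi-class
    model is trained instead of one binary model per author.
    Returns (lines, author_ids, feature_ids).
    """
    author_ids = {}   # author name -> class id, numbered from 1
    feature_ids = {}  # n-gram -> feature id, numbered from 1
    lines = []
    for author, text in docs:
        label = author_ids.setdefault(author, len(author_ids) + 1)
        counts = word_ngrams(text, max_n)
        total = sum(counts.values()) or 1  # avoid division by zero on empty text
        feats = {}
        for gram, count in counts.items():
            fid = feature_ids.setdefault(gram, len(feature_ids) + 1)
            feats[fid] = count / total  # relative frequency (an assumption)
        # Feature ids must appear in increasing order on each line.
        body = " ".join(f"{fid}:{val:.6f}" for fid, val in sorted(feats.items()))
        lines.append(f"{label} {body}".rstrip())
    return lines, author_ids, feature_ids


if __name__ == "__main__":
    # Toy e-mails, invented purely for illustration.
    docs = [
        ("alice", "see you soon"),
        ("bob", "see you"),
        ("alice", "soon"),
    ]
    lines, authors, features = to_svmlight(docs, max_n=2)
    print("\n".join(lines))
```

The resulting lines could be written to a file and passed to the SVMmulticlass learner, where the trade-off parameter C is set at training time (to our knowledge via the `-c` option of `svm_multiclass_learn`). Feature ids at test time must come from the same `feature_ids` mapping built on the training data.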