PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Shallow Text Analysis and Machine Learning for Authorship Attribution
Kim Luyckx and Walter Daelemans
In: Computational Linguistics in the Netherlands 2004. Selected papers from the fifteenth CLIN meeting. LOT Occasional Series (4). (2005) LOT, Utrecht, The Netherlands, pp. 149-160. ISBN 90 76864 91 8

Abstract

Current advances in shallow parsing and machine learning allow us to use results from these fields in a methodology for Authorship Attribution. We report on experiments with a corpus that consists of newspaper articles about national current affairs by different journalists from the Belgian newspaper De Standaard. Because the documents are in a similar genre, register, and range of topics, token-based (e.g., sentence length) and lexical features (e.g., vocabulary richness) can be kept roughly constant over the different authors. This allows us to focus on the use of syntax-based features as possible predictors for an author’s style, as well as on those token-based features that are more predictive of author style than of topic or register. These style characteristics are not under the author’s conscious control and are therefore good clues for Authorship Attribution. Machine Learning methods (TiMBL and the WEKA software package) are used to select informative combinations of syntactic, token-based and lexical features and to predict authorship of unseen documents. The combination of these features can be considered an implicit profile that characterizes the style of an author.
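
As an illustration of the two example features named in the abstract (sentence length as a token-based feature, vocabulary richness as a lexical feature), the following minimal Python sketch computes a mean sentence length and a type-token ratio for a text. It is not the authors' pipeline, which relies on shallow parsing together with TiMBL and WEKA. The function names, the regex-based tokenization, and the choice of type-token ratio as the richness measure are all assumptions made for this example.

```python
import re

# Hypothetical helpers for illustration only; not taken from the paper.

def sentence_lengths(text):
    """Split on '.', '!' and '?' and count word tokens per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(re.findall(r"\w+", s)) for s in sentences]

def mean_sentence_length(text):
    """Token-based feature: average number of word tokens per sentence."""
    lengths = sentence_lengths(text)
    return sum(lengths) / len(lengths) if lengths else 0.0

def type_token_ratio(text):
    """Lexical feature: distinct lowercased words divided by total words.

    Assumption: type-token ratio stands in for 'vocabulary richness'.
    """
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def feature_vector(text):
    """Combine both features into one vector per document."""
    return [mean_sentence_length(text), type_token_ratio(text)]

if __name__ == "__main__":
    print(feature_vector("The cat sat. The dog ran away!"))
```

Vectors of this kind, one per document and labelled with the author, could then be passed to any classifier. A realistic setup would add the syntax-based features the abstract emphasizes.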

EPrint Type: Book Section
Subjects: Learning/Statistics & Optimisation
Natural Language Processing
Information Retrieval & Textual Information Access
ID Code: 4945
Deposited By: Kim Luyckx
Deposited On: 24 March 2009