Scalability Issues in Authorship Attribution
PhD thesis, University of Antwerp.
This dissertation is about authorship attribution, the task that aims to identify the author of a text, given a model of authorial style based on texts of known authorship. In computational authorship attribution, we do not rely on in-depth reading, but rather automate the process. We take a text categorization approach that combines computational analysis of writing style using Natural Language Processing with a Machine Learning algorithm to build a model of authorial style and attribute authorship to a previously unseen text.
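The text categorization setup described above can be illustrated with a minimal sketch. It is not the dissertation's feature set or learner: character trigram counts stand in for the stylistic analysis, and a nearest-profile rule based on cosine similarity stands in for the Machine Learning algorithm. The function names (char_ngrams, cosine, train, attribute) are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (an assumed stand-in for style features)."""
    text = " ".join(text.lower().split())  # lowercase and normalize whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_texts, n=3):
    """Build one aggregated profile per author from (author, text) pairs."""
    profiles = {}
    for author, text in labelled_texts:
        profiles.setdefault(author, Counter()).update(char_ngrams(text, n))
    return profiles


def attribute(profiles, text, n=3):
    """Return the candidate author whose profile is most similar to the text."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda author: cosine(profiles[author], query))
```

Calling train on a list of (author, text) pairs of known authorship and then attribute on a previously unseen text returns the best-matching candidate. With many candidate authors or very short texts, the profiles in a sketch like this become sparse and similar to one another, which is the kind of setting the dissertation sets out to examine.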
In traditional applications of authorship attribution - for instance, the investigation of disputed authorship or the analysis of literary style - we often find large sets of textual data of the same genre and small sets of candidate authors.
Most approaches can reliably attribute authorship in cases like these. However, the types of data that we find online require an approach that can deal with large sets of candidate authors, a wide variety of topics, and often very short texts. Even though the last decades of research have brought substantial innovation, most studies only scratch the surface of the task because they are limited to small and strictly controlled problem sets. As a result, it is uncertain how any of the proposed approaches will perform on a large scale. In addition, the often vague descriptions of experimental design and the underuse of objective evaluation criteria and benchmark data sets cause problems for the replicability and evaluation of some studies. Since most studies focus on quantitative evaluation of results but refrain from going into detail about the textual features used to attribute authorship, it is difficult to assess the quality of their approach.
In this dissertation, we investigate whether a commonly applied text categorization approach is viable for application on a large scale - for instance, in the detection of fraud or in social media analysis. In this context, scalability refers to the ability of a system to achieve consistent performance under various uncontrolled settings. We stress-test our approach by confronting it with various scalability issues and study its behavior in detail. By combining performance analysis with an in-depth analysis of features, we aim to gain more insight into the strengths and weaknesses of the approach.