Intrinsic plagiarism detection using character trigram distance scores
Mike Kestemont, Kim Luyckx and Walter Daelemans
In: PAN at CLEF 2011, 21-22 Sep 2011, Amsterdam, The Netherlands.
In this paper, we describe a novel approach to intrinsic plagiarism detection. Each suspicious document is divided into a series of consecutive, potentially overlapping ‘windows’ of equal size. These are represented by vectors containing the relative frequencies of a predetermined set of high-frequency character trigrams. Subsequently, a distance matrix is set up in which each of the document’s windows is compared to each other window. The distance measure used is a symmetric adaptation of the normalized distance (nd1) proposed by Stamatatos . Finally, an algorithm for outlier detection in multivariate data (based on Principal Components Analysis) is applied to the distance matrix in order to detect plagiarized sections. In the PAN-PC-2011 competition, this system (second place) achieved a competitive recall (.4279) but only reached a plagdet of .1679 due to a disappointing precision (.1075).