PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Some observations on the applicability of normalized compression distance to stemmatology
Toni Merivuori and Teemu Roos
In: 2nd Workshop on Information Theoretic Methods in Science and Engineering (WITMSE-09), 17-19 Aug 2009, Tampere, Finland.

Abstract

The objective of stemmatology is to construct a family tree of documents that have been generated by a process of repeated duplication and modification. In earlier benchmark experiments on computer-assisted stemmatology, the CompLearn software package was found to perform well on simpler test cases, but it failed to give satisfactory results on a more complex and realistic data set. This was surprising, given its excellent results in related phylogenetic tasks, where it was able to reconstruct accurate family trees of biological species from their genome sequences. We suggest that the failure on the complex stemmatological data set is due to difficulties in handling missing data. This explains many features of the incorrect solution produced by CompLearn, and leads to a simple random imputation strategy for filling in the missing values. The strategy is shown to improve performance by a large margin.
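
For illustration only, the two ingredients named in the abstract can be sketched in a few lines of Python. The normalized compression distance is NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C gives the compressed length. Everything else here is an assumption made for the sketch and is not taken from the paper or from CompLearn: zlib stands in for the compressor, manuscripts are represented as word-aligned lists with None marking a missing word, and the hypothetical helper impute_randomly fills each gap with a word drawn at random from the other manuscripts at the same position.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def impute_randomly(texts, rng):
    """Fill missing words (None) with random words seen at the same position.

    Assumption: all texts are word-aligned lists of equal length. If no text
    has a word at a position, the gap is filled with an empty string.
    """
    imputed = []
    for text in texts:
        row = []
        for pos, word in enumerate(text):
            if word is None:
                candidates = [t[pos] for t in texts if t[pos] is not None]
                word = rng.choice(candidates) if candidates else ""
            row.append(word)
        imputed.append(row)
    return imputed


# Toy aligned manuscripts; None marks a lost word (made-up data).
texts = [
    ["in", "the", "beginning", "was", "the", "word"],
    ["in", "the", "begynning", None, None, "word"],
    ["at", "the", "beginning", "was", "a", "word"],
]
filled = impute_randomly(texts, random.Random(0))
distance = ncd(" ".join(filled[0]).encode(), " ".join(filled[1]).encode())
print(filled[1], distance)
```

The intuition behind imputing before compressing is that a gap otherwise makes a manuscript shorter and easier to compress, which shifts its distance to every other manuscript for reasons unrelated to copying history. Texts this short are dominated by compressor overhead, so the printed distance is only meaningful on realistic input sizes.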

EPrint Type: Conference or Workshop Item (Paper)
Subjects: Computational, Information-Theoretic Learning with Statistics
ID Code: 5528
Deposited By: Teemu Roos
Deposited On: 15 February 2010