Some observations on the applicability of normalized compression distance to stemmatology

Abstract

The objective of stemmatology is to construct a family tree of documents that have been generated by a process of repeated duplication and modification. In earlier benchmark experiments on computer-assisted stemmatology, the CompLearn software package was found to perform well on simpler test cases, but it failed to give satisfactory results on a more complex and realistic data set. This was surprising, given its excellent results in related phylogenetic tasks, where it was able to reconstruct accurate family trees of biological species based on their genome sequences. We suggest that the failure on the complex stemmatological data set is due to difficulties in handling missing data. This explains many features of the incorrect solution produced by CompLearn, and leads to a simple random imputation strategy to fill in the missing values. The strategy is shown to improve the performance by a large margin.
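
For readers unfamiliar with the two ingredients named in the abstract, the following is a minimal sketch, not the authors' implementation. It computes the standard normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed length, and then fills gaps by random imputation. Several things here are assumptions made for illustration: zlib is simply a convenient standard-library compressor and not necessarily the one used in the experiments; the documents are assumed to be aligned word lists with `None` marking a missing word; and the helper names `compressed_size`, `ncd`, `impute_random`, and `to_bytes` are hypothetical. The abstract does not specify how replacement values are drawn, so sampling from the words observed at the same position in other documents is one plausible reading only.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def impute_random(docs, rng):
    """Return a copy of docs with missing words filled in at random.

    docs is a list of equal-length word lists; None marks a missing word.
    Assumption: each gap is filled with a word drawn uniformly from the
    words observed at the same position in the other documents. Positions
    missing in every document are left as None.
    """
    filled = [list(doc) for doc in docs]
    for pos in range(len(docs[0])):
        observed = [doc[pos] for doc in docs if doc[pos] is not None]
        if not observed:
            continue
        for doc in filled:
            if doc[pos] is None:
                doc[pos] = rng.choice(observed)
    return filled


def to_bytes(words) -> bytes:
    """Join the non-missing words of one document into a byte string."""
    return " ".join(w for w in words if w is not None).encode("utf-8")


if __name__ == "__main__":
    docs = [
        ["in", "the", "beginning", None, None],
        ["in", "the", "beginning", "was", "light"],
        ["at", "the", "beginning", "was", "night"],
    ]
    filled = impute_random(docs, random.Random(0))
    # Pairwise distances after imputation, as input to a tree-building step.
    for i in range(len(filled)):
        for j in range(i + 1, len(filled)):
            print(i, j, round(ncd(to_bytes(filled[i]), to_bytes(filled[j])), 3))
```

On texts this short the distances are dominated by compressor overhead, so the printed values only show the mechanics; realistic use would involve full-length documents.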