PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

A compression-based method for stemmatic analysis
Teemu Roos, Tuomas Heikkilä and Petri Myllymäki
In: 17th European Conference on Artificial Intelligence (ECAI'06), 28 Aug - 1 Sep 2006, Riva del Garda, Italy.


Stemmatology studies relations among different variants of a text that has been gradually altered as a result of imperfectly copying the text over and over again. Applications are mainly in the humanities, especially textual criticism, but the methods can be used to study the evolution of any symbolic objects, including chain letters and computer viruses. We propose an algorithm for stemmatic analysis based on a minimum-information criterion and stochastic tree optimization. Our approach is related to phylogenetic reconstruction criteria such as maximum parsimony and maximum likelihood, and builds upon algorithmic techniques developed for bioinformatics. Unlike many earlier methods, the proposed method does not require significant preprocessing of the data but instead operates directly on aligned text files. We demonstrate our method on a real-world experiment involving all 52 known variants of the legend of St. Henry of Finland, and provide the first computer-generated family tree of the legend. The obtained tree of the variants is supported to a large extent by results obtained with more traditional methods, and identifies a number of previously unrecognized relations.
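
As a rough illustration of what a compression-based, minimum-information criterion can look like, the sketch below scores a candidate family tree by the number of compressed bytes needed to describe each variant given its parent. It is not the authors' implementation: the function names, the use of Python's zlib as the compressor, and the approximation of the conditional cost as C(parent + child) - C(parent) are all assumptions made for illustration, and the stochastic tree search mentioned in the abstract is not shown.

```python
import zlib
from typing import Dict, List, Tuple


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a stand-in for information content)."""
    return len(zlib.compress(data, 9))


def conditional_cost(parent: str, child: str) -> int:
    """Approximate cost of describing `child` once `parent` is known.

    Assumption: C(child | parent) is approximated by C(parent + child) - C(parent).
    """
    p = parent.encode("utf-8")
    c = child.encode("utf-8")
    return compressed_size(p + c) - compressed_size(p)


def tree_cost(root: str, edges: List[Tuple[str, str]], texts: Dict[str, str]) -> int:
    """Total cost of a tree: the root text alone, plus every child given its parent.

    `edges` is a list of (parent_name, child_name) pairs; `texts` maps names to texts.
    """
    total = compressed_size(texts[root].encode("utf-8"))
    for parent, child in edges:
        total += conditional_cost(texts[parent], texts[child])
    return total


# Hypothetical toy variants, invented for this example only.
texts = {
    "A": "In the days of King Eric the holy bishop Henry came to Finland to preach "
         "the faith and was slain on the ice of a frozen lake by a peasant named Lalli.",
    "B": "In the days of King Eric the holy bishop Henry came to Finland to preach "
         "the faith and was slain on the ice of a frozen lake by a farmer named Lalli.",
    "C": "Quick zebras jump over lazy dogs while vexed wizards box quietly near "
         "the misty harbour at dawn.",
}

# A near-copy of the parent should cost less to describe than an unrelated text.
cost_near = conditional_cost(texts["A"], texts["B"])
cost_far = conditional_cost(texts["A"], texts["C"])
total = tree_cost("A", [("A", "B"), ("A", "C")], texts)
```

Under this kind of criterion, a tree that places closely related variants next to each other gets a lower total cost, so a search procedure can compare candidate trees by calling `tree_cost` on each.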

EPrint Type: Conference or Workshop Item (Poster)
Additional Information: This is an extended version of a two-page summary with the same title to appear in the Proceedings of ECAI'06.
Project Keyword: UNSPECIFIED
Subjects: Natural Language Processing
ID Code: 2119
Deposited By: Teemu Roos
Deposited On: 09 June 2006