Robustness and Statistical Significance of PAM-like Matrices for Cognate Identification
Antonella Delmestri and Nello Cristianini
Abstract: This paper tests the influence of training dataset size on a recently proposed orthographic learning system, inspired by biological sequence analysis and successfully applied to cognate identification. The system automatically aligns a given set of cognate pairs to produce a meaningful training dataset, learns substitution parameters from it using a PAM-like technique, and utilises them to recognise cognate pairs. The results show no difference in performance when the system is trained with about 650 cognate pairs extracted from 6 Indo-European languages or with about 62,000 cognate pairs extracted from 76 Indo-European languages. In both cases the system outperforms all comparable orthographic and phonetic methods previously proposed in the literature. This paper also investigates the statistical significance of these results when compared with earlier proposals. The outcome confirms that the performance reached by the system with both training datasets is significantly higher than that achieved by all the other methods. Indeed, the training dataset size seems to influence neither the accuracy nor the statistical significance of this learning system, which needs only a very small amount of data to reach an outstanding performance.
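
As a rough illustration of what learning substitution parameters with a PAM-like technique can look like, the following Python sketch estimates log-odds scores from word pairs that are assumed to be already aligned (equal length, with "-" marking gaps). It is a simplified, single-step estimate and not necessarily the procedure used in this paper: the function names, the pseudocount smoothing, the symmetric counting and the three Italian/Spanish toy pairs are all assumptions made for the example.

```python
import math
from collections import Counter


def pam_like_log_odds(aligned_pairs, pseudocount=1.0):
    """Estimate log-odds substitution scores from pre-aligned word pairs.

    Hypothetical helper: score(x, y) = log2(P(x -> y) / f(y)), where
    P(x -> y) is a smoothed mutation probability and f(y) is the
    background frequency of y in the training alignments.
    """
    pair_counts = Counter()
    alphabet = set()
    for a, b in aligned_pairs:
        if len(a) != len(b):
            raise ValueError("aligned words must have equal length")
        for x, y in zip(a, b):
            if x == "-" or y == "-":
                continue  # gap columns carry no substitution evidence
            # Count both directions so the resulting matrix is symmetric.
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1
            alphabet.update((x, y))

    symbols = sorted(alphabet)
    # Smoothed row totals (assumption: add-k smoothing over the alphabet).
    row_total = {
        x: sum(pair_counts[(x, y)] + pseudocount for y in symbols)
        for x in symbols
    }
    grand_total = sum(row_total.values())
    background = {x: row_total[x] / grand_total for x in symbols}

    scores = {}
    for x in symbols:
        for y in symbols:
            mutation_prob = (pair_counts[(x, y)] + pseudocount) / row_total[x]
            scores[(x, y)] = math.log2(mutation_prob / background[y])
    return scores


def score_alignment(a, b, scores, unseen=0.0):
    """Sum substitution scores over the non-gap columns of an aligned pair."""
    return sum(
        scores.get((x, y), unseen)
        for x, y in zip(a, b)
        if x != "-" and y != "-"
    )


# Toy training data (illustrative only, not the paper's datasets).
toy_pairs = [("notte", "noche"), ("latte", "leche"), ("otto", "ocho")]
toy_scores = pam_like_log_odds(toy_pairs)
```

With these toy pairs, substitutions that recur in the training alignments (such as "t" aligned with "c") receive positive scores, while pairings never observed receive negative ones, so `score_alignment` rates a training pair higher than an unrelated pair of the same length. A fuller PAM-style treatment would also extrapolate the mutation probabilities to larger evolutionary distances by matrix powers, which this sketch omits.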