String Similarity Measures and PAM-like Matrices for Cognate Identification
Antonella Delmestri and Nello Cristianini
We present a new automatic learning system for the identification of cognates, words that derive from a common ancestor and share the same etymological origin. Our approach combines and adapts several techniques developed for biological sequence analysis to the natural language processing environment. We design a linguistic-inspired matrix to align sensibly our training dataset. We introduce a PAM-like technique, similar to the one successfully used in biological sequence alignment, in order to produce substitution matrices. We propose a novel family of parameterised string similarity measures and we apply them together with the PAM-like matrices to the task of cognate identification. We develop and test our proposal on standard datasets of Indo-European languages in orthographic format based on the Latin alphabet, but it could easily be adjusted to datasets using any other alphabet, including the phonetic alphabet if data in phonetic transcription were available. We compare our system with other models reported in the literature and the results show that our method outperforms in terms of precision both orthographic and phonetic approaches formerly presented.