PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

A contextual normalised edit distance
Colin de la Higuera and Luisa Mico
In: SISAP 2007, 10-12 Apr 2007, Cancun, Mexico.

Abstract

To better fit a variety of pattern recognition problems over strings, using a normalised version of the edit (Levenshtein) distance is considered an appropriate approach. The goal of the normalisation is to take the lengths of the strings into account. The challenge is to define a normalisation that preserves the desired mathematical properties (specifically the triangle inequality), is meaningful, and can be computed economically. We define a new, contextual normalisation in which the cost of each edit operation is divided by the length of the string on which the operation takes place (more precisely, by the length of the longer of the two strings involved). We prove that this contextual edit distance is a metric and that it can be computed through an extension of the usual dynamic programming algorithm for the edit distance. Experiments show that the distance can be computed quickly, obtains good results in classification tasks, and has a low intrinsic dimension in comparison with other normalised edit distances.
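
The definition in the abstract can be illustrated with a small brute-force sketch. It is not the dynamic programming algorithm the paper refers to. It is a shortest-path search written directly from the stated cost rule, under several assumptions that do not come from the record: every edit operation has unit cost before normalisation, intermediate strings use only symbols that occur in the two inputs, and intermediate strings are never longer than the two inputs combined. The function names are hypothetical, and the search is only practical for very short strings.

```python
import heapq
from fractions import Fraction


def neighbours(u, alphabet):
    """Yield every string reachable from u by one insertion, deletion or substitution."""
    for i in range(len(u) + 1):
        for c in alphabet:
            yield u[:i] + c + u[i:]              # insertion
    for i in range(len(u)):
        yield u[:i] + u[i + 1:]                  # deletion
        for c in alphabet:
            if c != u[i]:
                yield u[:i] + c + u[i + 1:]      # substitution


def contextual_distance_bruteforce(x, y):
    """Cheapest sequence of edits from x to y, where one edit u -> v costs
    1 / max(len(u), len(v)). Exhaustive Dijkstra search, for illustration only."""
    alphabet = sorted(set(x) | set(y))   # assumption: no other symbols are useful
    max_len = len(x) + len(y)            # assumption: bound on intermediate length
    best = {x: Fraction(0)}
    heap = [(Fraction(0), x)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == y:
            return d
        if d > best[u]:
            continue                     # stale heap entry
        for v in neighbours(u, alphabet):
            if len(v) > max_len:
                continue
            nd = d + Fraction(1, max(len(u), len(v)))
            if v not in best or nd < best[v]:
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return None                          # not reached when y is within the length bound


print(contextual_distance_bruteforce("ab", "ba"))   # 2/3
```

Under these assumptions, "ab" and "ba" are at distance 2/3: inserting a symbol to reach "bab" and then deleting one each cost 1/3, which is cheaper than two substitutions at 1/2 each. This shows how the context (the current string length) changes which edit script is optimal.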

EPrint Type: Conference or Workshop Item (Paper)
Subjects: Information Retrieval & Textual Information Access
ID Code: 3260
Deposited By: Colin de la Higuera
Deposited On: 02 February 2008