Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees
Konrad Rieck, Pavel Laskov and Sören Sonnenburg
In: NIPS 2006, 4-8 Dec 2006, Vancouver, Canada.
We propose a generic algorithm for computation of similarity
measures for sequential data. The algorithm uses generalized suffix
trees for efficient calculation of various kernel, distance and
non-metric similarity functions. Its worst-case run-time is linear
in the lengths of the compared sequences and independent of the underlying
embedding language, which can cover words, k-grams or all
contained subsequences. Experiments with network intrusion
detection, DNA analysis and text processing applications demonstrate
the utility of distances and similarity coefficients for sequences
as alternatives to classical kernel functions.
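
To make the idea of an embedding language concrete, here is a minimal sketch of three similarity measures over a k-gram embedding. It is an illustration and not the algorithm proposed above: it counts k-grams with plain hash maps instead of a generalized suffix tree. The function names and the choice of a linear kernel, a Manhattan distance and a Jaccard coefficient are assumptions made for this example.

```python
from collections import Counter


def kgram_counts(seq, k):
    """Count every contiguous k-gram of seq (the embedding language here)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def linear_kernel(x, y, k):
    """Inner product of the two k-gram count vectors."""
    cx, cy = kgram_counts(x, k), kgram_counts(y, k)
    # Only k-grams present in both sequences contribute to the sum.
    return sum(cx[w] * cy[w] for w in cx.keys() & cy.keys())


def manhattan_distance(x, y, k):
    """Sum of absolute differences of k-gram counts."""
    cx, cy = kgram_counts(x, k), kgram_counts(y, k)
    # A Counter returns 0 for k-grams missing from one sequence.
    return sum(abs(cx[w] - cy[w]) for w in cx.keys() | cy.keys())


def jaccard_coefficient(x, y, k):
    """Shared k-grams divided by all distinct k-grams (presence only)."""
    sx, sy = set(kgram_counts(x, k)), set(kgram_counts(y, k))
    union = sx | sy
    # Assumption: two sequences without any k-grams count as identical.
    return len(sx & sy) / len(union) if union else 1.0
```

For example, with k = 2 the sequences "abc" and "bcd" share only the 2-gram "bc", so the kernel value is 1, the Manhattan distance is 2, and the Jaccard coefficient is 1/3. All three measures are computed from the same k-gram counts, which is what allows a single matching procedure to serve kernels, distances and similarity coefficients alike.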