Efficient Computation of Gapped Substring Kernels on Large Alphabets ## AbstractWe present a sparse dynamic programming algorithm that, given two strings $s,t$, a gap penalty $\lambda$, and an integer $p$, computes the value of the gap-weighted length-$p$ subsequences kernel. The algorithm works in time $O(p |M| \log \min(|s|,|t|))$, where $M = \set{(i,j)| s_i = t_j}$ is the set of matches of characters in the two sequences. The algorithm is easily adapted to handle bounded length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps and the number of gaps as well as incorporating character-specific match/gap penalties. The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm both on synthetic and newswire article data. Based on the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets the trie-based approach is best if the maximum number of allowed gaps is strongly restricted.
[Edit] |