PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Efficient Computation of Gapped Substring Kernels on Large Alphabets
Juho Rousu and John Shawe-Taylor
Journal of Machine Learning Research 2004.


We present a sparse dynamic programming algorithm that, given two strings $s,t$, a gap penalty $\lambda$, and an integer $p$, computes the value of the gap-weighted length-$p$ subsequences kernel. The algorithm runs in time $O(p |M| \log \min(|s|,|t|))$, where $M = \{(i,j) \mid s_i = t_j\}$ is the set of matches of characters in the two sequences. The algorithm is easily adapted to handle bounded-length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps and the number of gaps, as well as incorporating character-specific match/gap penalties. The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm on both synthetic and newswire article data. In the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets, the trie-based approach is best if the maximum number of allowed gaps is strongly restricted.
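
To make the quantities in the abstract concrete, the following is a minimal Python sketch of the full dynamic programming baseline, not the sparse algorithm of the paper. It assumes the commonly used definition of the gap-weighted kernel, in which every pair of matching length-$p$ subsequences of $s$ and $t$ contributes $\lambda$ raised to the sum of the two spans they occupy. The function names `match_set`, `gap_weighted_kernel`, and `brute_force_kernel` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def match_set(s, t):
    """Return M = {(i, j) | s[i] == t[j]} using 0-based indices."""
    return {(i, j) for i, a in enumerate(s) for j, b in enumerate(t) if a == b}


def gap_weighted_kernel(s, t, p, lam):
    """Full-DP sketch in O(p * |s| * |t|) time (assumed span-based weighting)."""
    n, m = len(s), len(t)
    if p < 1 or n < p or m < p:
        return 0.0
    # dps[i][j]: total weight of common subsequences of the current length
    # whose last characters are exactly s[i-1] and t[j-1] (row/column 0 is padding).
    dps = [[lam * lam if i and j and s[i - 1] == t[j - 1] else 0.0
            for j in range(m + 1)] for i in range(n + 1)]
    for _ in range(p - 1):
        # dp[i][j]: sum of dps[k][l] * lam**((i - k) + (j - l)) over k <= i, l <= j,
        # accumulated by inclusion-exclusion over the two neighbouring prefixes.
        dp = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                dp[i][j] = (dps[i][j] + lam * dp[i - 1][j] + lam * dp[i][j - 1]
                            - lam * lam * dp[i - 1][j - 1])
        # Extend every shorter subsequence by one matching character pair.
        dps = [[lam * lam * dp[i - 1][j - 1] if i and j and s[i - 1] == t[j - 1] else 0.0
                for j in range(m + 1)] for i in range(n + 1)]
    return sum(map(sum, dps))


def brute_force_kernel(s, t, p, lam):
    """Exponential-time reference that enumerates all index tuples; for checking only."""
    total = 0.0
    for idx_s in combinations(range(len(s)), p):
        for idx_t in combinations(range(len(t)), p):
            if all(s[a] == t[b] for a, b in zip(idx_s, idx_t)):
                span_s = idx_s[-1] - idx_s[0] + 1
                span_t = idx_t[-1] - idx_t[0] + 1
                total += lam ** (span_s + span_t)
    return total
```

Only cells with $s_i = t_j$ ever receive a nonzero `dps` value, so on large alphabets most of the table is zero; the size of `match_set(s, t)` is the $|M|$ that appears in the stated running time. The brute-force function is included only as a check on small inputs.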

EPrint Type: Article
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Theory & Algorithms; Information Retrieval & Textual Information Access
ID Code: 335
Deposited By: Juho Rousu
Deposited On: 12 December 2004