PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Efficient Computation of Gap-weighted String Kernels on Large Alphabets
Juho Rousu and John Shawe-Taylor
In: Learning Methods for Text Understanding and Mining, 26 - 29 January 2004, Grenoble, France.


We present a sparse dynamic programming algorithm that, given two strings s, t, a gap penalty lambda, and an integer p, computes the value of the gap-weighted length-p subsequences kernel. The algorithm works in time O(p|M| log min(|s|, |t|)), where M = {(i, j) | s_i = t_j} is the set of matches of characters in the two sequences. The new algorithm is empirically evaluated against a full dynamic programming approach and a trie-based algorithm on synthetic data. Based on the experiments, the full dynamic programming approach is the fastest on short strings, and on long strings if the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets the trie-based approach is best if the maximum number of allowed gaps is strongly restricted.
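
To make the quantities in the abstract concrete, the following minimal Python sketch computes the gap-weighted length-p subsequences kernel with the standard full dynamic programming recursion in O(p|s||t|) time, together with the match set M. It is not the sparse algorithm presented in the paper. The function names `match_set` and `gap_weighted_kernel` are illustrative assumptions, and the sketch assumes the usual definition in which each common subsequence contributes lambda raised to the sum of the lengths of the spans it occupies in s and in t.

```python
def match_set(s, t):
    """Return M = {(i, j) | s_i = t_j} using 1-based positions (illustrative helper)."""
    return {(i + 1, j + 1)
            for i, a in enumerate(s)
            for j, b in enumerate(t)
            if a == b}


def gap_weighted_kernel(s, t, p, lam):
    """Gap-weighted length-p subsequences kernel by full dynamic programming.

    Assumed definition: every pair of occurrences of a common length-p
    subsequence contributes lam ** (span in s + span in t).
    This is the dense O(p|s||t|) baseline, not the paper's sparse method.
    """
    n, m = len(s), len(t)
    if p < 1 or min(n, m) < p:
        return 0.0

    # ends[i][j]: total weight of common subsequences of the current length
    # whose last characters are s[i-1] and t[j-1] (tables are 1-indexed).
    ends = [[0.0] * (m + 1) for _ in range(n + 1)]
    for (i, j) in match_set(s, t):
        ends[i][j] = lam * lam  # length-1 subsequences: span 1 in each string

    for _ in range(2, p + 1):
        # acc[i][j] = sum over k <= i, l <= j of lam**((i-k)+(j-l)) * ends[k][l]
        acc = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i][j] = (ends[i][j]
                             + lam * acc[i - 1][j]
                             + lam * acc[i][j - 1]
                             - lam * lam * acc[i - 1][j - 1])

        # Extend each shorter subsequence by one matching pair (i, j).
        longer = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if s[i - 1] == t[j - 1]:
                    longer[i][j] = lam * lam * acc[i - 1][j - 1]
        ends = longer

    return sum(sum(row) for row in ends)


if __name__ == "__main__":
    print(sorted(match_set("cat", "car")))
    print(gap_weighted_kernel("cat", "cat", 2, 0.5))
```

Only the cells in M can be non-zero in the `ends` table, which is why the dense tables become wasteful when the alphabet is large and matches are rare; this is the case the abstract reports as favouring the sparse approach.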

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Natural Language Processing; Theory & Algorithms; Information Retrieval & Textual Information Access
ID Code: 24
Deposited By: Steve Gunn
Deposited On: 09 May 2004