PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

A probabilistic framework for mismatch and profile string kernels
A. Vinokourov, Craig Saunders and A. Soklakov
(2005) Technical Report. PASCAL.

Abstract

There have recently been numerous applications of kernel methods in the field of bioinformatics. In particular, the problem of protein homology has served as a benchmark for the performance of many new kernels which operate directly on strings (such as amino-acid sequences). Several new kernels have been developed and successfully applied to this type of data, including spectrum, string, mismatch, and profile kernels. In this paper we introduce a general probabilistic framework for string kernels that is based on considering limit cases of sequences which are generated as mutations of the original sequence. The framework uses the Fisher-kernel approach and includes spectrum, mismatch and profile kernels, among others, as special cases. Moreover, it not only generalizes existing kernels but also yields a new kernel, a gappy profile kernel, which follows naturally from a previously introduced generalized Markov model. The use of a probabilistic model provides additional flexibility both in the definition of features and in their re-weighting through feature selection methods, prior knowledge or semi-supervised approaches which use data repositories such as BLAST. Furthermore, we consider introducing a non-mutation probability which can be used with the framework and with standard profile kernels, and see that this parameter can have a strong effect on the results. We give details of the framework, place well-known kernels in the framework and give preliminary experimental results which show some effects of using the probabilistic approach and the gappy profile kernel.
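For readers unfamiliar with the kernels named in the abstract, the following is a minimal sketch of the standard (non-probabilistic) spectrum and mismatch kernels, which the report describes as special cases of its framework. It is not the report's own method. The function names, the brute-force enumeration over all k-mers, and the two-letter alphabet in the usage lines are illustrative assumptions. A k-mer feature is counted for a sequence whenever it lies within m mismatches of a k-mer occurring in that sequence; with m = 0 this reduces to the spectrum kernel.

```python
from collections import Counter
from itertools import product


def spectrum_features(seq, k):
    """Count every contiguous k-mer occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_features(seq, k, m, alphabet):
    """Feature map indexed by all k-mers over the alphabet.

    Each k-mer in seq contributes to every candidate k-mer within
    Hamming distance m. Brute force (|alphabet|**k candidates), so
    only suitable for small k; chosen here for clarity, not speed.
    """
    observed = spectrum_features(seq, k)
    feats = Counter()
    for cand in map("".join, product(alphabet, repeat=k)):
        for kmer, count in observed.items():
            if hamming(cand, kmer) <= m:
                feats[cand] += count
    return feats


def dot(f, g):
    """Inner product of two sparse feature maps."""
    return sum(c * g[key] for key, c in f.items())


def spectrum_kernel(s, t, k):
    return dot(spectrum_features(s, k), spectrum_features(t, k))


def mismatch_kernel(s, t, k, m, alphabet):
    return dot(mismatch_features(s, k, m, alphabet),
               mismatch_features(t, k, m, alphabet))


# Usage on toy strings over a hypothetical alphabet {A, B}.
k_spec = spectrum_kernel("ABAB", "BABA", 2)
k_mis = mismatch_kernel("AB", "AA", 2, 1, "AB")
```

Allowing one mismatch lets "AB" and "AA" share features even though they have no 2-mer in common, which is the kind of tolerance to mutation that the abstract's framework models probabilistically.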

EPrint Type: Monograph (Technical Report)
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Theory & Algorithms
ID Code: 1766
Deposited By: Craig Saunders
Deposited On: 28 November 2005