An Automated Combination of Kernels for Predicting Protein Subcellular Localization
Protein subcellular localization is a crucial ingredient in many important inferences about cellular processes, including the prediction of protein function and protein interactions. We propose a new class of protein sequence kernels that considers all motifs, including motifs with gaps. Kernels in this class can incorporate pairwise amino acid distances into their computation. We utilize an extension of the multiclass support vector machine (SVM) method that solves the protein subcellular localization problem directly, without resorting to the common approach of splitting it into several binary classification problems. To search automatically over families of possible amino acid motifs, we optimize over multiple kernels at the same time. We compare our automated approach to four other predictors on three different datasets and show that it performs better than the current state of the art. Furthermore, our method provides some insight into which features are most useful for determining subcellular localization, and these features are in agreement with biological reasoning.
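To make the idea of a motif kernel with gaps and a combination of several kernels concrete, the following is a minimal sketch and not the kernel class defined in this paper. It assumes the simplest possible gapped motif, a pair of amino acids separated by a fixed number of positions, and it uses fixed, hand-chosen combination weights, whereas the approach described above optimizes over the kernels. The function names `gapped_pair_counts`, `gapped_pair_kernel`, and `combined_kernel` are hypothetical and introduced only for illustration.

```python
from collections import Counter


def gapped_pair_counts(seq, gap):
    """Count motifs (a, b) where b occurs exactly gap + 1 positions after a.

    Assumption: a "motif with a gap" is reduced here to a pair of residues
    with `gap` arbitrary residues between them.
    """
    step = gap + 1
    return Counter((seq[i], seq[i + step]) for i in range(len(seq) - step))


def gapped_pair_kernel(s, t, gap):
    """Inner product of the gapped-pair count vectors of sequences s and t."""
    counts_s = gapped_pair_counts(s, gap)
    counts_t = gapped_pair_counts(t, gap)
    return sum(n * counts_t[motif] for motif, n in counts_s.items())


def combined_kernel(s, t, weights):
    """Weighted sum of gapped-pair kernels.

    `weights` maps a gap length to a nonnegative weight. The weights are
    fixed inputs in this sketch; learning them is outside its scope.
    """
    return sum(w * gapped_pair_kernel(s, t, gap) for gap, w in weights.items())


if __name__ == "__main__":
    a, b = "MKVLAAGIV", "MKVLSAGIV"
    # Equal weight on adjacent pairs (gap 0) and pairs one residue apart (gap 1).
    print(combined_kernel(a, b, {0: 0.5, 1: 0.5}))
```

Because each base kernel is an inner product of count vectors and the weights are nonnegative, the weighted sum is itself a valid kernel, so it could be passed to any kernel-based classifier as a precomputed Gram matrix.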