Accurate Splice Site Detection for Caenorhabditis elegans
We propose a new system for predicting the splice form of Caenorhabditis elegans genes. As a first step, we generate a clean set of genes from available expressed sequence tags (ESTs) and complete complementary DNA (cDNA) sequences. From all such genes we then generate potential acceptor and donor sites as they would be required by any gene finder. This yields a clean set of true and decoy splice sites. In a second step, we use support vector machines (SVMs) with appropriately designed kernels to learn to distinguish between true and decoy sites. Using the newly generated data and the novel kernels, we considerably improve on our previous results for the same task. In the last step, we design and test a new splice finder system that combines the SVM predictions with additional statistical information about splicing. Using this system, we are able to predict the exon-intron structure of a given gene whose translation initiation site and stop codon are known. The system has been tested successfully on a newly generated set of genes and compared with GenScan. We found that our system predicts the correct splice form for more than 92% of these genes, whereas GenScan achieves only 77.5% accuracy.
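
The generation of true and decoy splice sites described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes that candidate donor sites are all occurrences of the canonical GT dinucleotide and candidate acceptor sites all occurrences of AG, and that a candidate is labeled true (+1) when it coincides with an intron boundary confirmed by an EST or cDNA alignment and decoy (-1) otherwise. The function names, the toy sequence, and the intron coordinates are hypothetical.

```python
from typing import Iterable, List, Tuple


def candidate_sites(seq: str) -> Tuple[List[int], List[int]]:
    """Return 0-based start positions of every GT (donor candidate)
    and every AG (acceptor candidate) dinucleotide in seq."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


def label_sites(candidates: Iterable[int],
                true_positions: Iterable[int]) -> List[Tuple[int, int]]:
    """Label each candidate +1 if it is a confirmed site, else -1 (decoy)."""
    confirmed = set(true_positions)
    return [(pos, 1 if pos in confirmed else -1) for pos in candidates]


# Toy example (hypothetical data): one intron spanning positions 3..13,
# i.e. the substring GTAAGTTTCAG, so the true donor GT starts at 3 and
# the true acceptor AG starts at 12.
toy_seq = "ATGGTAAGTTTCAGGCTAA"
donors, acceptors = candidate_sites(toy_seq)
donor_labels = label_sites(donors, true_positions=[3])
acceptor_labels = label_sites(acceptors, true_positions=[12])
```

Labeled positions of this kind, together with a sequence window around each one, would form the training examples for a classifier such as an SVM. The window size and the kernel are design choices that this sketch does not cover.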