PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

mGene: A Novel Discriminative Gene Finding System
Gabriele Schweikert, Georg Zeller, Alexander Zien, Jonas Behr, Cheng Soon Ong, Petra Philips, Anja Bohlen, Sören Sonnenburg and Gunnar Raetsch
Genome Research Volume 19, Number 11, pp. 2133-2143, 2009.


We present a highly accurate gene prediction system for eukaryotic genomes, called mGene. It combines in an unprecedented manner the flexibility of generalized hidden Markov models with the predictive power of modern machine learning methods, such as Support Vector Machines (SVMs). Its excellent performance was demonstrated in an objective competition based on the genome of the nematode Caenorhabditis elegans (Coghlan et al., 2008). Considering the average of sensitivity and specificity, the developmental version of mGene exhibited the best prediction performance at the nucleotide, exon, and transcript levels for ab initio and multiple-genome gene prediction tasks. The fully developed version shows superior performance in ten out of twelve evaluation criteria compared to the other participating gene finders, including Fgenesh++ (Salamov and Solovyev, 2000) and Augustus (Stanke et al., 2006). An in-depth analysis of mGene's genome-wide predictions revealed that approximately 2,200 predicted genes were not contained in the current genome annotation. Testing a subset of 57 of these genes by RT-PCR and sequencing, we confirmed expression for 24 (42%) of them. mGene missed 300 annotated genes, out of which 205 were unconfirmed. RT-PCR testing of 24 of these genes resulted in a success rate of merely 8%. These findings suggest that even the gene catalog of a well-studied organism such as C. elegans can be substantially improved by mGene's predictions. We also provide gene predictions for the four nematodes C. briggsae, C. brenneri, C. japonica and C. remanei (Stein et al., 2003; Sternberg et al., 2003). Comparing the resulting proteomes among these organisms and to the known protein universe, we identified many species-specific gene inventions. In a quality assessment of several available annotations for these genomes, we find that mGene's predictions are most accurate.
Availability: mGene is available as source code under the GNU Public License from the project website and as a Galaxy-based webserver. Moreover, the gene predictions have been included in the WormBase annotation and are also available from the project website.
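
The ranking criterion named in the abstract, the average of sensitivity and specificity, can be illustrated with a minimal sketch. It assumes the convention common in gene-finding evaluations, where "specificity" means TP / (TP + FP), that is, precision; the abstract does not spell out the definition. The function names and the counts passed to `mean_sn_sp` are hypothetical and are not results from the paper. Only the 24-of-57 RT-PCR figure comes from the abstract.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of annotated features that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp: int, fp: int) -> float:
    """Gene-finding 'specificity' (assumed convention): TP / (TP + FP)."""
    return tp / (tp + fp)


def mean_sn_sp(tp: int, fp: int, fn: int) -> float:
    """Average of sensitivity and specificity, the summary score in the abstract."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tp, fp))


# Hypothetical exon-level counts, for illustration only (not from the paper).
score = mean_sn_sp(tp=90, fp=10, fn=30)

# RT-PCR confirmation rate reported in the abstract: 24 of 57 tested genes.
confirmed_rate = 24 / 57

print(f"mean(Sn, Sp) = {score:.3f}")
print(f"confirmed novel genes = {confirmed_rate:.0%}")
```

The same score can be computed separately at the nucleotide, exon, and transcript levels by changing what counts as a true positive at each level.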

EPrint Type: Article
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Learning/Statistics & Optimisation
ID Code: 6852
Deposited By: Sören Sonnenburg
Deposited On: 08 April 2010