Data fusion based gene function prediction using ensemble methods.
Matteo Re and Giorgio Valentini
In: BITS 2009. Sixth Annual Meeting of the Bioinformatics Italian Society, 18-20 Mar 2009, Genova, Italy.
The integration of multiple sources of heterogeneous biomolecular data is a key issue for the prediction of gene functions at genome-wide level. Several approaches have been proposed in the
literature, ranging from functional linkage networks to graphical models, vector-space integration
and kernel fusion methods.
Nevertheless, all these approaches suffer from limitations and drawbacks, due to their limited scalability to multiple data sources and to their limited modularity when new data sources are added or when
the available biomolecular data are characterized by different structural features.
A possible new approach is based on ensemble methods, i.e. committees of learning machines, but
to our knowledge few works along this line have been proposed in the literature.
There are several reasons to apply ensemble methods in the context of genomic data fusion for gene
function prediction. First, biomolecular data that differ in their structural characteristics (e.g.
numerical vectors, strings or graphs) can be easily integrated, because with ensemble methods the
integration is performed at the decision level. Moreover, as new types of biomolecular data are
made available, previously trained ensembles are able to embed new data sources by training only
the base learners devoted to the newly added data, without retraining the entire ensemble. Finally,
most ensemble methods scale well with the number of available data sources, so the scalability
problems affecting other data fusion approaches are avoided.
We performed our tests by integrating S. cerevisiae data collected from the literature and comprising
protein-protein interactions, protein domains, expression data and BLASTP pairwise similarities.
The investigated genes were labelled according to the functional annotations available in the MIPS
Functional Catalogue (FunCat) version 2.1, considering only the 15 functional classes at the highest
level of the FunCat hierarchy. Gene function prediction was performed in a binary
classification setting, classifying each gene as belonging either to the current “target” functional class or
to “other functional classes”.
We applied a sigmoid fitting to the outputs of SVMs (each trained on a different dataset), in order to
obtain an estimate of the probability that a given example belongs to a functional class. We then
combined the outputs produced by the SVMs through the classical weighted average rule, using
weights calculated according to a convex combination rule and a logarithmic transformation, and
through the Decision Templates combiner.
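
The two combiners mentioned above can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes that each base learner already outputs a probability for the target class, that the convex weights are proportional to a per-learner quality score (here a hypothetical F-measure estimated on held-out data), and that Decision Templates compare profiles by squared Euclidean distance. The logarithmic weighting is not shown because its exact form is not given here. All function names and toy values are illustrative.

```python
def convex_weights(f_scores):
    """Weights proportional to each base learner's score; they sum to 1.
    Assumption: the score is an F-measure estimated on held-out data."""
    total = sum(f_scores)
    return [f / total for f in f_scores]

def weighted_average(probs, weights):
    """Combine the per-source probabilities of one gene into a single score."""
    return sum(w * p for w, p in zip(weights, probs))

def fit_decision_templates(profiles, labels):
    """For each class, average the decision profiles of its training genes.
    A profile is the list of probabilities output by the base learners."""
    templates = {}
    for cls in set(labels):
        rows = [p for p, y in zip(profiles, labels) if y == cls]
        templates[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return templates

def predict_decision_template(profile, templates):
    """Return the class whose template is closest to the profile
    (squared Euclidean distance, one of several possible similarity measures)."""
    def dist(template):
        return sum((a - b) ** 2 for a, b in zip(profile, template))
    return min(templates, key=lambda cls: dist(templates[cls]))

# Toy data: three base learners (one per data source), four training genes.
train_profiles = [[0.9, 0.8, 0.7], [0.8, 0.6, 0.9], [0.2, 0.3, 0.1], [0.1, 0.4, 0.2]]
train_labels = [1, 1, 0, 0]  # 1 = target class, 0 = other classes
templates = fit_decision_templates(train_profiles, train_labels)

new_profile = [0.7, 0.4, 0.8]
weights = convex_weights([0.6, 0.5, 0.7])
score = weighted_average(new_profile, weights)
label = predict_decision_template(new_profile, templates)
```

Because both combiners operate only on the base learners' outputs, adding a data source amounts to appending one entry to each profile and, for the weighted average, one weight.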
According to the test-and-select methodology, we applied a variant of the "choose the best"
technique to select, for each function prediction task, several subsets of "optimal" classifiers.
The investigated ensemble methods were then used to combine the outputs produced by all the
component classifiers and by all the selected combinations of base learners. We then added a simple
feature selection method based on the t-test and the Benjamini-Hochberg p-value correction to select
the most relevant features and to reduce the high dimensionality that characterizes biomolecular data.
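
The Benjamini-Hochberg step of such a filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this work: the per-feature p-values are assumed to come from a two-sample t-test (target class versus other classes) computed elsewhere, and the function name, the significance level `alpha` and the toy p-values are hypothetical.

```python
def benjamini_hochberg_select(p_values, alpha=0.05):
    """Return the indices of the features kept by the Benjamini-Hochberg
    step-up procedure at false discovery rate alpha."""
    m = len(p_values)
    # Rank features by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep track of the largest rank k with p_(k) <= alpha * k / m.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    # All features ranked at or below the cutoff are selected.
    return sorted(order[:cutoff_rank])

# Toy p-values, one per feature (assumed output of per-feature t-tests).
p_values = [0.001, 0.2, 0.025, 0.5, 0.008]
selected = benjamini_hochberg_select(p_values, alpha=0.05)
```

Only the selected feature columns would then be passed to the base learner trained on that data source.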
We compared the performance obtained by the tested strategies under different experimental conditions in order to provide an overview of the capabilities of ensemble systems in gene function
prediction based on data fusion.
The ensembles outperformed the base learners in all the function prediction tasks. Performance was
further improved by the applied classifier selection strategy and by the feature filtering method.
Considering the F-measure, which summarizes both precision and recall, the experimental results show
that data fusion realized by means of ensemble systems is a valuable research line in gene function
prediction and that Decision Templates may represent a good choice for biomolecular data integration.
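
For reference, the F-measure is commonly defined as the harmonic mean of precision and recall. The sketch below assumes this standard balanced (F1) form, which may differ from the exact variant used in the experiments; the function name and the counts are illustrative.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from the counts of
    true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # no correct positive prediction: precision and recall are 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy counts for one functional class.
value = f_measure(tp=8, fp=2, fn=8)
```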