PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Multi-spectral biclustering for data described by multiple similarities
Farida Zehraoui and Florence d'Alché-Buc
In: Machine Learning in Systems Biology, 13-14 Sept 2008, Brussels.


In computational biology, objects of interest such as proteins or genes can be described from various points of view: as sequences, trees, nodes in a graph, or vectors. Often only similarity matrices are available to represent each of these heterogeneous views. Investigating the relationships among these data is an important step toward understanding biological function. Existing data mining approaches that deal with heterogeneous data aim to extract objects that are similar across all the views. As the number of datasets increases, it is often the case that no subset of objects is similar under all views simultaneously, except in trivial cases. We propose an extension of biclustering, called multi-spectral biclustering, that finds subgroups of objects that are similar to each other according to some of the views. The new algorithm is based on multiple low-dimensional embeddings of the data, obtained from the Laplacians of graphs [1] weighted by the various similarities, and on a generalization of the squared residue minimization biclustering algorithm [5]. We also propose to select the biclustering parameters using a stability criterion [2]. We have tested the multi-spectral approach on two biological applications and obtained good results. The first contains two classes of proteins (membrane proteins and ribosomal proteins) described by several data sets (the protein sequences, the hydropathy profiles of the proteins, etc.) [4]. The second contains yeast time-series gene expression data measured under several conditions that differ in the strain (wt, mec1, dun1) and the type of stress (MMS, Gamma, mock) [3].

References
[1] Belkin, M. and Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 14, 585-591 (2002).
[2] Ben-Hur, A., Elisseeff, A. and Guyon, I. A stability based method for discovering structure in clustered data. Pac Symp Biocomput, 7, 6-17 (2002).
[3] Gasch, A.P., Huang, M., Metzner, S., Botstein, D., Elledge, S.J. and Brown, P.O. Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol Biol Cell, 12, 2987-3003 (2001).
[4] Lanckriet, G.R.G., De Bie, T., Cristianini, N., Jordan, M.I. and Noble, W.S. A statistical framework for genomic data fusion. Bioinformatics, 20(16), 2626-2635 (2004).
[5] Sra, S., Cho, H., Dhillon, I.S. and Guan, Y. Minimum Sum-Squared Residue Co-Clustering of Gene Expression Data. SDM (2004).
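
As a rough illustration of the two ingredients named in the abstract, the sketch below builds one Laplacian eigenmap embedding per similarity matrix, stacks the embeddings column-wise so that each block of columns corresponds to one view, and evaluates a squared-residue objective on a candidate bicluster. It is not the authors' implementation: the function names (laplacian_embedding, multi_view_matrix, squared_residue) are hypothetical, the residue used is the common additive form (entry minus row mean minus column mean plus overall mean), and the biclustering optimization and the stability-based parameter selection are omitted.

```python
import numpy as np


def laplacian_embedding(S, k):
    """Embed n objects in k dimensions from a symmetric similarity matrix S.

    Assumes S has non-negative entries and strictly positive row sums.
    """
    S = np.asarray(S, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    # Symmetric normalized Laplacian: I - D^{-1/2} S D^{-1/2}
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order
    _, eigvecs = np.linalg.eigh(L)
    # Drop the trivial eigenvector (eigenvalue 0) and rescale by D^{-1/2}
    # to obtain solutions of the generalized problem L y = lambda D y.
    return d_inv_sqrt[:, None] * eigvecs[:, 1:k + 1]


def multi_view_matrix(similarities, k):
    """Stack the k-dimensional embeddings of all views side by side."""
    return np.hstack([laplacian_embedding(S, k) for S in similarities])


def squared_residue(A, rows, cols):
    """Sum of squared additive residues of the bicluster A[rows, cols]."""
    sub = np.asarray(A, dtype=float)[np.ix_(rows, cols)]
    residue = (sub
               - sub.mean(axis=1, keepdims=True)   # row means
               - sub.mean(axis=0, keepdims=True)   # column means
               + sub.mean())                       # overall mean
    return float((residue ** 2).sum())
```

A bicluster of the stacked matrix then selects a subset of objects (rows) together with a subset of embedding coordinates (columns), and therefore of views; a low squared residue indicates that the selected objects behave coherently on the selected coordinates.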

EPrint Type: Conference or Workshop Item (Poster)
Subjects: Learning/Statistics & Optimisation; Theory & Algorithms
ID Code: 5113
Deposited By: Florence d'Alché-Buc
Deposited On: 24 March 2009