Multi-spectral biclustering for data described by multiple similarities.
In computational biology, objects of interest such as proteins or genes can be described from various points of view: as sequences, trees, nodes in a graph, vectors, and so on. Often only similarity matrices are available to represent each of these heterogeneous views. Investigating the relationships among these data is an important step toward understanding biological functions. Existing data mining approaches that deal with heterogeneous data aim to extract objects that are similar across all the views. As the number of datasets increases, it is often not possible to find subsets of objects that are simultaneously similar according to all the views, except in trivial cases. We thus propose an extension of biclustering, called multi-spectral biclustering, that finds subgroups of objects that are similar to each other according to some of the views. The new algorithm is based on multiple low-dimensional embeddings of the data, computed from the Laplacians of graphs weighted by the various similarities, and on a generalization of the squared residue minimization biclustering algorithm. We also propose to select the biclustering parameters using a stability criterion. We have successfully tested multi-spectral biclustering on two biological applications and obtained very good results. The first application concerns two classes of proteins (membrane proteins and ribosomal proteins) described by several data sets (the protein sequences, the hydropathy profiles of the proteins, etc.). The second deals with yeast time-series gene expression measured under several conditions that differ in the strain (wt, mec1, dun1) and the type of stress (MMS, Gamma, mock).
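
As a rough illustration of the two ingredients named in the abstract, the following Python sketch computes a low-dimensional embedding from the Laplacian of one similarity graph and evaluates the mean squared residue of a candidate bicluster. It assumes NumPy, a symmetric non-negative similarity matrix with positive row sums, and the symmetric normalized Laplacian. The function names `spectral_embedding` and `mean_squared_residue`, the toy data, and the choice to skip the first eigenvector are illustrative assumptions, not the authors' implementation. The multi-view generalization and the stability-based parameter selection are not shown.

```python
import numpy as np


def spectral_embedding(S, k):
    """Embed n objects in k dimensions from one similarity matrix S (n x n).

    Assumption: S is symmetric, non-negative, and every row sum is positive.
    """
    S = np.asarray(S, dtype=float)
    degree = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} S D^{-1/2}
    L = np.eye(S.shape[0]) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric matrices
    _, eigvecs = np.linalg.eigh(L)
    # Assumption: drop the first (trivial) eigenvector, keep the next k
    return eigvecs[:, 1:k + 1]


def mean_squared_residue(X, rows, cols):
    """Mean squared residue of the submatrix X[rows, cols].

    Each residue is x_ij - (row mean) - (column mean) + (overall mean),
    so a perfectly additive submatrix scores 0.
    """
    sub = np.asarray(X, dtype=float)[np.ix_(rows, cols)]
    row_mean = sub.mean(axis=1, keepdims=True)
    col_mean = sub.mean(axis=0, keepdims=True)
    residue = sub - row_mean - col_mean + sub.mean()
    return float((residue ** 2).mean())


# Toy similarity matrix (hypothetical data): two groups of three objects,
# strongly similar within a group and weakly similar across groups.
S = np.full((6, 6), 0.05)
S[:3, :3] = 1.0
S[3:, 3:] = 1.0
emb = spectral_embedding(S, k=1)

# Toy additive matrix (hypothetical data): x_ij = a_i + b_j.
X = np.add.outer(np.array([0.0, 1.0, 2.0]), np.array([10.0, 20.0, 30.0, 40.0]))
msr = mean_squared_residue(X, rows=[0, 1, 2], cols=[0, 1, 2, 3])
```

In this toy setting the single embedding coordinate takes one sign on the first group and the opposite sign on the second, and the additive matrix has a residue of zero up to rounding. In a multi-view setting one embedding would be computed per similarity matrix; how the embeddings are then combined and biclustered is specific to the method and is not reproduced here.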