Structural and Relational Data Mining for Systems Biology Applications
Due to the enormous accumulation of experimental data and the increasing need for combining heterogeneous data sources, the field of systems biology yields novel and very interesting problems in data analysis. The development of high-throughput technologies has made it possible to study the behavior of many cellular components simultaneously. Consequently, there is increasing interest in not only understanding the functions of single isolated components, but also revealing the interactions and functional relationships between different components. Often, the outcome of large-scale measurements is conveniently represented in a structured form; prominent examples are protein-protein interaction networks, coexpression networks for genes, and bipartite graphs of associations between experimental conditions and regulated genes. This thesis presents different methods that aim at finding interesting patterns in such data. The main contributions are as follows. First, an exact enumerative approach to dense cluster detection is proposed. Given a weighted interaction network and a default weight for missing edges, the density of a node set is defined as the average pairwise interaction weight. The described method finds all patterns that satisfy a user-defined minimum density threshold. Conceptually, this task is a generalization of clique search; however, the standard techniques for solving that problem are not appropriate for the generalized question. Fortunately, an efficient enumeration strategy can be obtained by adopting the reverse search paradigm. Remarkably, the same algorithmic framework can be applied to discover cluster patterns in other types of structured data, like asymmetric binary relations and multipartite graphs, as well as hypergraphs, n-ary relations, and tensors. Second, our approach integrates additional constraints in order to focus the search on clusters that are relevant for the specific application at hand.
For example, if each node in a network has an annotation profile attached to it, we can identify dense clusters where the nodes share a common subprofile. The principal idea is that the user provides the datasets of interest and defines desired properties of patterns with respect to them, and the method yields all solutions that match these criteria. This makes it possible to explore network data and background information jointly and systematically. Third, we devise dense cluster detection approaches that sacrifice completeness of the solution set in favor of efficiency. Here, two different directions are pursued. On the one hand, we use the search strategy of the enumeration methods and introduce heuristic pruning rules to speed up the procedure. On the other hand, we propose generalizations of agglomerative hierarchical clustering for bipartite data. They detect dense clusters by successive “greedy” merging of instance sets. Consequently, this strategy and the complete enumeration approach can be seen as opposite extremes of dense cluster detection algorithms for structured data. However, both methods are very transparent with respect to the properties of the discovered set of patterns and thereby facilitate the interpretation of results. The presented algorithmic approaches are illustrated with a number of real-world applications in systems biology. They involve multiple types of genomic datasets and relate to different representative organisms, primarily yeast, human, and the plant A. thaliana. One scenario is protein complex prediction from experimental interaction data, with optional constraints from background data; the latter make it possible to discover context-dependent variants of complexes. Another application is the joint analysis of multiple biological networks that describe different kinds of relationships between genes, in our case transcriptional coregulation under different cellular conditions.
Beyond that, we consider the detection of bicluster patterns from gene expression measurements. Finally, we show a small-scale case study on discovering associations between genomic sequence variation and transcription of genes.
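As an illustration of the density criterion described above, the following minimal Python sketch computes the average pairwise interaction weight of a node set and lists all node sets that reach a minimum density threshold. It is only a sketch under stated assumptions: the function names `density` and `dense_sets_bruteforce`, the dictionary-based network representation, and the toy weights are hypothetical, and the exhaustive loop is a naive stand-in for small examples, not the reverse search strategy proposed in the thesis.

```python
from itertools import combinations


def density(nodes, weights, default=0.0):
    """Average pairwise interaction weight of a node set.

    `weights` maps frozenset({u, v}) to an edge weight (an assumed
    representation); pairs without an entry get the `default` weight.
    """
    pairs = list(combinations(sorted(nodes), 2))
    if not pairs:
        raise ValueError("density needs at least two nodes")
    total = sum(weights.get(frozenset(pair), default) for pair in pairs)
    return total / len(pairs)


def dense_sets_bruteforce(all_nodes, weights, threshold, default=0.0):
    """Return every node set of size >= 2 whose density meets the threshold.

    Naive exhaustive enumeration (exponential in the number of nodes),
    shown only to make the problem statement concrete.
    """
    found = []
    ordered = sorted(all_nodes)
    for size in range(2, len(ordered) + 1):
        for subset in combinations(ordered, size):
            if density(subset, weights, default) >= threshold:
                found.append(set(subset))
    return found


# Toy network (made-up weights); missing edges use the default weight 0.0.
weights = {
    frozenset(("a", "b")): 1.0,
    frozenset(("a", "c")): 0.8,
    frozenset(("b", "c")): 0.9,
    frozenset(("c", "d")): 0.2,
}
clusters = dense_sets_bruteforce({"a", "b", "c", "d"}, weights, threshold=0.8)
for cluster in clusters:
    print(sorted(cluster), round(density(cluster, weights), 3))
```

With a threshold of 1.0, weights restricted to the range [0, 1], and a default weight of 0, the criterion reduces to clique search, since a set then qualifies only if every pair is connected with full weight.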