Probabilistic Models in Computational Biology
PhD thesis, University of Cambridge.
Technological advances have led to an explosion in the number of available biological datasets. These include measurements on a genomic scale such as extensive genotype data and the profiling of thousands of gene expression levels in large sample groups. The focus of this thesis is the design and application of probabilistic models to extract meaningful information from these data sources.
First, the focus is on models for understanding the genetic component of gene expression variation. A key insight gained is that genetic association signals can be obscured by unknown confounding influences. Accounting for these hidden effects is shown to increase the number of significant genetic associations found by up to threefold.
Second, sparse factor models that incorporate prior biological information to infer interpretable determinants of gene expression levels are investigated. A novel hybrid inference algorithm is developed to achieve efficient and accurate approximate inference. Interpretable sparse factors are identified in an application to genetic association studies.
Third, methods for modelling differential expression in microarray time series are considered. The proposed model finds temporal patterns of differential gene expression that reveal insights into the regulatory processes involved in the cellular responses to external stimuli.
Fourth, a prediction framework for the thermodynamic stability of four-stranded structures that form from DNA or RNA is proposed and applied to genome-wide candidate structures in humans. The thesis concludes with a model for robust regression of heart-rate recordings given limited sensory information.