PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

A Bayesian framework to account for complex non-genetic factors in gene expression data greatly increases power in eQTL studies
Oliver Stegle, Leopold Parts, Richard Durbin and John Winn
PLoS Computational Biology Volume 6, Number 5, 2010.

Abstract

Gene expression measurements are influenced by a wide range of factors, such as the state of the cell, experimental conditions and variants in the sequence of regulatory regions. To understand the effect of a variable of interest, such as the genotype of a locus, it is important to account for variation that is due to confounding causes. Here, we present VBQTL, a probabilistic approach for mapping expression quantitative trait loci (eQTLs) that jointly models contributions from genotype as well as known and hidden confounding factors. VBQTL is implemented within an efficient and flexible inference framework, making it fast and tractable on large-scale problems. We compare the performance of VBQTL with alternative methods for dealing with confounding variability on eQTL mapping datasets from simulations, yeast, mouse, and human. Employing Bayesian complexity control and joint modelling is shown to result in more precise estimates of the contribution of different confounding factors, resulting in additional associations to measured transcript levels compared to alternative approaches. We present a threefold larger collection of cis eQTLs than previously found in a whole-genome eQTL scan of an outbred human population. Altogether, 27% of the tested probes show a significant genetic association in cis, and we validate that the additional eQTLs are likely to be real by replicating them in different sets of individuals.
Our method is the next step in the analysis of high-dimensional phenotype data, and its application has revealed insights into genetic regulation of gene expression by demonstrating more abundant cis-acting eQTLs in human than previously shown. Our software is freely available online at http://www.sanger.ac.uk/resources/software/peer/.

Introduction

DNA microarray technologies allow for quantification of expression levels of thousands of loci in the genome. These measurements enable exploring how a variable, such as clinical phenotype, tissue type, or genetic background, affects the transcriptional state of the sample. Recently, gene expression levels have been studied as quantitative genetic traits, investigating the effect of genotype as the primary variable. Studies have found and characterised large numbers of expression quantitative trait loci (eQTLs) [1–3], exploring their complexity [2], population genetics [4,5] and associations with disease [6,7]. An important issue in such studies is additional variation in expression data that is not due to the genetic state, as illustrated in Figure 1. Intracellular fluctuations, environmental conditions, and experimental procedures are factors that can all have a strong effect on the measured transcript levels [2,8–10] and thereby obscure the association signal. When these factors are measured, correctly estimating the additional variation they cause allows for a more sensitive analysis of the genetic effect. For example, it has been reported that additional human eQTLs can be found when including the known factors of age and blood cell counts in the model [7]. It is also standard procedure to correct for batch effects, such as image artefacts or sample preparation differences [11]. In practice it is not possible to measure or even be aware of all potential sources of variation, but it is nevertheless important to account for them. Unobserved, hidden factors, such as cell culture conditions [12], often have an influence on large numbers of genes. We and others have proposed methods to detect and correct for such effects [9,13,14]. These studies demonstrated the importance of accounting for hidden factors, yielding a stronger statistical discrimination signal.
The challenge in modelling several confounding sources of variation (Figure 1) is to correctly estimate the contribution that is due to each one of them. Open questions remain about how to ensure that only spurious signal is eliminated by methods that account for hidden factors (see for instance the discussion in [14]), and how to deal with situations in which both known and hidden factors are present. The problem of identifying the correct causes of the signal is even harder in the presence of additional sources of variability.
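The claim that accounting for hidden factors strengthens the association signal can be illustrated with a small simulation. The sketch below is not VBQTL and does not use the PEER software: it substitutes plain principal component removal for the Bayesian joint model, and every name in it (simulate, association_t, remove_top_components), as well as the effect size, sample size and number of hidden factors, is an assumption chosen for illustration only.

```python
import numpy as np


def simulate(n_samples=300, n_genes=100, n_hidden=3, effect=0.6, seed=0):
    """Toy data: hidden factors affect every gene, one SNP affects gene 0."""
    rng = np.random.default_rng(seed)
    # Genotype coded as 0/1/2 copies of the minor allele (assumed coding).
    genotype = rng.integers(0, 3, size=n_samples).astype(float)
    # Hidden confounders, drawn independently of genotype.
    hidden = rng.normal(size=(n_samples, n_hidden))
    # Each hidden factor loads on every gene with magnitude 2 and random sign.
    loadings = rng.choice([-2.0, 2.0], size=(n_hidden, n_genes))
    expression = hidden @ loadings + rng.normal(size=(n_samples, n_genes))
    # Add a cis-like genetic effect to the first gene only.
    expression[:, 0] += effect * genotype
    return genotype, expression


def association_t(genotype, trait):
    """t-statistic of the Pearson correlation between genotype and trait."""
    r = np.corrcoef(genotype, trait)[0, 1]
    n = len(trait)
    return r * np.sqrt((n - 2) / (1.0 - r ** 2))


def remove_top_components(expression, n_components):
    """Subtract the leading principal components (a stand-in for a hidden factor model)."""
    centred = expression - expression.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    low_rank = (u[:, :n_components] * s[:n_components]) @ vt[:n_components]
    return centred - low_rank


genotype, expression = simulate()
t_raw = association_t(genotype, expression[:, 0])
residuals = remove_top_components(expression, n_components=3)
t_corrected = association_t(genotype, residuals[:, 0])
print(f"t-statistic before correction: {t_raw:.2f}")
print(f"t-statistic after correction:  {t_corrected:.2f}")
```

In this toy setting the hidden factors are independent of genotype and affect many genes, so removing them mainly reduces noise. A naive correction of this kind may also remove genuine genetic signal in less favourable settings, which is the concern that joint modelling of genotype and confounders is meant to address.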

EPrint Type: Article
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Learning/Statistics & Optimisation
ID Code: 8070
Deposited By: Oliver Stegle
Deposited On: 17 March 2011