Adjusting for genetic confounders in transcriptome-wide association studies improves discovery of risk genes of complex traits

Causal-TWAS (cTWAS) is a statistical framework that adjusts for genetic confounders in transcriptome-wide association studies. Applying cTWAS to common traits leads to reliable detection of candidate causal genes.

Model of individual-level data

Let y be the quantitative phenotype of an individual, assumed to be standardized. We assume that y depends on the individual's imputed gene expressions and variant genotypes. We denote Xj the expression of gene j, \(\tilde{{X}_{j}}\) its cis-genetic component, and Gm the genotype of variant m. We assume that \({\tilde{X}}_{j}\) is given, imputed from a pretrained expression prediction model, and that imputation errors/uncertainty are ignored. We have the following regression model:

$$y=\mathop{\sum}\limits_{j}{\beta }_{j}\tilde{{X}_{j}}+\mathop{\sum}\limits_{m}{\theta }_{m}{G}_{m}+\epsilon ,$$

(1)

where βj and θm are the effect sizes of gene expression j and variant m, respectively. ϵ is a normally distributed error term, that is, ϵ ∼ N(0, σ2), and is assumed to be independent across individuals. In practice, we standardize both \(\tilde{{X}_{j}}\) and Gm so that the variance equals 1 for all genes and variants.

To obtain the imputed expressions, we use existing expression prediction models. Specifically, the imputed expression of a gene j is defined as ∑lwjlGl, where Gl is the genotype of variant l, and wjl is the weight of the lth variant in gene j’s expression prediction model. We assume that these weights are given at the standardized scale, that is, the weights were derived using standardized variant genotypes. This is the case for the FUSION expression models (http://gusevlab.org/projects/fusion/). When the provided weights are not on the standardized scale, for example, from PredictDB (https://predictdb.org/), these weights must be scaled. This can be done by multiplying the weights by genotype variances from the LD reference.
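As an illustration, the following minimal Python sketch imputes a gene's expression from standardized genotypes and rescales per-allele (PredictDB-style) weights. The function and argument names are hypothetical, and we read the rescaling as multiplication by the genotype standard deviation, that is, the square root of the LD-reference variance.

```python
import numpy as np

def impute_expression(G_std, weights, on_std_scale=True, geno_var=None):
    """Impute the cis-genetic component of expression as a weighted sum of
    standardized genotypes. G_std: (n_samples, n_variants); weights:
    (n_variants,) eQTL weights; geno_var: LD-reference genotype variances,
    needed only when the weights are not on the standardized scale."""
    w = np.asarray(weights, dtype=float)
    if not on_std_scale:
        # Per-allele-count weights (e.g., PredictDB) are converted to the
        # standardized scale by the genotype standard deviation (assumption:
        # square root of the LD-reference variance).
        w = w * np.sqrt(np.asarray(geno_var, dtype=float))
    return G_std @ w
```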

We specify different prior distributions of gene effects βj’s, and variant effects θm’s. To describe these priors, we note that our model is a special case of a more general regression model, where explanatory variables come from multiple groups with different distributions of effect sizes.

We write the general model with K groups of explanatory variables as

$$y=\mathop{\sum }\limits_{k=1}^{K}\,\mathop{\sum}\limits_{j\in {M}_{k}}{\beta }_{j}{X}_{j}+\epsilon ,$$

(2)

where Xj is the jth explanatory variable and j ∈ Mk denotes that it belongs to group k. In our case, the model has two groups of variables, imputed gene expressions and genetic variants. For simplicity of notation, we will use this general model in the following discussion. We assign a spike-and-slab prior distribution to the effect of variable j, with group-specific prior parameters. Specifically, when j ∈ Mk, we denote γj an indicator of whether Xj has a nonzero effect:

$$\begin{array}{rcl}{\gamma }_{j}& \sim &\,{{\mbox{Bernoulli}}}\,({\pi }_{k})\\ {\beta }_{j}| {\gamma }_{j}=1& \sim &N(0,{\sigma }_{k}^{2})\\ {\beta }_{j}| {\gamma }_{j}=0& \sim &{\delta }_{0}.\end{array}$$

(3)

Here δ0 is the Dirac delta function, πk = P(γj = 1 | j ∈ Mk) is the prior probability that the jth variable from group k is causal for the trait (nonzero effect) and \({\sigma }_{k}^{2}\) is the prior variance of the effect sizes of causal variables in group k.
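For concreteness, a minimal sketch of drawing effect sizes from the spike-and-slab prior of Eq. (3); all names are illustrative.

```python
import numpy as np

def sample_spike_slab(n_vars, pi_k, sigma2_k, rng=np.random.default_rng(0)):
    """gamma_j ~ Bernoulli(pi_k); beta_j ~ N(0, sigma2_k) if gamma_j = 1,
    and beta_j = 0 otherwise (the point mass delta_0)."""
    gamma = rng.random(n_vars) < pi_k                       # causal indicators
    beta = np.where(gamma, rng.normal(0.0, np.sqrt(sigma2_k), n_vars), 0.0)
    return gamma, beta
```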

Inference of the individual-level model

The inference has two main steps. In the first step, we estimate the prior parameters \(\bf{\theta} =\{{\pi }_{k},{\sigma }_{k}^{2},k\in \{1,2\}\}\) for the two groups, gene effects and variant effects. In the second step, we use the estimated θ to compute the posterior inclusion probability (PIP) of each variable, defined as the posterior probability of γj = 1 given all the data and parameters.

The parameter estimation is done by maximum likelihood. Let yn×1 be the data of the response variable, where n is the sample size. Let Xn×p = [X1X2…Xp] be the data of all the p explanatory variables. The likelihood of our model is given by

$$L(\bf{\theta} ;\bf{X},\bf{y},\sigma )=P(\bf{y}| \bf{X},\bf{\theta} ,\sigma )=\mathop{\sum}\limits_{{\bf{\Gamma}}}P(\bf{y}| \bf{X},\bf{\Gamma},\theta ,\sigma )P(\bf{\Gamma }| \bf{\theta} ),$$

(4)

where Γ = [γ1, γ2, …, γp] represents the ‘configuration’ of the causal (nonzero effect) status of all variables. We note that σ is the standard deviation of the error term, and is assumed to be given (see below). To maximize the likelihood, we use the expectation-maximization (EM) algorithm. In the E-step, we obtain the expectation of the log-likelihood over Γ, \({{\mathbb{E}}}_{{{{\bf{\Gamma }}}}}\log P({{{\bf{X}}}},{{{\bf{y}}}},{{{\bf{\Gamma }}}}| \bf{\theta }^{(t)},\sigma )\), where θ(t) is the parameter value in the t-th iteration. In the M-step, we update θ(t) using the following rules to maximize the expectation from the E-step (Supplementary Note):

$${\pi }_{k}^{(t+1)}=\frac{1}{| \bf{M}_{k}| }\mathop{\sum}\limits_{j\in \bf{M}_{k}}{\alpha }_{j}^{(t)}$$

(5)

$${\sigma }_{k}^{2,(t+1)}=\frac{{\sum }_{j\in \bf{M}_{k}}{\alpha }_{j}^{(t)}\cdot {\tau }_{j}^{2,(t)}}{{\sum }_{j\in \bf{M}_{k}}{\alpha }_{j}^{(t)}},$$

(6)

where |Mk| is the number of variables in group k, \({\alpha }_{j}^{(t)}=P({\gamma }_{j}=1| \bf{X},\bf{y},{\bf{\theta} }^{(t)},\sigma )\) is the PIP of variable j given the data and current parameter values θ(t), and \({\tau }_{j}^{2,(t)}={\mathbb{E}}({\,\beta }_{j}^{2}| {\gamma }_{j}=1,\bf{X},\bf{y},{\bf{\theta}}^{(t)},\sigma )\) is the second moment of the posterior effect size of variable j, given that it is a causal variable. The update rules have simple interpretations: the new parameter \({\pi }_{k}^{(t+1)}\) is simply the average PIP of all variables in group k, and the new \({\sigma }_{k}^{2,(t+1)}\) is the weighted average of the second moments of the posterior effect sizes.
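A minimal sketch of the M-step updates in Eqs. (5) and (6), assuming the E-step has already produced the PIPs αj and second moments τj²; names are illustrative.

```python
import numpy as np

def m_step_updates(alpha, tau2, groups, k):
    """Eqs. (5) and (6) for group k. alpha: (p,) PIPs; tau2: (p,) posterior
    second moments E(beta_j^2 | gamma_j = 1, data); groups: (p,) array of
    group labels."""
    in_k = groups == k
    pi_new = alpha[in_k].mean()                                          # Eq. (5)
    sigma2_new = np.sum(alpha[in_k] * tau2[in_k]) / np.sum(alpha[in_k])  # Eq. (6)
    return pi_new, sigma2_new
```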

Computing αj and \({\tau }_{j}^{2}\) at the t-th iteration (we drop the superscript t from now on for simplicity) using all variables in the genome is computationally challenging. To reduce the computational burden, we divide the genome into LD blocks using LDetect51, with variants approximately independent between blocks. We assign a gene to an LD block if all SNPs in its expression prediction model fall into that block. If the variants of the prediction model of any gene span multiple LD blocks, we merge all such blocks into a new block. We then compute αj and \({\tau }_{j}^{2}\) of the variables in each block independently, while still using all variables in the genome to update the parameters via Eqs. (5) and (6).
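The block assignment and merging step can be sketched as follows for a single chromosome, with blocks as sorted, non-overlapping intervals; the real pipeline operates genome-wide on LDetect blocks.

```python
def assign_and_merge_blocks(blocks, gene_snp_positions):
    """blocks: sorted list of (start, end) intervals; gene_snp_positions:
    dict mapping each gene to the positions of its prediction-model SNPs.
    If a gene's SNPs span several blocks, the spanned range is merged."""
    blocks = sorted(blocks)
    for snps in gene_snp_positions.values():
        hit = sorted({i for i, (s, e) in enumerate(blocks)
                      if any(s <= pos < e for pos in snps)})
        if len(hit) > 1:
            lo, hi = hit[0], hit[-1]
            merged = (blocks[lo][0], blocks[hi][1])   # one block covering all
            blocks = blocks[:lo] + [merged] + blocks[hi + 1:]
    return blocks
```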

Even within a single block, there may still be hundreds to thousands of variables. This makes it difficult to compute αj and \({\tau }_{j}^{2}\), as it requires marginalization over Γ. To address this challenge, we first notice that our problem is now reduced to a standard fine-mapping, or Bayesian variable selection, problem, with different prior distributions for the effects of different variables. Therefore, we borrow from the fine-mapping literature to compute αj and \({\tau }_{j}^{2}\) (refs. 19,20; see Supplementary Note for details).

After we estimate the prior parameters, we apply SuSiE19, a fine-mapping method, to all variables, including both genes and variants, in each block. Note that all blocks, including the large blocks pruned in the parameter estimation step, will be analyzed. In applying SuSiE, we set the prior probability and prior effect variance of each variable using the estimated parameters of the group (genes or variants) that the variable belongs to. We allow multiple causal variables by setting L = 5 in SuSiE and assigning the null weight as 1 − ∑jpj. SuSiE then returns the PIPs of all genes and variants in each LD block.
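Schematically, the per-variable priors passed to SuSiE are assembled from the estimated group-level parameters as below (the actual call goes through the susieR interface; all names are hypothetical).

```python
import numpy as np

def per_variable_priors(groups, pi_hat, sigma2_hat):
    """Expand group-level estimates into per-variable prior inclusion
    probabilities p_j and prior variances, plus the null weight
    1 - sum_j p_j placed on 'no causal variable'."""
    prior_pi = np.array([pi_hat[g] for g in groups])
    prior_var = np.array([sigma2_hat[g] for g in groups])
    null_weight = 1.0 - prior_pi.sum()
    return prior_pi, prior_var, null_weight
```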

Model of summary statistics

The summary data include the effect size estimates of variants, \({\hat{\beta }_{j}}\), and their standard errors sj, as well as the LD between all pairs of variants, denoted as the matrix R. The effect sizes can be standardized, denoted as \({\hat{z}}_{j}={\hat{\beta }}_{j}/{s}_{j}\). Given that the summary data contain only variant information, our model first needs to expand the summary data to include gene information. Specifically, we compute the marginal association of each imputed gene with the GWAS trait, and the correlation of any gene with all other genes and all the variants. These calculations are described in the Supplementary Note. Once computed, we have the marginal associations of all variables, including genes and variants, \(\hat{{{{\bf{z}}}}}\), and their correlation matrix R. These data are the input of our analysis.
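Although the exact construction is given in the Supplementary Note, the conventional TWAS computation of the gene-level quantities from variant-level summary statistics looks like this minimal sketch (treat it as the standard construction, not necessarily the paper's exact one).

```python
import numpy as np

def gene_level_stats(w, z_snp, R_snp):
    """w: standardized-scale prediction weights; z_snp: variant z-scores;
    R_snp: variant LD matrix. Returns the gene's marginal z-score and its
    correlations with the variants."""
    var_gene = w @ R_snp @ w                       # variance of imputed expression
    z_gene = (w @ z_snp) / np.sqrt(var_gene)       # gene-trait z-score
    r_gene_snp = (R_snp @ w) / np.sqrt(var_gene)   # gene-variant correlations
    return z_gene, r_gene_snp
```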

Following the literature20,52, and particularly, the summary statistics version of SuSiE (SuSiE-RSS)20, we have the following model of \(\hat{\bf{z}}\):

$$\hat{\bf{z}}| \bf{z},\bf{R} \sim \bf{N}_{p}(\bf{R}\bf{z},\bf{R}),$$

(7)

where z = (z1, z2, …, zp) denotes the ‘standardized’ true effect sizes. We use the same spike-and-slab prior for zj: when variable j belongs to group k,

$$\begin{array}{rcl}{\gamma }_{j}& \sim &\,{{\mbox{Bernoulli}}}\,({\pi }_{k})\\ {z}_{j}| {\gamma }_{j}=1& \sim &N\left(0,{\sigma }_{k}^{2}\right)\\ {z}_{j}| {\gamma }_{j}=0& \sim &{\delta }_{0}.\end{array}$$

(8)

Again, we denote \(\bf{\theta} =\{{\pi }_{k},{\sigma }_{k}^{2}\}\) the prior parameters and Γ the causal configuration. We estimate the prior parameters θ by maximum likelihood. This can be done with the same algorithm used for the individual-level model. Specifically, following SuSiE-RSS20, the likelihood function under the individual-level data can be rewritten in terms of the sufficient statistics XTX, XTy, yTy and n. Then, if we make the following substitutions, the likelihood of the individual-level model is identical to that of the summary statistics model: we change β = (β1, β2, …, βp) to z, XTX to R, XTy to \(\hat{\bf{z}}\), yTy to 1 and n to 1. Also, the prior model of z in the summary statistics model is the same as the prior model of β in the individual-level model. Therefore, we can use the same EM algorithm and update rules to estimate θ. The update rules follow Eqs. (5) and (6), where the PIP of variable j is now defined as \({\alpha }_{j}^{(t)}=P({\gamma }_{j}=1| \hat{\bf{z}},\bf{R},\bf{\theta }^{(t)})\) and the second moment of the posterior effect is \({\tau }_{j}^{2,(t)}={\mathbb{E}}({z}_{j}^{2}| {\gamma }_{j}=1,\hat{\bf{z}},{{{{\bf{R}}}}},\bf{\theta }^{(t)})\).

Once the parameters are estimated, we follow the same procedure as in the individual-level model to obtain the PIPs of all variables, except that SuSiE-RSS is used for fine-mapping.

Estimating proportions of phenotypic variance explained by variants and genes

We assume that all the explanatory variables and the response variable in the regression model are standardized, with variance equal to 1. Then the proportion of variance explained (PVE) by a single variable j is simply \({\beta }_{j}^{2}\cdot {{{\rm{Var}}}}(\bf{X}_{j})/{{{\rm{Var}}}}(\bf{y})={\beta }_{j}^{2}\). Assuming that we use the z scores in the summary statistics model, the effect size is related to the z score by \({\beta }_{j}={z}_{j}/\sqrt{n}\), where n is the sample size. So, on average, the PVE of a causal variable in group k (variant or gene) is \({\mathbb{E}}({z}_{j}^{2})/n={\sigma }_{k}^{2}/n\), where \({\sigma }_{k}^{2}\) is the prior variance of effect sizes in group k on the z-score scale. The expected number of variables with nonzero effects in group k is πk|Mk|, where πk is the prior inclusion probability and |Mk| is the group size. Putting this together, the PVE of group k is given by

$${{{{\rm{PVE}}}}}_{k}={\sigma }_{k}^{2}\cdot {\pi }_{k}\cdot | \bf{M}_{k}| \cdot {n}^{-1}.$$

(9)

This equation is used to compute PVE from estimated parameters using both simulated and real data.
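A worked example of Eq. (9), with purely illustrative numbers:

```python
# Suppose the gene group has |M_k| = 8,000 genes, estimated prior inclusion
# pi_k = 0.015, prior effect variance sigma2_k = 20 on the z-score scale,
# and the GWAS sample size is n = 300,000 (all numbers illustrative).
n, M_k, pi_k, sigma2_k = 300_000, 8_000, 0.015, 20.0
pve_k = sigma2_k * pi_k * M_k / n
print(pve_k)  # 0.008: genes would explain ~0.8% of phenotypic variance
```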

Simulation procedure

In our simulations, we used the following data: (1) Genotype data. We used genotype data from the UK Biobank, randomly selecting 80,000 samples. We then filtered samples to keep only ‘White British’ individuals, removing samples with missing information, mismatches between self-reported and genetic sex, or ‘outliers’ as defined by the UK Biobank. We also removed any individuals with close relatives in the cohort. This resulted in a cohort of n = 45,087 samples. We used SNPs from chromosome (chr) 1 to chr 22 and selected those with a minor allele frequency of >0.05 and a genotyping rate of at least 95%. After filtering, 6,228,664 SNPs remained and were used in our analysis. (2) Gene expression prediction models. We used the GTEx v7 adipose tissue dataset, which contains 8,021 genes with expression models. We used the LASSO weights from the FUSION website (http://gusevlab.org/projects/fusion/).

We first impute gene expression for all samples using the prediction models. SNP genotypes are harmonized between the expression prediction models and UK Biobank genotypes so that the reference and alternate alleles match. SNPs in the FUSION prediction models but not in the UK Biobank, about 13% of the total, were not used in imputing gene expression. We then sample the causal genes and SNPs under given prior inclusion probabilities πk and sample their effect sizes using the prior variance parameters \({\sigma }_{k}^{2}\). We then simulate y under the model defined in Eq. (1). The prior parameters \({\pi }_{k},{\sigma }_{k}^{2}\) were chosen to reflect the genetic architecture of real data. In particular, it has been estimated that gene expression mediates about 10–20% of trait heritability25, and studies using rare variants for complex traits suggest that about 5% of protein-coding genes are likely causal53. Given these considerations, we set the prior probability for SNPs to 10−4 or 2.5 × 10−4, and the PVE of SNPs to 0.3 or 0.5. For genes, we set the prior probability to 0.015 or 0.05, and the PVE of genes ranged from 0.02 to 0.1.
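A minimal sketch of this simulation step under Eq. (1), with hypothetical array shapes (the real simulation uses genome-wide UK Biobank genotypes and imputed expression):

```python
import numpy as np

def simulate_trait(X_imp, G, pi_gene, s2_gene, pi_snp, s2_snp,
                   rng=np.random.default_rng(1)):
    """X_imp: standardized imputed expressions; G: standardized genotypes.
    Sample causal variables from the prior inclusion probabilities, their
    effects from N(0, sigma^2_k), and add noise so that Var(y) is ~1."""
    def effects(p, pi, s2):
        causal = rng.random(p) < pi
        return np.where(causal, rng.normal(0.0, np.sqrt(s2), p), 0.0)

    g = X_imp @ effects(X_imp.shape[1], pi_gene, s2_gene) \
        + G @ effects(G.shape[1], pi_snp, s2_snp)
    return g + rng.normal(0.0, np.sqrt(max(1.0 - g.var(), 0.0)), len(g))
```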

To test whether our method is robust to mis-specified priors for causal gene effects, we also simulated causal gene effects under a mixture of normal distributions. We used an equal mixture of four normal distributions, each with mean 0 and with standard deviations with ratios of 1:2:4:8. That is, for gene j, its prior effect size distribution follows \({\beta }_{j} \sim {\sum }_{\omega \in \{1,2,4,8\}}{\pi }^{{\prime} }N(0,\omega {\sigma }^{{\prime} 2})+(1-4{\pi }^{{\prime} }){\delta }_{0}\). The prior probability of being a causal gene is, therefore, \(4{\pi }^{{\prime} }\) and the causal effect size variance is \(15{\sigma }^{{\prime} 2}\). The prior probability of being a causal gene and the PVE of genes were set to the values described above.
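A minimal sketch of drawing gene effects from this four-component mixture (following the formula above, with ω scaling the variance; names are illustrative):

```python
import numpy as np

def sample_mixture_effects(n_genes, pi_prime, s2_prime,
                           rng=np.random.default_rng(2)):
    """Each gene is causal with probability 4 * pi_prime; causal effects are
    drawn from an equal mixture of N(0, omega * s2_prime), omega in
    {1, 2, 4, 8}."""
    omega = np.array([1.0, 2.0, 4.0, 8.0])
    beta = np.zeros(n_genes)
    causal = rng.random(n_genes) < 4 * pi_prime
    comp = rng.integers(0, 4, size=causal.sum())     # equal mixture weights
    beta[causal] = rng.normal(0.0, np.sqrt(omega[comp] * s2_prime))
    return beta
```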

To run cTWAS, we performed association tests of individual SNPs with the trait y to obtain SNP summary statistics \({\hat{\bf{z}}}_{\rm{SNP}}\). We randomly selected 2,000 samples from the cohort to calculate the SNP genotype correlation (LD) matrix \({\hat{\bf{R}}}_{{{{\rm{SNP}}}}}\). We then ran the summary statistics version of cTWAS under each simulation setting with \({\hat{\bf{z}}}_{\rm{SNP}}\), \({\hat{\bf{R}}}_{{{{\rm{SNP}}}}}\) and the expression prediction models as input. The software harmonizes SNP genotypes across \({\hat{\bf{z}}}_{\rm{SNP}}\), \({\hat{\bf{R}}}_{{{{\rm{SNP}}}}}\) and the expression prediction models so that the reference and alternate alleles match. To further reduce the computational burden of parameter estimation, we used only one in every ten SNPs (SNP thinning) in the EM algorithm. When calculating PIPs, we first ran SuSiE-RSS with L = 5 in each LD block with the thinned SNPs. For each block with a maximum gene PIP > 0.8, we reran SuSiE-RSS with L = 5 with the original SNPs to obtain the final gene PIPs.

GWAS summary statistics

The LDL and SBP summary statistics were from the UK Biobank, computed by the Rapid GWAS project31 using Hail54. These summary statistics were downloaded from the IEU OpenGWAS project55 using GWAS IDs ‘ukb-d-30780_irnt’ (LDL) and ‘ukb-a-360’ (SBP). Both LDL and SBP summary statistics were based on the White British subpopulation of the UK Biobank, with sample sizes of n = 343,621 and n = 317,754, respectively. The IBD summary statistics were from the International IBD Genetics Consortium56, computed by meta-analysis using METAL57. These summary statistics were obtained from IEU OpenGWAS using GWAS ID ‘ebi-a-GCST004131’. IBD includes cases of both Crohn’s disease and ulcerative colitis. The IBD summary statistics were based on nonoverlapping samples of European ancestry with a combined sample size of n = 59,957. The SCZ summary statistics were from the Psychiatric Genetics Consortium and the CardiffCOGS study58, computed by meta-analysis using METAL57. These summary statistics were obtained from the authors via the link provided in the manuscript. The SCZ summary statistics were based on nonoverlapping samples of primarily European ancestry with a combined sample size of n = 105,318.

LD reference data

We computed the LD reference panel of common biallelic variants using the White British subpopulation of the UK Biobank. This panel is an in-sample reference for GWAS summary statistics from the Rapid GWAS project31. First, we used plate and well information from the genotyping to unambiguously identify over 99% (357,654 of 361,194) of the samples used in the Rapid GWAS project in our data. To ease computation, we randomly selected 10% of these samples to serve as the LD reference panel59. We also limited our panel to common autosomal variants with MAF > 0.01 in the Rapid GWAS analyses. Then, we computed correlations between all pairs of variants within each of 1,700 approximately independent regions. These regions are assumed to have low LD between them and are based on previously identified regions51 that could be lifted over from hg19 to hg38 positions. The final LD reference panel consists of 1,700 correlation matrices and contains 9,309,375 variants. This LD reference panel was used when analyzing all traits, including those that were not measured in the White British subpopulation of the UK Biobank.

Harmonization of GWAS data and expression prediction models to LD reference

We restricted our analyses to variants that were non-missing in the GWAS summary statistics, expression prediction models and LD reference panel. To ensure consistency between these three datasets, we performed two harmonization procedures. The objective of harmonization is to ensure that the reference and alternate alleles of each variant are defined consistently across all three datasets20. In our case, we must harmonize both the GWAS z scores and the eQTL prediction models to our LD reference, and we use a different harmonization procedure for each. These procedures are based in part on previous work33. To describe the two procedures, it is necessary to define several cases of inconsistencies that can occur in either dataset. The first case is a variant with its reference and alternate alleles ‘flipped’ with respect to the LD reference. The GWAS z scores or eQTL weights of flipped variants should have their signs reversed to be consistent with the LD reference. The second case is a variant that has had its strand ‘switched’ with respect to the LD reference (for example, the variant is G/A in the LD reference but C/T in the other dataset). In this case, the reference and alternate alleles are the same, just named using different strands. The z scores or weights of switched variants should not be changed, as they are already consistent with the LD reference. The third case is a variant that is ‘ambiguous’ as to whether it is flipped or switched. This occurs when the two alleles of a variant are complementary base pairs (A ↔ T or G ↔ C substitutions). For example, consider a variant that is A/T in the LD reference but T/A in the other dataset. It is unclear whether this variant is flipped or switched with respect to the LD reference (both result in T/A), so it is ambiguous whether the signs of the z scores or weights should be reversed. We say that variants are ‘unambiguous’ when they do not involve substitutions of complementary base pairs.
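The three cases can be made concrete with a small helper that classifies one variant's alleles against the LD reference (an illustrative sketch, not the package's actual implementation).

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def classify_variant(ref_ld, alt_ld, ref_other, alt_other):
    """Returns 'ambiguous' (complementary allele pair), 'consistent',
    'flipped' (reverse the z-score/weight sign), 'switched' (strand
    renamed; no action) or 'mismatch'."""
    if alt_ld == COMPLEMENT[ref_ld]:
        return "ambiguous"                  # A/T or G/C: cannot resolve here
    if (ref_other, alt_other) == (ref_ld, alt_ld):
        return "consistent"
    if (ref_other, alt_other) == (alt_ld, ref_ld):
        return "flipped"
    if (ref_other, alt_other) == (COMPLEMENT[ref_ld], COMPLEMENT[alt_ld]):
        return "switched"
    if (ref_other, alt_other) == (COMPLEMENT[alt_ld], COMPLEMENT[ref_ld]):
        return "flipped"                    # strand-switched and flipped
    return "mismatch"
```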

To harmonize the z scores from GWAS summary statistics, we first identified all inconsistencies in reference and alternate alleles between the z scores and the LD reference. Next, we resolved all unambiguous cases of flipped and switched alleles, reversing the sign of z scores that were flipped and taking no action for switched alleles. Then, we imputed the z scores for ambiguous variants using all unambiguous variants in each of the LD regions60. If the sign of the imputed z score did not match the sign of the observed z score, we used the sign of the imputed z score, reversing the sign of the observed z score. Note that we did not perform the procedure to resolve ambiguous variants when analyzing LDL or SBP, as both the summary statistics and LD reference panel are derived from UK Biobank data.
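A sketch of the sign-resolution step for ambiguous variants, using a standard LD-based z-score imputation (the ridge regularization level is an assumption, not taken from the paper):

```python
import numpy as np

def resolve_ambiguous_z(z_obs, R_au, R_uu, z_u, ridge=0.01):
    """z_obs: observed z-scores of ambiguous variants; R_au: their LD with
    the unambiguous variants; R_uu, z_u: LD and z-scores of the unambiguous
    variants. Keeps the observed magnitude but adopts the imputed sign."""
    p = R_uu.shape[0]
    z_imp = R_au @ np.linalg.solve(R_uu + ridge * np.eye(p), z_u)
    flip = np.sign(z_imp) != np.sign(z_obs)
    return np.where(flip, -z_obs, z_obs)
```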

To harmonize the eQTL prediction models, we first identified all inconsistencies in reference and alternate alleles between the prediction models and the LD reference. Next, we resolved all unambiguous cases of flipped and switched alleles, reversing the sign of weights that were flipped and taking no action for switched alleles. These steps to resolve unambiguous variants are the same as in the z-score harmonization procedure. To resolve ambiguous variants, we leveraged correlations between ambiguous and unambiguous variants in both our LD reference panel and the LD panel used to construct the PredictDB models. PredictDB reports the covariance between pairs of variants within each gene prediction model. For gene prediction models that include both ambiguous and unambiguous variants, we computed the sum of correlations between each ambiguous variant and the unambiguous variants in the prediction model, using both our LD reference panel and the LD panel used for the prediction models. If the sign of the total correlation in the prediction models’ LD panel did not match the sign of the total correlation in our LD reference panel, we reversed the sign of the prediction model weights for the ambiguous variant. If the total correlation in our LD reference panel was equal to zero, we set the weight of the ambiguous variant to zero, as these ambiguous variants did not have any unambiguous variants in the same LD region. For gene prediction models that include only a single ambiguous variant and no unambiguous variants, we left the sign of the prediction model weight unchanged; the resulting gene z score may have an incorrect sign, but the magnitude of the z score will be correct. We excluded gene prediction models with multiple ambiguous variants and no unambiguous variants, as their gene z scores could be incorrect in both sign and magnitude. Such exclusions were infrequent, affecting less than 1% of liver genes in the LDL analysis (94 of 11,502 genes with prediction models).
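The sign rule for a single ambiguous variant in a prediction model can be sketched as follows (illustrative helper mirroring the rules above):

```python
import numpy as np

def resolve_ambiguous_weight(w_amb, r_ref, r_model):
    """r_ref / r_model: correlations between the ambiguous variant and the
    model's unambiguous variants in our LD reference and in the prediction
    model's LD panel, respectively."""
    if np.sum(r_ref) == 0:       # no unambiguous partners in the LD region
        return 0.0
    if np.sign(np.sum(r_model)) != np.sign(np.sum(r_ref)):
        return -w_amb            # alleles appear flipped: reverse the weight
    return w_amb
```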

Performing cTWAS analysis in real data

We used the following cTWAS settings when analyzing real data. For parameter estimation, we used the default procedure for selecting the starting values of the EM algorithm. We then performed 30 iterations of the EM algorithm assuming L = 1 effect (at most a single causal effect) in each region, using variants that were thinned to 10% (one in ten) to reduce computation. For computing the PIPs of genes and variants, we used the thinned variants and assumed L = 5 (at most five causal effects) in each region. For regions with a maximum gene PIP > 0.8, we recomputed PIPs using all variants, with L = 5. For this final step, we allowed a maximum of 20,000 variants per region to reduce computation; if this number was exceeded, we randomly selected 20,000 variants to include. Unless specified otherwise, we used the threshold PIP > 0.8 for declaring significant genes.

Evaluating methods in distinguishing silver standard and bystander genes for LDL

Following previous studies35, we assessed the performance of TWAS and cTWAS on real data by comparing their ability to distinguish LDL silver standard genes from other nearby genes. We defined a set of ‘bystander’ genes that were within 1 Mb of a silver standard gene; these bystander genes were considered the negative set. We limited our analysis to the 46 of 69 silver standard genes with imputed expression after harmonization, and the 539 imputed bystander genes near these genes. Next, we determined whether these silver standard and bystander genes were significant by TWAS (Bonferroni) or cTWAS (PIP > 0.8). Then, we computed the precision of each method as (number of detected silver standard genes)/(number of detected silver standard genes + number of detected bystander genes).
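The precision computation, stated as code for clarity (the numbers in the comment are illustrative, not results from the paper):

```python
def precision(n_silver_detected, n_bystander_detected):
    """Detected silver standard genes over all detected genes."""
    return n_silver_detected / (n_silver_detected + n_bystander_detected)

# For example, detecting 10 silver standard and 2 bystander genes gives
# precision 10 / 12 ~= 0.83.
```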

Classifying TWAS false-positive genes for LDL by source of confounding

To better understand how TWAS generates false-positive findings, we classified TWAS false positives as primarily due to confounding by variants or confounding by genes. We defined TWAS false positives as genes that were significant by TWAS (Bonferroni) but not significant by cTWAS. We assigned each false-positive gene to a cTWAS credible set if the gene was included in that set. If a false-positive gene was not included in any credible set but was highly correlated (r > 0.5) with at least one variant or gene in a credible set, that false-positive gene was also assigned to that credible set. After assigning a total of 83 false-positive genes to credible sets, for each assigned gene we summed the PIPs of all other genes and variants in its credible set to obtain total PIPs for confounding genes and variants. If the total gene PIP was higher than that of the variants, we classified the gene as confounded by genes; otherwise, as confounded by variants.
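A minimal sketch of the classification rule, assuming each credible set is represented as a list of (variable, kind, PIP) tuples (a hypothetical data layout):

```python
def classify_confounding(credible_set, fp_gene):
    """Sum the PIPs of the other genes and of the variants in the credible
    set assigned to a TWAS false-positive gene; the larger total names the
    likely source of confounding."""
    gene_pip = sum(p for v, kind, p in credible_set
                   if kind == "gene" and v != fp_gene)
    variant_pip = sum(p for v, kind, p in credible_set if kind == "variant")
    return "confounded by genes" if gene_pip > variant_pip else "confounded by variants"
```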

Summarizing cTWAS results using tissue groups

To aid the interpretation of cTWAS findings, we grouped related tissues into ‘tissue groups’ and summarized the findings within these groups. We used previously defined tissue groups that assigned 37 of 49 tissues to one of 7 tissue groups25. We then took the union of genes detected at PIP > 0.8 in any tissue within each tissue group, and we used these combined lists of detected genes for downstream analyses.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
