Performing MR-SimSS

library(mr.simss)
library(dplyr)

This vignette demonstrates how this R package, mr.simss, can be used to obtain MR causal effect estimates void of Winner’s Curse bias using merely two sets of GWAS summary statistics, one for exposure and the other for outcome. We first create a toy data set and subsequently, illustrate how MR Simulated Sample Splitting (MR-SimSS) can be implemented using the mr_simss function. Various parameters of the mr_simss function are discussed in detail, while application to real data is also shown at the end of this vignette.

Creating a toy data set

In order to demonstrate how mr_simss can be applied, we first create a toy data set consisting of exposure and outcome GWAS summary statistics. We must specify certain important parameters such as the number of variants in our data set (n_snps=10^6), the exposure-outcome correlation (cor_xy=0.5), the causal effect of the exposure on the outcome (beta_xy=0.3), the sample size (n_x=n_y=200000) and the fraction of overlapping samples between the two studies (frac_overlap=0):

n_snps <- 10^6 
prop_effect <- 0.012 # proportion of total SNPs that have effect on exposure
h2 <- 0.5 # heritability of the exposure, i.e. proportion of variance explained in exposure by SNPs
n_x <- 200000; n_y <- 200000 
frac_overlap <- 0 # assume two GWASs have been performed with non-overlapping samples
cor_xy <- 0.5
beta_xy <- 0.3

The following code shows how a set of GWAS summary statistics, i.e. estimated variant-exposure associations, estimated variant-outcome associations and their corresponding standard errors \(\left( \widehat{\gamma_j}, \widehat{\sigma_{X_j}}, \widehat{\Gamma_j}, \widehat{\sigma_{Y_j}} \right)\) for each genetic variant \(j = 1, \dots, n\), can be simulated:

set.seed(1025) 
n_overlap <- frac_overlap*min(n_x, n_y) 
maf <- runif(n_snps, 0.01, 0.5) # minor allele frequencies 
effect_snps <- n_snps*prop_effect
index <- sample(1:n_snps, ceiling(effect_snps), replace=FALSE) # random sampling effect variants
beta_gx <- rep(0,n_snps)
beta_gx[index] <- rnorm(length(index),0,1) # simulating true effects 
var_x <- sum(2*maf*(1-maf)*beta_gx^2)/h2
if(var_x != 0){beta_gx <- beta_gx/sqrt(var_x)} # scaling to represent exposure with variance 1
beta_gy <- beta_gx * beta_xy
var_gx <- 1/(n_x*2*maf*(1-maf)) # exposure variance = 1
var_gy <- 1/(n_y*2*maf*(1-maf)) # outcome variance = 1
cov_gx_gy <- ((n_overlap*cor_xy)/(n_x*n_y))*(1/(2*maf*(1-maf)))
## Covariance matrix for each SNP:
cov_array <- array(dim=c(2, 2, n_snps))
cov_array[1,1,] <- var_gx
cov_array[2,1,] <- cov_gx_gy
cov_array[1,2,] <- cov_array[2,1,]
cov_array[2,2,] <- var_gy

## Simulating variant-exposure and variant-outcome association estimates
summary_stats <- apply(cov_array, 3, function(x){MASS::mvrnorm(n=1, mu=c(0,0), Sigma=x)})
beta.exposure <- summary_stats[1,] + beta_gx
beta.outcome <- summary_stats[2,] + beta_gy

## GWAS summary statistics:
data <- tibble(
    SNP = 1:n_snps,
    beta.exposure = beta.exposure,
    beta.outcome = beta.outcome,
    se.exposure = sqrt(var_gx),
    se.outcome = sqrt(var_gy)
  )

head(data)
#> # A tibble: 6 × 5
#>     SNP beta.exposure beta.outcome se.exposure se.outcome
#>   <int>         <dbl>        <dbl>       <dbl>      <dbl>
#> 1     1     -0.00468      0.00605      0.00354    0.00354
#> 2     2      0.000141     0.00223      0.00453    0.00453
#> 3     3      0.00286     -0.00884      0.00373    0.00373
#> 4     4     -0.00163      0.000722     0.00417    0.00417
#> 5     5      0.000325     0.00786      0.00730    0.00730
#> 6     6     -0.00510      0.00188      0.00324    0.00324

Implementing MR-SimSS with `mr_simss`

This package’s main function, mr_simss, requires a data frame, in a form similar to the one above, with columns SNP, beta.exposure, beta.outcome, se.exposure and se.outcome. Note that all columns of the data frame must contain numerical values and each row must represent a unique genetic variant, identified by SNP.

## Implementing mr_simss with "mr_raps" and default parameters
simss.est <- mr_simss(data,mr_method="mr_raps",threshold=5e-8,n.iter=1000,splits=3,pi=0.5,pi2=0.5,                    
                  subset=TRUE,sub.cut=0.05,
                  est.lambda=TRUE,n.exposure=1,n.outcome=1,n.overlap=1,cor.xy=0,lambda.thresh=0.5)

simss.est$summary
#>    method    nsnp niter         b          se pval
#> 1 mr_raps 251.007  1000 0.2935707 0.007639278    0
head(simss.est$results)
#>    method nsnp         b         se pval
#> 1 mr_raps  265 0.2890714 0.01751133    0
#> 2 mr_raps  245 0.2977485 0.01776773    0
#> 3 mr_raps  247 0.3005120 0.01788805    0
#> 4 mr_raps  243 0.3039359 0.01769095    0
#> 5 mr_raps  263 0.3126508 0.01735312    0
#> 6 mr_raps  259 0.2979125 0.01743181    0

As shown above, mr_simss returns a list containing two elements, summary and results:

summary is a data frame with one row which outputs b, the estimated causal effect of exposure on outcome obtained using the MR-SimSS framework, as well as se, the associated standard error of this estimate and pval, corresponding p-value. It also contains the MR method used, the average number of instrument SNPs used in each iteration and the number of iterations used. Thus, in this case we see that mr_raps has been used and the sample splitting process has been repeated 1000 times. On average, ~251.007 instrument variants are selected at each iteration and the exposure-outcome causal effect is estimated to be 0.2936 with a standard error of 0.0076.
results is a data frame which contains the output from each iteration. It is in a similar style as the output from using the function mr from the TwoSampleMR R package. The first six rows of this data frame have been outputted above.

Despite default settings existing for many of the important parameters in mr_simss, it is important that the user is aware of these parameters:

mr_method: This is a string specifying the MR method which the user wishes to be implemented within the MR-SimSS framework. Despite the user being able to choose many of the methods available in the TwoSampleMR package, we advise to use either "mr_raps" or "mr_ivw", with "mr_raps" being the preferable of the two due to its robustness to weak instruments.
threshold: This is the \(p\)-value significance threshold which is applied in the selection step of MR-SimSS, i.e. used to select instrument variants in the \(\pi\)-fraction. The default setting is the common genome-wide significance threshold of 5e-8. However, in low-power settings where very few variants are deemed significant at this threshold, we encourage users to consider adopting a less stringent threshold, such as 5e-4, while ensuring mr_method="mr_raps". This action will yield greater stability in the causal effect estimate.
n.iter: This specifies the number of iterations of the method, i.e. the number of times sample splits are randomly simulated. The default setting is n.iter=1000, which has been chosen to guarantee stability and reduce variability in the final MR estimate.
splits: This parameter takes values 2 or 3, indicating whether MR-SimSS should take a 2-split or 3-split approach (see ‘MR-SimSS: The algorithm’ for further details). The default setting is splits=3 as the 3-split approach is advised in all cases where sample overlap is potentially non-zero. When it is known for certain that the two GWAS samples are non-overlapping, then splits=2 can be used.
pi: This is the fraction of the first split in both the 2 and 3 split approaches, i.e. the fraction used for instrument selection. The default setting is pi=0.5.
pi2: This is the fraction of the second split in the 3 split approach. Similarly, the default setting is pi2=0.5.

Comparing MR-SimSS with classical Two-Sample MR methods

Below, we provide a brief illustration of the difference in using the classical inverse variance weighted (IVW) estimator on its own versus within the MR-SimSS framework:

## SimSS-IVW
res.simss.ivw <- mr_simss(data,mr_method="mr_ivw")$summary
res.simss.ivw
#>   method    nsnp niter         b          se          pval
#> 1 mr_ivw 250.887  1000 0.2771938 0.007798459 1.003681e-276

## Classical IVW
data_sig <- data %>% dplyr::filter(2*(stats::pnorm(abs(data$beta.exposure/data$se.exposure), lower.tail=FALSE)) < 5e-8)
res.ivw <- summary(stats::lm(data_sig$beta.outcome ~ -1 + data_sig$beta.exposure, weights = 1/data_sig$se.outcome^2))
res.ivw <- data.frame(method="mr_ivw", nsnp=nrow(data_sig), b=res.ivw$coef[1,1], se=res.ivw$coef[1,2]/min(1,res.ivw$sigma))
res.ivw$pval <- 2*stats::pnorm(abs(res.ivw$b/res.ivw$se), lower.tail=FALSE)
res.ivw
#>   method nsnp         b          se pval
#> 1 mr_ivw  916 0.2727558 0.004862308    0

A similar comparison is repeated for MR-RAPS:

## SimSS-RAPS
res.simss.raps <- mr_simss(data,mr_method="mr_raps")$summary
res.simss.raps
#>    method   nsnp niter         b          se         pval
#> 1 mr_raps 250.92  1000 0.2943309 0.008132926 8.73199e-287

## Classical MR-RAPS
data_sig <- data %>% dplyr::filter(2*(stats::pnorm(abs(data$beta.exposure/data$se.exposure), lower.tail=FALSE)) < 5e-8)
res.raps <- mr.raps::mr.raps(data_sig$beta.exposure,data_sig$beta.outcome,data_sig$se.exposure,data_sig$se.outcome)
res.raps <- data.frame(method="mr_raps", nsnp=nrow(data_sig), b=res.raps$beta.hat, se=res.raps$beta.se, pval=res.raps$beta.p.value)
res.raps
#>    method nsnp         b          se pval
#> 1 mr_raps  916 0.2783524 0.004982773    0

The following plot summarizes the results obtained above:

In our simulated data set, we have frac_overlap = 0 and so, bias due to weak instruments and Winner’s Curse will both be in the direction of the null (see ‘Winner’s Curse and weak instrument bias in MR’ for detailed explanation). This bias is illustrated clearly in the plot above as both classical approaches, IVW and MR-RAPS, are seen to provide causal effect estimates less than the true value of 0.3: 0.2728 and 0.2784, respectively. Unfortunately, downward bias is still evident with SimSS-IVW as it yields a causal effect estimate of 0.2772. This is observed due to the nature of sample splitting, i.e. it is possible that additional weak instrument bias is introduced as the effective sample sizes used for estimation are reduced. Therefore, we recommend incorporating a two-sample MR method that is robust to weak instruments, such as MR-RAPS, into the MR-SimSS framework. The benefit of doing so is depicted in the plot as SimSS-RAPS clearly performs the best out of all four approaches, providing a causal effect estimate of 0.2943. Bias is reduced to -0.0057, a 73.61% improvement on the classical MR-RAPS implementation. Furthermore, the true causal effect of 0.3 is now included in the 95% confidence interval.

Additional `mr_simss` parameters

\(\star\) In the vignette ‘MR-SimSS: The algorithm’, we briefly mention a computationally efficient feature of our approach which involves establishing a subset of genetic variants. This subset is constructed to contain variants that are considered more likely to pass the selection threshold or to be chosen as instruments in the main component of the MR-SimSS algorithm. In theory, this subsetting process ensures that for a single iteration of MR-SimSS, the number of genetic instruments if the full set of variants were included and the number of instruments if merely this subset is used will be equal with at least 95% probability.

The subset parameter in mr_simss function specifies whether the above process should be executed or not. Naturally, the default setting is subset=TRUE.
The sub.cut parameter, which defaults to sub.cut=0.05, specifies the probability mentioned above, i.e. it ensures that the two sets of genetic instruments will be equal in number with probability at least 1-sub.cut.

\(\star\) In ‘MR-SimSS: The algorithm’, it is also mentioned that in addition to variant-exposure and variant-outcome association estimates and corresponding standard errors, the main MR-SimSS algorithm requires a value for the correlation parameter \(\lambda = \frac{N_{\text{overlap}}\rho}{\sqrt{N_X N_Y}}\). Because of this, this R package also contains a supplementary function, est_lambda, which allows users to obtain an unbiased estimate for \(\lambda\), the correlation between the variant-outcome and variant-exposure effect sizes, using a conditional log-likelihood approach. It is recommended to use this function by specifying est_lambda=TRUE in mr_simss when the fraction of sample overlap and the correlation between exposure and outcome are unknown.

Note: For est_lambda to obtain an estimate for \(\lambda\), it is important that the set of supplied GWAS summary statistics, i.e. data in mr_simss, contain information for a large set of genome-wide variants, and not just those considered significant or relevant with respect to the exposure of choice.

The est_lambda parameter in mr_simss function specifies whether \(\lambda\) should be estimated using the aforementioned conditional log-likelihood approach contained in the est_lambda function. The default setting is est_lambda=TRUE.
The lambda.thresh parameter, which defaults to lambda.thresh=0.5, is used by est_lambda function to obtain a subset of variants which have absolute z-statistics for both exposure and outcome GWASs less than this value, e.g. 0.5. The estimation process then assumes that both of the true variant-outcome and variant-exposure effect sizes of each variant in this subset are approximately 0.

If the user doesn’t wish for the est_lambda function to be used, then values for the number of samples in both exposure and outcome GWASs (n.exposure,n.outcome), the number of overlapping samples between the two studies (n.overlap) and the correlation between the exposure and the outcome (cor.xy) must be supplied to mr_simss.

Runtime: Despite MR-SimSS being a simulation approach, an important aspect of our work has focused on making the mr_simss function as computationally efficient as possible while retaining accurate effect estimation. SimSS-RAPS above took 9.36 seconds to run.

Performing `mr.simss` with real data

Same-trait empirical analysis with `TwoSampleMR` package

For illustrative purposes, we conduct a same-trait MR analysis and estimate the effect of high-density lipoprotein cholesterol (HDL) on itself using two independent HDL GWAS datasets, one from UK Biobank and the other from the Global Lipids Genetics Consortium. The sample sizes for the two studies are \(N_\text{UKBB} = 403,943\) and \(N_\text{GLGC} = 94,595\), respectively. In this setting, the true causal effect is known to be 1, thus providing a useful benchmark for us. We demonstrate below how this MR analysis can be conducted by employing the mr_simss R package together with TwoSampleMR and other related R packages.

Note: As previously mentioned, mr.simss requires a genome-wide set of approximately independent variants to effectively estimate the correlation parameter, \(\lambda\). For this reason, it cannot be implemented as straightforwardly as other MR methods contained in the TwoSampleMR R package, i.e. by simply applying it to a clumped set of genome-wide significant variants. To overcome this issue, we obtained a genome-wide set of independent variants via LD pruning using the PLINK 2.0 command indep-pairwise 50 5 0.5. This set of variants can be downloaded from here, as shown below.

The following code demonstrates our workflow for obtaining required data files of GWAS summary statistics and constructing a suitable data frame that mr_simss can be applied to.

Workflow:

First, raw VCF files for both exposure (UKBB HDL) and outcome (GLGC HDL) are downloaded from https://opengwas.io/datasets/ieu-b-109 and https://opengwas.io/datasets/ebi-a-GCST002223, respectively, and stored in data folder.

## Load required packages:
library(TwoSampleMR)
library(gwasglue)
library(gwasvcf)
library(ieugwasr)

## HDL 1 - ieu-b-109 (UK Biobank n=403,943)
HDL1.raw <- gwasvcf::query_gwas("data/ieu-b-109.vcf.gz",pval=1) 
HDL1.EXP <- gwasglue::gwasvcf_to_TwoSampleMR(HDL1.raw, type="exposure")

## HDL 2 - ebi-a-GCST002223 (Global Lipids Genetics Consortium n=94,595)
HDL2.raw <- gwasvcf::query_gwas("data/ebi-a-GCST002223.vcf.gz",pval=1)
HDL2.OUT <- gwasglue::gwasvcf_to_TwoSampleMR(HDL2.raw, type="outcome")

## Download set of pruned variants
pruned_SNPs <- read.table('https://raw.githubusercontent.com/amandaforde/winnerscurse_MR/refs/heads/main/pruned_SNPs.txt',header=TRUE)

HDL.EXP.pruned <- HDL1.EXP[HDL1.EXP$SNP %in% pruned_SNPs$V1,]
HDL.OUT.pruned <- HDL2.OUT[HDL2.OUT$SNP %in% pruned_SNPs$V1,]

## Harmonise - ensure variant-exposure effect and variant-outcome effect correspond to same allele
MR_data <- TwoSampleMR::harmonise_data(HDL.EXP.pruned, HDL.OUT.pruned)
MR_data <- MR_data[MR_data$mr_keep == TRUE,]

## Construct reduced data frame suitable for mr_simss implemenation
simss_data <- data.frame(SNP=MR_data$SNP, beta.exposure=MR_data$beta.exposure, se.exposure=MR_data$se.exposure,beta.outcome=MR_data$beta.outcome,se.outcome=MR_data$se.outcome)

## SimSS-RAPS (2-split - zero sample overlap)
SimSS3_RAPS <- mr.simss::mr_simss(data=simss_data, mr_method="mr_raps", splits=2) 
SimSS3_RAPS$summary
#>    method    nsnp niter         b          se pval
#> 1 mr_raps 808.816  1000 0.9848507 0.006163889    0

## Classical two-sample MR methods
MR_select <- MR_data[MR_data$pval.exposure < 5e-8,] # genome-wide significance threshold
TwoSampleMR::mr(MR_select,method_list=c("mr_ivw","mr_weighted_median","mr_weighted_mode","mr_egger_regression"))[,5:9]
#> Analysing 'uVQjLP' on 'cQZoF5'
#>                      method nsnp         b          se pval
#> 1 Inverse variance weighted 1730 0.9545393 0.006298892    0
#> 2           Weighted median 1730 0.9484012 0.011251559    0
#> 3             Weighted mode 1730 0.9556311 0.015565350    0
#> 4                  MR Egger 1730 1.0173895 0.010274205    0

## Classical MR-RAPS
res_RAPS <- mr.raps::mr.raps(MR_select$beta.exposure, MR_select$beta.outcome, MR_select$se.exposure, MR_select$se.outcome)
data.frame(method=c("MR RAPS"),nsnp=c(nrow(MR_select)),b=c(res_RAPS$beta.hat),se=c(res_RAPS$beta.se),pval=c(res_RAPS$beta.p.value))
#>    method nsnp         b          se pval
#> 1 MR RAPS 1730 0.9642682 0.005409247    0

In the outputted results, we see that out of all six methods, SimSS-RAPS provides a causal effect estimate closest to the true value of 1 (0.9849). Furthermore, its standard error of 0.0062 is of comparable magnitude to that of both the classical IVW and MR-RAPS approaches.

Creating a toy data set

Implementing MR-SimSS with mr_simss

Comparing MR-SimSS with classical Two-Sample MR methods

Additional mr_simss parameters

Performing mr.simss with real data

Same-trait empirical analysis with TwoSampleMR package

Implementing MR-SimSS with `mr_simss`

Additional `mr_simss` parameters

Performing `mr.simss` with real data

Same-trait empirical analysis with `TwoSampleMR` package