We generate individual-level gene expression levels and phenotypes based on 503 non-overlapping Europeans from 1000 Genomes (1KG) phase 3 (https://www.cog-genomics.org/plink/2.0/resources). We simulate a risk region including 950 cis-SNPs and 5 candidate genes. Each gene contains 100 cis-SNPs, including 2 eQTLs explaining 20% variation of gene expression levels. Only the first gene is causal to the trait of interest, explaining 50% of total heritability, which is set to 0.2. The remaining heritability is explained by 5 non-mediated causal SNPs.
We calculate the summary statistics (z-scores) for each cis-SNP in the risk region based on simulated phenotype and in-sample genotype data. For this example, we used the in-sample genotype to replace the reference panel.
str(data)
#> List of 14
#> $ snpidx :List of 5
#> ..$ : int [1:100] 111 112 113 114 115 116 117 118 119 120 ...
#> ..$ : int [1:100] 263 264 265 266 267 268 269 270 271 272 ...
#> ..$ : int [1:100] 436 437 438 439 440 441 442 443 444 445 ...
#> ..$ : int [1:100] 605 606 607 608 609 610 611 612 613 614 ...
#> ..$ : int [1:100] 740 741 742 743 744 745 746 747 748 749 ...
#> $ inde :List of 5
#> ..$ : int [1:2] 92 87
#> ..$ : int [1:2] 33 81
#> ..$ : int [1:2] 97 96
#> ..$ : int [1:2] 60 68
#> ..$ : int [1:2] 13 38
#> $ indp : int [1:5] 817 617 118 759 276
#> $ causal_gene : num [1:5] 1 0 0 0 0
#> $ ye :List of 5
#> ..$ : num [1:503] 0.735 -1.674 -0.598 -1.228 1.226 ...
#> ..$ : num [1:503] 0.837 0.222 0.176 0.363 1.132 ...
#> ..$ : num [1:503] -0.193 0.4 1.468 -1.147 0.451 ...
#> ..$ : num [1:503] -2.22 1.367 -0.133 0.671 0.437 ...
#> ..$ : num [1:503] -0.4886 0.0358 -0.4622 1.3442 0.6238 ...
#> $ yep : num [1:503, 1:5] 0.144 1.476 0.196 -0.431 1.396 ...
#> $ yep_true :List of 5
#> ..$ : num [1:503] -0.0551 -0.0551 0.2254 -0.3732 0.5435 ...
#> ..$ : num [1:503] -0.0161 -0.0161 0.2909 -0.5622 -0.0161 ...
#> ..$ : num [1:503] 0.294 0.294 0.294 0.294 -0.225 ...
#> ..$ : num [1:503] 0.078 0.855 0.078 0.821 0.112 ...
#> ..$ : num [1:503] 0.072 0.072 0.072 0.0755 0.072 ...
#> $ y : num [1:503] -1.0655 -0.0226 -0.4479 -0.8993 0.9927 ...
#> $ summary_stats:'data.frame': 950 obs. of 5 variables:
#> ..$ snp: int [1:950] 58544 58545 58546 58547 58548 58549 58550 58551 58552 58553 ...
#> ..$ b : num [1:950] -0.0294 0.0559 -0.0311 -0.0311 -0.0368 ...
#> ..$ se : num [1:950] 0.0464 0.0464 0.0464 0.0464 0.0464 ...
#> ..$ z : num [1:950, 1] -0.658 1.253 -0.696 -0.696 -0.825 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:950] "rs12144655" "rs4656206" "rs991628" "rs7535510" ...
#> .. .. ..$ : NULL
#> ..$ N : num [1:950] 503 503 503 503 503 503 503 503 503 503 ...
#> $ Xe :List of 5
#> ..$ : int [1:100, 1:503] 1 2 2 0 0 2 0 2 0 1 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:100] "rs2151331" "rs12032644" "rs1333141" "rs7512094" ...
#> .. .. ..$ : chr [1:503] "HG00096" "HG00097" "HG00099" "HG00100" ...
#> ..$ : int [1:100, 1:503] 0 0 0 1 1 1 1 1 1 1 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:100] "rs10919527" "rs10919532" "rs10800572" "rs11810972" ...
#> .. .. ..$ : chr [1:503] "HG00096" "HG00097" "HG00099" "HG00100" ...
#> ..$ : int [1:100, 1:503] 0 0 0 0 0 1 0 1 0 1 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:100] "rs12091865" "rs2020860" "rs7556029" "rs7518693" ...
#> .. .. ..$ : chr [1:503] "HG00096" "HG00097" "HG00099" "HG00100" ...
#> ..$ : int [1:100, 1:503] 1 0 1 0 0 0 1 0 0 0 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:100] "rs6670505" "rs10798416" "rs12407816" "rs12741782" ...
#> .. .. ..$ : chr [1:503] "HG00096" "HG00097" "HG00099" "HG00100" ...
#> ..$ : int [1:100, 1:503] 0 0 2 0 0 0 0 0 0 0 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> .. .. ..$ : chr [1:100] "rs2298914" "rs10913576" "rs7556644" "rs10913578" ...
#> .. .. ..$ : chr [1:503] "HG00096" "HG00097" "HG00099" "HG00100" ...
#> $ Xp : int [1:950, 1:503] 0 0 2 2 0 0 0 2 0 0 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:950] "rs12144655" "rs4656206" "rs991628" "rs7535510" ...
#> .. ..$ : chr [1:503] "HG00096" "HG00097" "HG00099" "HG00100" ...
#> $ inde_t : num [1:10] 58745 58740 58838 58886 59075 ...
#> $ coeffe :List of 5
#> ..$ : num [1:2] -0.179 -0.395
#> ..$ : num [1:2] -0.214 0.156
#> ..$ : num [1:2] -0.165 -0.171
#> ..$ : num [1:2] 0.5531 -0.0205
#> ..$ : num [1:2] -0.17189 0.00245
#> $ coeffp : num [1:2, 1:5] 817 -0.0254 617 0.3078 118 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:2] "indp" ""
#> .. ..$ : NULL
We run TWASKnockoff with the following code:
set.seed(100)
# number of genes
k = length(data$ye)
# indices of cis-variants for each candidate gene in the risk region
snpidx = data$snpidx
# gene expression levels for each gene in the eQTL study
ye = data$ye
# cis-genotype matrices for each gene in the eQTL study
Xe = data$Xe
# summary statistics for cis-SNPs in the risk region
summarystat = data$summary_stats
# cis-genotype matrix from reference panel for the risk region
Xp = data$Xp
# predicted gene expression levels based on the eQTL model in the reference panel
yep_true <- data$yep_true
correlation = 'improved'
ts = 'marginal'
result = TwasKnockoff(snpidx = snpidx, ye = ye, Xe = Xe,
summarystat = summarystat, Xp = Xp,
removemethod = 'lasso',
correlation = correlation, nrep = 10, ts = ts, yep_true = yep_true)
#> [1] "starting iteration 2"
#> [1] "starting iteration 3"
#> [1] "starting iteration 4"
#> [1] "starting iteration 5"
#> [1] "starting iteration 6"
#> [1] "starting iteration 7"
#> [1] "starting iteration 8"
#> [1] "starting iteration 9"
#> [1] "starting iteration 10"
#> [1] 6.634298 1.709169 -0.409658 2.326145 -1.729925
#> [1] "lambda0.1"
#> [1] "s: 0.292770749989659" "s: 0.629306473123514" "s: 0.408619782539088"
#> [4] "s: 0.238543036716141" "s: 0.863653162335963"
print(result$GK.filter$kappa[1:k])
#> [1] 0 1 1 1 0
print(result$GK.filter$tau[1:k])
#> [1] 20.2166355 0.4486120 0.5636451 1.7250568 2.8037662
print(result$GK.filter$q[1:k])
#> [1] 0.3333333 1.0000000 1.0000000 1.0000000 1.0000000
We obtain the knockoff statistics defined in the GhostKnockoff manuscript, as well as the q-values for all genetic elements. We focus on the first elements, corresponding to the result of genes. The first gene (the true causal gene) has a large value and a small q-value.