Skip to contents

We generate individual-level gene expression levels and phenotypes based on 503 non-overlapping Europeans from 1000 Genomes (1KG) phase 3 (https://www.cog-genomics.org/plink/2.0/resources). We simulate a risk region including 950 cis-SNPs and 5 candidate genes. Each gene contains 100 cis-SNPs, including 2 eQTLs explaining 20% variation of gene expression levels. Only the first gene is causal to the trait of interest, explaining 50% of total heritability, which is set to 0.2. The remaining heritability is explained by 5 non-mediated causal SNPs.

We calculate the summary statistics (z-scores) for each cis-SNP in the risk region based on simulated phenotype and in-sample genotype data. For this example, we used the in-sample genotype to replace the reference panel.

str(data)
#> List of 14
#>  $ snpidx       :List of 5
#>   ..$ : int [1:100] 111 112 113 114 115 116 117 118 119 120 ...
#>   ..$ : int [1:100] 263 264 265 266 267 268 269 270 271 272 ...
#>   ..$ : int [1:100] 436 437 438 439 440 441 442 443 444 445 ...
#>   ..$ : int [1:100] 605 606 607 608 609 610 611 612 613 614 ...
#>   ..$ : int [1:100] 740 741 742 743 744 745 746 747 748 749 ...
#>  $ inde         :List of 5
#>   ..$ : int [1:2] 92 87
#>   ..$ : int [1:2] 33 81
#>   ..$ : int [1:2] 97 96
#>   ..$ : int [1:2] 60 68
#>   ..$ : int [1:2] 13 38
#>  $ indp         : int [1:5] 817 617 118 759 276
#>  $ causal_gene  : num [1:5] 1 0 0 0 0
#>  $ ye           :List of 5
#>   ..$ : num [1:503] 0.735 -1.674 -0.598 -1.228 1.226 ...
#>   ..$ : num [1:503] 0.837 0.222 0.176 0.363 1.132 ...
#>   ..$ : num [1:503] -0.193 0.4 1.468 -1.147 0.451 ...
#>   ..$ : num [1:503] -2.22 1.367 -0.133 0.671 0.437 ...
#>   ..$ : num [1:503] -0.4886 0.0358 -0.4622 1.3442 0.6238 ...
#>  $ yep          : num [1:503, 1:5] 0.144 1.476 0.196 -0.431 1.396 ...
#>  $ yep_true     :List of 5
#>   ..$ : num [1:503] -0.0551 -0.0551 0.2254 -0.3732 0.5435 ...
#>   ..$ : num [1:503] -0.0161 -0.0161 0.2909 -0.5622 -0.0161 ...
#>   ..$ : num [1:503] 0.294 0.294 0.294 0.294 -0.225 ...
#>   ..$ : num [1:503] 0.078 0.855 0.078 0.821 0.112 ...
#>   ..$ : num [1:503] 0.072 0.072 0.072 0.0755 0.072 ...
#>  $ y            : num [1:503] -1.0655 -0.0226 -0.4479 -0.8993 0.9927 ...
#>  $ summary_stats:'data.frame':   950 obs. of  5 variables:
#>   ..$ snp: int [1:950] 58544 58545 58546 58547 58548 58549 58550 58551 58552 58553 ...
#>   ..$ b  : num [1:950] -0.0294 0.0559 -0.0311 -0.0311 -0.0368 ...
#>   ..$ se : num [1:950] 0.0464 0.0464 0.0464 0.0464 0.0464 ...
#>   ..$ z  : num [1:950, 1] -0.658 1.253 -0.696 -0.696 -0.825 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:950] "rs12144655" "rs4656206" "rs991628" "rs7535510" ...
#>   .. .. ..$ : NULL
#>   ..$ N  : num [1:950] 503 503 503 503 503 503 503 503 503 503 ...
#>  $ Xe           :List of 5
#>   ..$ : int [1:100, 1:503] 1 2 2 0 0 2 0 2 0 1 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:100] "rs2151331" "rs12032644" "rs1333141" "rs7512094" ...
#>   .. .. ..$ : chr [1:503] "HG00096" "HG00097" "HG00099" "HG00100" ...
#>   ..$ : int [1:100, 1:503] 0 0 0 1 1 1 1 1 1 1 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:100] "rs10919527" "rs10919532" "rs10800572" "rs11810972" ...
#>   .. .. ..$ : chr [1:503] "HG00096" "HG00097" "HG00099" "HG00100" ...
#>   ..$ : int [1:100, 1:503] 0 0 0 0 0 1 0 1 0 1 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:100] "rs12091865" "rs2020860" "rs7556029" "rs7518693" ...
#>   .. .. ..$ : chr [1:503] "HG00096" "HG00097" "HG00099" "HG00100" ...
#>   ..$ : int [1:100, 1:503] 1 0 1 0 0 0 1 0 0 0 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:100] "rs6670505" "rs10798416" "rs12407816" "rs12741782" ...
#>   .. .. ..$ : chr [1:503] "HG00096" "HG00097" "HG00099" "HG00100" ...
#>   ..$ : int [1:100, 1:503] 0 0 2 0 0 0 0 0 0 0 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr [1:100] "rs2298914" "rs10913576" "rs7556644" "rs10913578" ...
#>   .. .. ..$ : chr [1:503] "HG00096" "HG00097" "HG00099" "HG00100" ...
#>  $ Xp           : int [1:950, 1:503] 0 0 2 2 0 0 0 2 0 0 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:950] "rs12144655" "rs4656206" "rs991628" "rs7535510" ...
#>   .. ..$ : chr [1:503] "HG00096" "HG00097" "HG00099" "HG00100" ...
#>  $ inde_t       : num [1:10] 58745 58740 58838 58886 59075 ...
#>  $ coeffe       :List of 5
#>   ..$ : num [1:2] -0.179 -0.395
#>   ..$ : num [1:2] -0.214 0.156
#>   ..$ : num [1:2] -0.165 -0.171
#>   ..$ : num [1:2] 0.5531 -0.0205
#>   ..$ : num [1:2] -0.17189 0.00245
#>  $ coeffp       : num [1:2, 1:5] 817 -0.0254 617 0.3078 118 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:2] "indp" ""
#>   .. ..$ : NULL

We run TWASKnockoff with the following code:

set.seed(100)

# number of genes
k = length(data$ye)
# indices of cis-variants for each candidate gene in the risk region
snpidx = data$snpidx 
# gene expression levels for each gene in the eQTL study
ye = data$ye 
# cis-genotype matrices for each gene in the eQTL study
Xe = data$Xe 
# summary statistics for cis-SNPs in the risk region
summarystat = data$summary_stats 
# cis-genotype matrix from reference panel for the risk region
Xp = data$Xp 
# predicted gene expression levels based on the eQTL model in the reference panel
yep_true <- data$yep_true 


correlation = 'improved'
ts = 'marginal'

result = TwasKnockoff(snpidx = snpidx, ye = ye, Xe = Xe, 
             summarystat = summarystat, Xp = Xp, 
             removemethod = 'lasso', 
             correlation = correlation, nrep = 10, ts = ts, yep_true = yep_true)
#> [1] "starting iteration 2"
#> [1] "starting iteration 3"
#> [1] "starting iteration 4"
#> [1] "starting iteration 5"
#> [1] "starting iteration 6"
#> [1] "starting iteration 7"
#> [1] "starting iteration 8"
#> [1] "starting iteration 9"
#> [1] "starting iteration 10"
#> [1]  6.634298  1.709169 -0.409658  2.326145 -1.729925
#> [1] "lambda0.1"
#> [1] "s: 0.292770749989659" "s: 0.629306473123514" "s: 0.408619782539088"
#> [4] "s: 0.238543036716141" "s: 0.863653162335963"

print(result$GK.filter$kappa[1:k])
#> [1] 0 1 1 1 0
print(result$GK.filter$tau[1:k])
#> [1] 20.2166355  0.4486120  0.5636451  1.7250568  2.8037662
print(result$GK.filter$q[1:k])
#> [1] 0.3333333 1.0000000 1.0000000 1.0000000 1.0000000

We obtain the knockoff statistics defined in the GhostKnockoff manuscript, as well as the q-values for all genetic elements. We focus on the first kk elements, corresponding to the result of kk genes. The first gene (the true causal gene) has a large tautau value and a small q-value.