This tutorial demonstrates how to use the SEAGLE
package for chromosome-wide gene-based studies when the genotype data are from GWAS studies (.bed + .bim + .fam). We recommend using PLINK1.9 available from https://www.cog-genomics.org/plink/1.9/ to generate a .raw file for each of the SNP sets. For more information on .raw files, please refer to https://www.cog-genomics.org/plink/2.0/formats#raw. The .raw file records the allelic dosage for each SNP.
Examples files containing GWAS data (.bed, .bim, and .fam files) for all the genes in chromosome 22 and a corresponding gene list in glist-hg19
can be downloaded via the SEAGLE-example4-data.zip
file from http://jocelynchi.com/SEAGLE/SEAGLE-example4-data.zip.
To follow along with the PLINK conversion procedures, you will first need to install PLINK1.9 from https://www.cog-genomics.org/plink/1.9/ and unzip the example SEAGLE-example4-data.zip
file. Afterwards, you can type the PLINK commands below in a command prompt or terminal window in the folder where these files are located. This will produce a .raw file for each gene that you can read into R to obtain a relevant genetic marker matrix \({\bf G}\) that you can input into SEAGLE
.
After unzipping the SEAGLE-example4-data.zip
file, you will find two empty directories named plink_bash
and R_codes
. The plink_bash
directory is where we will write the PLINK1.9 commands that you will run to create the .raw files for each gene. The R_codes
directory is where we will write the R scripts for loading the resulting .raw files into R.
We will begin by telling R the location where you unzipped the SEAGLE-example4-data.zip
file. Then we will read in the gene list in glist-hg19
and subset for the genes in chromosome 22. If you unzipped the file to your Downloads directory on a Mac, your path might look like the following.
dir <- "/Users/user_name/Downloads/SEAGLE-example4-data/"
glist <- read.table(paste0(dir,"glist-hg19"), header = F)
glist_chr22 <- subset(glist, glist$V1 == 22)
Next, we will remove any duplicated genes for chromosome 22 in our gene list.
dup <- glist_chr22$V4[duplicated(glist_chr22$V4)] # find duplicated gene
dup
Notice that in this example, there are 4 duplicated genes given by the following.
# GSTT2: chr22:24,322,339-24,326,106(GRCh37/hg19 by Ensembl)
# GSTT2B: chr22:24,299,601-24,303,373(GRCh37/hg19 by Ensembl)
# RIMBP3B: chr22:21,738,040-21,743,458(GRCh37/hg19 by Entrez Gene)
# RIMBP3C: chr22:21,899,646-21,905,750(GRCh37/hg19 by Ensembl)
We will update our gene list by manually updating the positions of the duplicated genes.
glist_chr22 <- glist_chr22[!duplicated(glist_chr22$V4),]
glist_chr22[which(glist_chr22$V4 == "GSTT2"),2:3] <- c(24322339, 24326106)
glist_chr22[which(glist_chr22$V4 == "GSTT2B"),2:3] <- c(24299601,24303373)
glist_chr22[which(glist_chr22$V4 == "RIMBP3B"),2:3] <- c(21738040,21743458)
glist_chr22[which(glist_chr22$V4 == "RIMBP3C"),2:3] <- c(21899646,21905750)
write.table(glist_chr22, "glist-hg19_chr22_updated", col.names = F, row.names = F, quote = F)
# replace hyphen with underscore
glist_chr22$V4 <- sub("-", "_", glist_chr22$V4)
Now we will use PLINK1.9 to create a .raw file for each gene in chromosome 22, which includes all SNPs for a given gene. The following code will produce a sequence of bash files to produce .raw files for each gene in glist_chr22
. It will also produce a corresponding sequence of R code snippets for reaching each .raw file into R. Each bash file will run PLINK1.9 for num_genes=50
genes at a time.
# of plink commands per job
num_genes <- 50
bash_plink <- NULL # bash command
command <- c("#! /bin/bash", "#SBATCH --job-name=plink") # plink command
codeR <- NULL # R code
for (i in 1:nrow(glist_chr22)) {
command <- c(command, paste0("plink --bfile 1kg_phase1_chr22 --make-set
glist-hg19_chr22_updated --gene ", glist_chr22$V4[i]," --out ",
glist_chr22$V4[i]," --export A") )
# write R codes for reading .raw files, one R file for one gene
codeR <- paste0(glist_chr22$V4[i], " <- read.delim(\"", glist_chr22$V4[i],".raw\", sep=\" \")")
fileConn<-file(paste0(dir,"R_codes/", glist_chr22$V4[i],".R"))
writeLines(codeR, fileConn)
close(fileConn)
if(i %% num_genes == 0 | i==nrow(glist_chr22)){
# write plink commands in bash file
fileConn<-file(paste0(dir,"plink_bash/plink_job",i-num_genes+1,"-",i,".sh"))
writeLines(command, fileConn)
close(fileConn)
command <- c("#! /bin/bash", "#SBATCH --job-name=plink")
bash_plink <- c(bash_plink, paste0("sbatch plink_job",i-num_genes+1,"-",i,".sh"))
}
}
# Write bash commands to a text file
fileConn<-file(paste0("bash_plink.txt"))
writeLines(bash_plink, fileConn)
close(fileConn)
Note that if you want to run the .sh files in the plink_bash
directory on your local computer, you can employ the bash
command instead of SBATCH
in terminal or command prompt. Additionally, if you are using Linux, you may need to specify plink1.9
rather than just plink
.
Running the above code will setup the code and files to produce a single .raw file for each gene when running the commands in the bash_plink.txt
file. The .raw file for each gene will contain all the SNPs for that gene.
Now that we have data for each gene in the .raw format, we can load the .raw files sequentially into R for input into SEAGLE
as follows. We’ll first load the SEAGLE
package.
library(SEAGLE)
The following code shows how to test for the GxE interaction effect using SEAGLE
on these genes. To illustrate how to use SEAGLE, we will generate synthetic phenotype and covariate data for \(n=1092\) study participants.
As an example, we’ve included .raw files for the following subset of genes from chromosome 22 in the extdata
folder of this package: ACR, APOBEC3A, APOBEC3C, ARSA, ATF4, ATP5L2, BCRP2, BMS1P17, and BMS1P18. We’ve also included a corresponding gene list in glist-hg19_chr22_example
. In practice, you will want to specify your own dir
for the directory where you’ve stored your .raw files.
# Read in gene list
<- "../inst/extdata/"
dir <- read.delim(paste0(dir, "glist-hg19_chr22_example"), sep=" ", header=FALSE)
genelist
# Identify number of genes in genelist
<- dim(genelist)[1]
num_genes
# Generate synthetic phenotype and covariate data
<- 1092 # number of study participants
n
set.seed(1)
<- 2 * rnorm(n)
y
set.seed(2)
<- as.matrix(rnorm(n))
X
set.seed(3)
<- as.matrix(rnorm(n)) E
Now that we have \({\bf y}\), \({\bf X}\), \({\bf E}\), and \({\bf G}\), we can run SEAGLE
on each gene in genelist
. We will first perform data checking procedures on \({\bf G}\). Then we will input \({\bf y}\), \({\bf X}\), \({\bf E}\), and \({\bf G}\) for each gene into the prep.SEAGLE
function. The intercept = 0
parameter indicates that the first column of \({\bf X}\) is not the all ones vector for the intercept.
The preparation procedure formats the input data for the SEAGLE
function by checking the dimensions of the input data. It also pre-computes a QR decomposition for \(\widetilde{\bf X} = \begin{pmatrix} {\bf 1}_{n} & {\bf X} & {\bf E} \end{pmatrix}\), where \({\bf 1}_{n}\) denotes the all ones vector of length \(n\).
Afterwards, we will input the prepared data for each gene into the SEAGLE
function to compute the score-like test statistic \(T\) and its corresponding p-value. The init.tau
and init.sigma
parameters are the initial values for estimating \(\tau\) and \(\sigma\) in the REML EM algorithm.
# Initialize output containers for T and p-value for each gene
<- numeric(length=num_genes)
T.list <- numeric(length=num_genes)
pv.list
# Run SEAGLE on each gene in genelist
for (i in 1:num_genes) {
# Identify current gene
<- genelist[i,4]
gene_name
# Obtain G
<- read.delim(paste0(dir, gene_name, ".raw"), sep=" ")
Gtemp <- as.matrix(Gtemp[,-c(1:6)])
G <- dim(G)[2] ## number of SNPs
L
# Make weights
<- colMeans(G)
avg_newsnp = avg_newsnp/2 ## freq of allele "A"
fA = 1-fA ## freq of allele "a"
fa = fA; maf[fA>0.5]= fa[fA>0.5]## maf should be b/w 0 and 0.5
maf = G[ ,maf>0] ## only keep those SNPs with MAF>0
G <- maf[maf > 0] ## Keep only MAF > 0)
maf = sqrt(maf^(-3/4)) ## Note we take the square root here
wt if (length(wt) > 1) {
= G %*% diag(wt) ## Input this G
G_final else {
} <- NA
T.list[i] <- NA
pv.list[i] next
}
# Run SEAGLE
<- prep.SEAGLE(y=y, X=X, intercept=0,
objSEAGLE E=E, G=G_final)
<- SEAGLE(objSEAGLE, init.tau=0.5, init.sigma=0.5)
res
# Save T and p-value into output lists
<- res$T
T.list[i] <- res$pv
pv.list[i] }
The score-like test statistics \(T\) for the G\(\times\)E effect and their corresponding p-values can be found in T.list
and pv.list
, respectively. We can take a look at the test statistics and p-values computed for each of the genes in genelist
.
<- cbind(T.list, pv.list)
resMat colnames(resMat) <- c("T", "p-value")
resMat#> T p-value
#> [1,] 7693.9187 0.6148780
#> [2,] 4022.4392 0.7990406
#> [3,] 3375.4804 0.9771967
#> [4,] 9698.0070 0.2089241
#> [5,] 3523.1752 0.3668638
#> [6,] 716.1017 0.3435655
#> [7,] 4792.4809 0.9336786
#> [8,] 139.5526 0.9186002
#> [9,] 139.5526 0.9186002
Many thanks to Yueyang Huang for generating the example data, PLINK1.9 code and bash files, and R scripts for reading in the .raw files for this tutorial.