This tutorial demonstrates how to use the SEAGLE
package when the genotype data are from GWAS studies (.bed + .bim + .fam) or next generation sequencing studies (.vcf). We recommend using PLINK1.9 available from https://www.cog-genomics.org/plink/1.9/ to generate a .raw file (see https://www.cog-genomics.org/plink/2.0/formats#raw) for each of the SNP sets. The .raw file records the allelic dosage for each SNP.
Examples files containing 1kg_phase1_chr22.bed
, 1kg_phase1_chr22.bim
, 1kg_phase1_chr22.fam
, and 1kg_phase1_chr22.vcf
files and a corresponding gene list in glist-hg19
can be downloaded via the SEAGLE-example3-data.zip
file from http://jocelynchi.com/SEAGLE/SEAGLE-example3-data.zip. To follow along with the PLINK conversion procedures, you will first need to install PLINK1.9 from https://www.cog-genomics.org/plink/1.9/ and unzip the example SEAGLE-example3-data.zip
file. Afterwards, you can type the PLINK commands below in a command prompt or terminal window in the folder where these files are located. This will produce .raw files that you can read into R to obtain the relevant genetic marker matrix \({\bf G}\) for input into SEAGLE
.
For GWAS data in the .bed + .bim + .fam format, we recommend first converting to the .raw format with PLINK1.9. To illustrate the analysis of a SNP set from a single gene, we employ the gene ACR as an example. We type the following commands in command prompt or terminal to produce the corresponding .raw file, which includes all SNPs from gene ACR.
plink --bfile 1kg_phase1_chr22 --make-set glist-hg19 --gene ACR --out ACR --export A
To illustrate the analysis of a SNP set from multiple genes, we employ the sequence of genes from A4GALT and ACR as an example. We type the following commands in command prompt or terminal to produce the corresponding .raw file, which includes all SNPs from genes A4GALT and ACR.
plink --bfile 1kg_phase1_chr22 --make-set glist-hg19 --gene A4GALT ACR --out A4GALT_ACR --export A
For next generation sequencing data in the .vcf format, we recommend first converting to the .raw format with PLINK1.9. To illustrate the analysis of a SNP set from a single gene, we employ the gene ACR as an example. We type the following commands in command prompt or terminal to produce the corresponding .raw file, which includes all SNPs from gene ACR.
plink --vcf 1kg_phase1_chr22.vcf --make-set glist-hg19 --gene ACR --out ACR --export A
To illustrate the analysis of a SNP set from multiple genes, we employ the sequence of genes from A4GALT and ACR as an example. We type the following commands in command prompt or terminal to produce the corresponding .raw file, which includes all SNPs from genes A4GALT and ACR.
plink --vcf 1kg_phase1_chr22.vcf --make-set glist-hg19 --gene A4GALT ACR --out A4GALT_ACR --export A
Now that we have the genotype data in the .raw format, we can load it into R as follows. We’ll first load the SEAGLE
package.
library(SEAGLE)
The following code will load the ACR.raw
file produced from the PLINK conversion for GWAS data for single gene analysis described above. Of course, the actual file location will depend on where you stored the results of the PLINK conversion on your local drive.
acr <- read.delim("ACR.raw", sep=" ")
As an example, we’ve included the ACR.raw
file in the extdata
folder of this package. The following code loads that file so we can use it in this tutorial.
<- system.file("extdata", "ACR.raw", package = "SEAGLE")
acr_loc <- read.delim(acr_loc, sep=" ") acr
We can take a quick look at the resulting acr
object. The first 6 columns correspond to identifying information.
dim(acr)
#> [1] 1092 107
head(acr)
#> FID IID PAT MAT SEX PHENOTYPE rs200755264_A rs145373249_G rs201146734_G
#> 1 0 HG00096 0 0 1 -9 0 0 0
#> 2 0 HG00097 0 0 2 -9 0 0 0
#> 3 0 HG00099 0 0 2 -9 0 0 0
#> 4 0 HG00100 0 0 2 -9 0 0 0
#> 5 0 HG00101 0 0 1 -9 0 0 0
#> 6 0 HG00102 0 0 2 -9 0 0 0
#> rs79314756_T rs200079264_C rs142494818_T rs77216309_T rs190253671_C
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs182163802_A rs73174437_T rs12165648_A rs116026001_G rs146910789_T
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 1 0 0 0
#> 6 0 0 0 0 0
#> rs185375060_A rs191165317_A rs140628153_T rs1064733_T rs114634361_T
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs150273293_T rs77313877_A rs189214490_A rs75385660_C rs147948077_A
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs200217266_G rs2285395_A rs143887109_A rs140364806_T rs182889325_T
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs6010067_C rs58109483_G rs141844965_T rs145800154_C rs148192955_T
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs2283675_A rs140237069_C rs191736410_C rs59998603_T rs142530125_G
#> 1 1 0 0 0 0
#> 2 1 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs184104408_T rs188103638_T rs146034851_C rs138777072_A rs192800892_T
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs2283676_A rs185550366_T rs190654794_C rs181504904_G rs186157598_A
#> 1 1 0 0 0 0
#> 2 1 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs188568125_T rs181184416_T rs185832541_A rs5770999_T rs73433406_T
#> 1 0 0 0 1 0
#> 2 0 0 0 2 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs150628453_A rs6010068_T rs190741616_T rs6010069_G rs9616824_T rs6010070_G
#> 1 0 1 0 0 1 0
#> 2 0 1 0 0 0 0
#> 3 0 1 0 1 0 1
#> 4 0 0 0 1 0 0
#> 5 0 1 0 1 0 1
#> 6 0 0 0 0 0 0
#> rs183029053_G rs150059289_C rs186986450_T rs201638160_T rs57763389_A
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 0 0 0 0
#> 6 0 0 0 0 0
#> rs144808185_T rs138662706_C rs199878967_A rs141891250_T rs201174415_CT
#> 1 0 0 0 0 0
#> 2 0 0 0 0 0
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 1 1 1 0 0
#> 6 1 1 1 0 0
#> rs112131405_C rs13058392_G rs149554062_A rs6010072_G rs9616953_C rs6009960_G
#> 1 0 0 0 0 0 0
#> 2 0 0 0 0 0 0
#> 3 1 0 1 1 1 1
#> 4 1 0 1 1 1 1
#> 5 1 0 0 1 1 1
#> 6 1 0 0 1 1 0
#> rs9616954_G rs13056271_C rs56807126_C rs34756634_G rs9616955_T rs13056621_A
#> 1 0 0 0 0 0 0
#> 2 0 0 0 1 0 1
#> 3 1 0 0 1 1 1
#> 4 0 1 1 0 0 0
#> 5 0 1 0 0 0 0
#> 6 0 0 0 0 0 1
#> rs201458351_G rs9616825_G rs182754197_G rs146416000_A rs6010073_C
#> 1 0 1 0 0 0
#> 2 0 2 1 0 0
#> 3 0 0 0 0 2
#> 4 0 0 0 0 0
#> 5 0 0 0 0 1
#> 6 1 0 0 0 0
#> rs187441595_T rs116139503_G rs190644757_G rs28516879_A rs182503014_T
#> 1 0 0 0 0 0
#> 2 0 0 0 0 1
#> 3 0 0 0 1 0
#> 4 0 0 0 0 0
#> 5 0 0 0 1 1
#> 6 0 0 0 0 1
#> rs188459010_C rs56238942_G rs6009961_A rs6010074_T rs139027620_C rs58651371_T
#> 1 0 0 1 0 0 1
#> 2 0 0 2 0 0 2
#> 3 0 0 0 0 1 0
#> 4 0 0 0 0 0 0
#> 5 0 0 0 0 1 1
#> 6 0 0 0 0 1 1
#> rs71232632_G rs5771002_A rs184517959_T rs201494682_T
#> 1 0 1 0 0
#> 2 0 1 0 0
#> 3 0 1 2 0
#> 4 0 1 0 0
#> 5 0 1 1 0
#> 6 0 2 0 0
The remaining columns contain our genetic marker matrix \({\bf G}\). We’ll extract these to input into SEAGLE
.
<- as.matrix(acr[,-c(1:6)])
G dim(G)
#> [1] 1092 101
As an example, we’ll generate synthetic phenotype, \({\bf y}\), and covariate data, \({\bf X}\) and \({\bf E}\).
# Determine number of individuals and loci
<- dim(G)[1]
n <- dim(G)[2]
L
# Generate synthetic phenotype and covariate data
set.seed(1)
<- 2 * rnorm(n)
y
set.seed(2)
<- rnorm(n)
X
set.seed(3)
<- rnorm(n) E
Now that we have \({\bf y}\), \({\bf X}\), \({\bf E}\), and \({\bf G}\), we can prepare our data for input into SEAGLE
. We will input our \({\bf y}\), \({\bf X}\), \({\bf E}\), and \({\bf G}\) into the prep.SEAGLE
function. The intercept = 0
parameter indicates that the first column of \({\bf X}\) is not the all ones vector for the intercept.
This preparation procedure formats the input data for the SEAGLE
function by checking the dimensions of the input data. It also pre-computes a QR decomposition for \(\widetilde{\bf X} = \begin{pmatrix} {\bf 1}_{n} & {\bf X} & {\bf E} \end{pmatrix}\), where \({\bf 1}_{n}\) denotes the all ones vector of length \(n\).
<- prep.SEAGLE(y=y, X=X, intercept=0, E=E, G=G) objSEAGLE
Now we can input the prepared data into the SEAGLE
function to compute the score-like test statistic \(T\) and its corresponding p-value. The init.tau
and init.sigma
parameters are the initial values for estimating \(\tau\) and \(\sigma\) in the REML EM algorithm.
<- SEAGLE(objSEAGLE, init.tau=0.5, init.sigma=0.5)
res $T
res#> [1] 915.93
$pv
res#> [1] 0.725045
The score-like test statistic \(T\) for the G\(\times\)E effect and its corresponding p-value can be found in res$T
and res$pv
, respectively.
Many thanks to Yueyang Huang for his help with generating the example data and PLINK1.9 code for this tutorial.