This guide provides an overview of using the R
package SuperExactTest
for statistical testing and visualization of mult-set intersections. In this package, we implemented a theoretical framework for computing the exact statistical distributions of multi-set intersections (Wang et al 2015). Utilizing a forward algorithm based procedure with the computational complexity linear to the number of sets, we are able to efficiently calculate the intersection probability among a large number of sets. Multiple efficient and scalable visualization techniques are provided for illustrating multi-set intersections and the corresponding intersection statistics.
SuperExactTest
If you use the SuperExactTest
package, please cite the following paper:
Minghui Wang, Yongzhong Zhao, and Bin Zhang (2015) Efficient Test and Visualization of Multi-Set Intersections. Scientific Reports 5: 16923.
We demonstrate the utility of the package through analyzing multiple sets of expression quantitative trait loci (eQTLs) identified from several tissues. eQTLs are the genomic loci (as represented by genetic variants in practice) that regulate the gene expression. Identifying expression regulators is considered a powerful tool in dissecting the architecture of the genetics of disease and other complex phenotypes. It is widely believed that the gene expression regulation are both spatially (cell-type and tissue specificity) and temporally (different developmental stages).
As an example, we downloaded four sets of genes with cis-eQTLs (i.e. gene expression regulated by local genetic variants) detected in four different brain regions (Gibbs et al 2010) which had been deposited in the eQTL Browser database (www.ncbi.nlm.nih.gov/ projects/gap/eqtl/index.cgi; url no longer accessible) and performed statistical analyses of the intersections among the gene sets. For convenience, we pre-compiled the cis-eQTL genes into a list which has been included in the present package. After loading the package, the pre-compiled dataset can be imported as follows:
## Loading required package: grid
##
## Attaching package: 'SuperExactTest'
## The following objects are masked from 'package:base':
##
## intersect, union
The loaded data object is a list called cis.eqtls
which contains four vectors of gene symbols. To check the structure of the imported object, we can use command:
## List of 4
## $ CB : chr [1:147] "KCTD10" "CHURC1" "C10orf85" "AMFR" ...
## $ FC : chr [1:164] "CHURC1" "KCTD10" "AMFR" "SCG3" ...
## $ TC : chr [1:137] "CHURC1" "SCG3" "AMFR" "KCTD10" ...
## $ PONS: chr [1:101] "CHURC1" "KCTD10" "PEX6" "HMBOX1" ...
The names of the gene sets preserve the brain tissue information: CB (cerebellum), FC (frontal cortex), TC (temporal cortex) and PONS (pons region). It needs to be stressed that these cis-eQTL gene sets were detected from genome-wide gene expression profiling of 18,196 unique genes (Gibbs et al 2010).
The length of the cis-eQTL gene sets varies from 101 to 164:
## CB FC TC PONS
## 147 164 137 101
Assuming the cis-eQTL gene sets were independently and randomly sampled from the population of 18,196 unique genes profiled in the eQTL study, the probability of the number of common genes shared by the four cis-eQTL gene sets can be computed using function dpsets
which implements the exact probability calculation of multi-set intersection developed in this study. Before we perform the statistical test of the intersection among the four gene sets, let us firstly calculate the expected overlap size:
## [1] 5.53701e-05
Due to the large background population gene size whereas small gene set sizes, the expected intersection size is close to 0. It is obvious that the possible number of genes shared among the four cis-eQTL gene sets is from 0 to 101. We can compute the probability density distribution of the possible intersection sizes using the dpsets
function:
## [1] 9.999446e-01 5.536713e-05 1.487462e-09 2.584596e-14 3.266929e-19
## [6] 3.203379e-24 2.537574e-29 1.669915e-34 9.317026e-40 4.476026e-45
## [11] 1.874211e-50 6.907188e-56 2.258521e-61 6.596148e-67 1.730423e-72
## [16] 4.097358e-78 8.793254e-84 1.716552e-89 3.057708e-95 4.983934e-101
## [21] 7.451607e-107 1.024173e-112 1.296529e-118 1.514351e-124 1.634467e-130
## [26] 1.632415e-136 1.510523e-142 1.296427e-148 1.033055e-154 7.649616e-161
## [31] 5.267943e-167 3.376250e-173 2.015080e-179 1.120610e-185 5.809401e-192
## [36] 2.808707e-198 1.266884e-204 5.332802e-211 2.095428e-217 7.687359e-224
## [41] 2.633520e-230 8.425576e-237 2.517649e-243 7.026430e-250 1.831520e-256
## [46] 4.458637e-263 1.013598e-269 2.151522e-276 4.263524e-283 7.885797e-290
## [51] 1.361041e-296 2.191414e-303 3.290550e-310 4.606280e-317 4.940656e-324
## [56] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [61] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [66] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [71] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [76] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [81] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [86] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [91] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [96] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [101] 0.000000e+00 0.000000e+00
In the function call of dpsets
, the first argument is the intersection size, the second argument is a vector of the set sizes, and option \(n\) specifies the size of the background gene population from which the gene sets are collected. As expected, the probability density is maximized at intersection size 0 and decreases as the intersection size increases.
Next, we compute the observed intersection among the four gene sets and then the corresponding fold enrichment (FE) by:
common.genes=intersect(cis.eqtls[[1]], cis.eqtls[[2]], cis.eqtls[[3]],
cis.eqtls[[4]])
(num.observed.overlap=length(common.genes))
## [1] 56
## [1] 1011376
Note that we re-implement the built-in R
function intersect
to make it capable of handling more than two input vectors.
The probability density of the observed intersection size is therefore:
## [1] 0
The probability of observing 56 or more intersection genes can be calculated using the cumulative probability function cpsets
:
## [1] 0
Like function dpsets
, cpsets
takes similar arguments as input, with an additional argument lower.tail
specifying which of the one-tail probabilities to be returned. num.observed.overlap - 1
because the probability is \(P(X \leq x)\) when lower.tail = TRUE
while \(P(X > x)\) when lower.tail = FALSE
.
The extremly low one-tail probability (0 due to computation precision) of the observed intersection size underlies significant overlap of common cis regulators across four brain regions in the present cis-eQTL dataset, rejecting the null hypothesis that the cis-eQTL gene sets were independent random samples from the population of 18,196 unique genes profiled.
Alternative to finding intersection and statistical test sequentially, we also implement a function MSET
to perform the intersection operation and probability calculation in one go:
## [1] 1011376
## [1] 0
In the above, we analyzed one specific intersection that is shared by all four gene sets. In practice, we may be also interested in the intersections among any combinations of the gene sets, such as overlap among any two or three gene sets in this case. To facilitate a comprehensive analysis of \(2^n-1\) possible intersections for \(n\) sets, we designed a function supertest
to enumerate all possible intersections given a list of input vectors and perform the statistical tests of the intersections automatically. Again we illustrate the function usage with the four cis-eQTL gene sets:
The returned variable res
is an object of new S3
class "msets"
which is a specifically designed list holding the analysis results and can be processed by generic functions including plot
and summary
. For example, to visualize the analysis result, we can plot the analysis results in a circular layout as shown in the following figure:
The four tracks in the middle represent the four gene sets, with individual blocks showing “presence” (green) or “absence” (grey) of the gene sets in each intersection. To denote the four gene sets with different colors, set argument color.on
to NULL or provide it with a vector of user-defined colors. The height of the bars in the outer layer is proportional to the intersection sizes, as indicated by the numbers on the top of the bars. The color intensity of the bars represents the P value significance of the intersections.
As the option name suggests, sort.by="size"
instructs the intersections to be sorted by size. Users are flexible to use different color schemes or sort intersections by P value, set, size or degree. The complete control options can be obtained from the help documentation by command help("plot.msets")
. For example, we can visualize the results in a landscape (matrix) layout as shown in Figure which includes only intersections among 2 to 4 sets by option degree = 2:4
:
The matrix of solid and empty circles at the bottom illustrates the “presence” (solid green) or “absence” (empty) of the gene sets in each intersection. The numbers to the right of the matrix are set sizes. The colored bars on the top of the matrix represent the intersection sizes with the color intensity showing the P value significance.
The generic summary
function can be used to summarize the analysis results in details:
## A msets object with 4 sets: CB FC TC PONS
## Background size: 18196
## Summary of intersections:
## Intersections Degree Observed.Overlap Expected.Overlap FE
## 0001 PONS 1 101 NA NA
## 0010 TC 1 137 NA NA
## 0011 TC & PONS 2 72 0.7604418554 9.468180e+01
## 0100 FC 1 164 NA NA
## 0101 FC & PONS 2 74 0.9103099582 8.129099e+01
## 0110 FC & TC 2 108 1.2347768740 8.746519e+01
## 0111 FC & TC & PONS 3 68 0.0068538395 9.921446e+03
## 1000 CB 1 147 NA NA
## 1001 CB & PONS 2 62 0.8159485601 7.598518e+01
## 1010 CB & TC 2 72 1.1067817103 6.505348e+01
## 1011 CB & TC & PONS 3 56 0.0061433806 9.115502e+03
## 1100 CB & FC 2 76 1.3249065729 5.736254e+01
## 1101 CB & FC & PONS 3 60 0.0073541198 8.158692e+03
## 1110 CB & FC & TC 3 67 0.0099753902 6.716529e+03
## 1111 CB & FC & TC & PONS 4 56 0.0000553701 1.011376e+06
## P.value Elements
## 0001 NA AATF, ABHD12, ACCS, ...
## 0010 NA ABHD12, ACCS, ACP6, ...
## 0011 2.102340e-138 ABHD12, ACCS, ACP6, ...
## 0100 NA ABHD12, ADAD2, ADAL, ...
## 0101 3.399208e-136 ABHD12, AMFR, ATPIF1 ...
## 0110 2.081928e-212 ABHD12, ADAL, AKAP8, ...
## 0111 2.400407e-273 ABHD12, AMFR, ATPIF1 ...
## 1000 NA ABHD10, ABHD12, ACO1 ...
## 1001 6.348584e-109 ABHD12, AMFR, ATPIF1 ...
## 1010 1.165913e-120 ABHD12, ADAL, AMFR, ...
## 1011 1.105986e-218 ABHD12, AMFR, ATPIF1 ...
## 1100 2.768779e-122 ABHD12, ADAL, AMFR, ...
## 1101 6.827755e-232 ABHD12, AMFR, ATPIF1 ...
## 1110 1.754835e-252 ABHD12, ADAL, AMFR, ...
## 1111 0.000000e+00 ABHD12, AMFR, ATPIF1 ...
The main output from the summary
function is a data.frame, with each row representing an intersection and its corresponding test statistics. Details about the summary
function output are available from the help documentation:
To tabulate the summary
result into a file, use command:
Gibbs et al. (2010) Abundant Quantitative Trait Loci Exist for DNA Methylation and Gene Expression in Human Brain. PLoS Genetics 6: e1000952.
Minghui Wang, Yongzhong Zhao, and Bin Zhang (2015) Efficient Test and Visualization of Multi-Set Intersections. Scientific Reports 5: 16923.