GSEMA (Gene Set Enrichment Meta-Analysis) performs all the necessary steps of a gene set enrichment meta-analysis based on the combination of effect size. In preparation for calculating different effect sizes, first, a single sample enrichment analysis is performed. Based on the various enrichment values, the effect size is calculated as a difference of means. Subsequently, the package allows for the common steps of a meta-analysis based on the effect sizes.
GSEMA 0.99.3
GSEMA is a package designed to perform gene set enrichment meta-analysis.
The gene set enrichment meta-analysis allows for obtaining the differentially
regulated gene sets or pathways that are shared across various studies.
Specifically, GSEMA applies a meta-analysis based on the combination of
effect sizes obtained from single sample enrichment (SSE) analysis applied to
individual the different studies. The different effect sizes of each study for
each of the gene sets are calculated using a difference of means based on the
enrichment scores obtained from SSE analysis. GSEMA offers various methods to
obtain the statistics for the difference of means.
Subsequently, GSEMA allows for applying the various steps typical of a meta-analysis, in a similar way to the gene expression meta-analysis (Toro-Domínguez et al. (2020)), although different adjustments to the calculated statistics are performed in order to correct the possible existence of false positive bias.
This vignette provides a tutorial-style introduction to all the steps necessary to properly conduct gene set enrichment meta-analysis using the GSEMA package.
GSEMA uses a specific object as input, which is a list of nested lists where each nested list corresponds to a study. This object can be created directly by the users or they can use createObjectMApath() function to create it.
For the examples that are going to be shown, synthetic data will be used. We load the sample data into our R session.
library(GSEMA)
data("simulatedData")
As previously stated, the meta-analysis input in GSEMA is a list of nested lists. Each nested list contains two elements:
This object can be created directly by the user or we can make use of createObjectMApath() function, which creates the objectMApath after indicating:
createObjectMApath() function allows to create the object needed to perform meta-analysis. In this case, it is necessary to indicate as input of the function the variables that contain the experimental and reference groups:
#List of expression matrices
listMatrices <- list(study1Ex, study2Ex)
listPhenodata <- list(study1Pheno, study2Pheno)
It is important to note that if any element does not belong to the experimental or the reference group, that sample is not taken into account in the creation of meta-analysis object.
Here, we have included an example to show how exactly the function is used:
Since this function can be a bit complicated if there are many datasets, we recommend creating a vector to keep the column names of the phenodatas that contains the variable that identifies the groups to compare (namePheno argument). Moreover, we should create two others lits to indicate how to identify experimental (cases) and reference (controls) groups in these variables (expGroups and refGroups arguments).
If we look at the example phenodatas we can observed that the groups variable is “condition”. Experimental group is named as “Case” and reference group as “Healthy”.
study1Pheno$Condition
## [1] "Healthy" "Healthy" "Healthy" "Healthy" "Healthy" "Healthy" "Healthy"
## [8] "Healthy" "Healthy" "Healthy" "Healthy" "Healthy" "Healthy" "Healthy"
## [15] "Healthy" "Case" "Case" "Case" "Case" "Case" "Case"
## [22] "Case" "Case" "Case" "Case" "Case" "Case" "Case"
## [29] "Case" "Case"
head(names(GeneSets))
## [1] "KEGG_ABC_TRANSPORTERS"
## [2] "KEGG_ACUTE_MYELOID_LEUKEMIA"
## [3] "KEGG_ADHERENS_JUNCTION"
## [4] "KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY"
## [5] "KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM"
## [6] "KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION"
GeneSets[[1]]
## [1] "ABCA1" "ABCA10" "ABCA12" "ABCA13" "ABCA2" "ABCA3" "ABCA4" "ABCA5"
## [9] "ABCA6" "ABCA7" "ABCA8" "ABCA9" "ABCB1" "ABCB10" "ABCB11" "ABCB4"
## [17] "ABCB5" "ABCB6" "ABCB7" "ABCB8" "ABCB9" "ABCC1" "ABCC10" "ABCC11"
## [25] "ABCC12" "ABCC2" "ABCC3" "ABCC4" "ABCC5" "ABCC6" "ABCC8" "ABCC9"
## [33] "ABCD1" "ABCD2" "ABCD3" "ABCD4" "ABCG1" "ABCG2" "ABCG4" "ABCG5"
## [41] "ABCG8" "CFTR" "TAP1" "TAP2"
In parallelization, several aspects must be considered. “n.cores” refers to the parallelization of studies or datasets. Therefore, if we have 3 studies, the maximum number for “n.cores” will be 3. internal.n.cores refers to the parallelization of single sample enrichment methods. This is especially recommended for the ssGSEA method. For Singscore and GSVA, it may also be advisable. The process is parallelized based on the samples in each study. Therefore, the larger the number of samples, the slower the process will be. The number of cores that the computer will use is the multiplication of both parameters n.cores * internal.n.cores = total cores.
We all this information we can create the vector for namePheno argument and the two list for expGroups and refGroups:
listMatrices <- list(study1Ex, study2Ex)
listPhenodata <- list(study1Pheno, study2Pheno)
phenoGroups <- c("Condition","Condition", "Condition")
phenoCases <- list("Case", "Case", "Case")
phenoControls <- list("Healthy", "Healthy", "Healthy")
objectMApathSim <- createObjectMApath(
listEX = listMatrices,
listPheno = listPhenodata, namePheno = phenoGroups,
expGroups = phenoCases, refGroups = phenoControls,
geneSets = GeneSets,
pathMethod = "Zscore",
n.cores = 1,
internal.n.cores = 1, minSize = 7)
## Applying the Zscore method
The result obtained is the proper object to perform meta-analysis (objectMApath).
GSEMA package has implemented gene set enrichment meta-analysis techniques based on the combination of effect sizes in the metaAnalysisDEpath() function. Although this function can be applied directly, it is advisable to consider some previous steps.
Before performing the meta-analysis, it is important to filter out those pathways with low expression in the datasets. This is because the results of the meta-analysis could be biased by the presence of pathways with low expression, which may lead to an increase in the number of false positives. GSEMA provides a function called **filterPaths()* that allows to filter out those pathways with low expression. The inputs of this function are:
objectMApathSim <- filteringPaths(objectMApathSim, threshold = 0.65)
GSEMA has implemented three different estimator for calculating the effect sizes in each study:
The metaAnalysisDEpath() function allows to perform a meta-analysis in only one step, needing only the meta-analysis object created previously.
This function has as input:
In the following example, we have applied a Random Effect model to the GSEMA object (“objectMApathSim”) we have been working with so far. In addition we have applied the “limma” method to calculate the effect size. Finally, we have allowed a 0.3 proportion of missing values in a sample and a gene set must have been contained in all studies:
results <- metaAnalysisESpath(objectMApath = objectMApathSim,
measure = "limma", typeMethod = "REM", missAllow = 0.3,
numData = length(objectMApathSim))
## Performing Random Effects Model
The output of this function is a dataframe with the results of the meta-analysis where rows are the genes and columns are the different variables provided by the meta-analysis:
results[1,]
## Pathway
## KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION
## Com.ES ES.var Qval
## KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION -2.07387 0.1217558 0.7608425
## tau2 Zval Pval
## KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION 0 -5.943425 2.791278e-09
## FDR numDatasets
## KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION 1.544507e-07 2
results[nrow(results),]
## Pathway Com.ES ES.var Qval tau2
## Simulated_Pathway Simulated_Pathway 14.17285 3.270126 1.595479 2.479981
## Zval Pval FDR numDatasets
## Simulated_Pathway 7.837454 4.662937e-15 7.740475e-13 2
The results are provided in a dataframe with the variables:
Finally, we can represent in a heatmap the significant gene sets in order to observe how they are regulated in each of the studies. In heatmapPaths() function we have to include both the object that has been used in the meta-analysis, the result of it and the applied method. In addition, this package offers three different scaling approaches (scaling) in order to compare properly the gene set enrichment scores of the studies in the heatmap:
Moreover, in regulation argument, we can choose if we want to represent the over regulated or under regulated gene sets:
We can choose the number of significant gene sets (numSig) that we want to be shown on the graph and the adjusted p-value from which a gene set is considered as significant (fdrSig). In addition, the gene sets that are not presented in one sample are represented in gray.
Here we present an example of the heatmap which have been obtained from the result of applying a random effects model to the object “objectMApathSim” and making use of a “zscor” scaling approach.
res <- heatmapPaths(objectMA=objectMApathSim, results,
scaling = "zscor", regulation = "all",breaks=c(-2,2),
fdrSig = 0.05, comES_Sig = 1, numSig=20, fontsize = 5)
## scaling using z-score...
The calculateESpath() function returns the effects size in each of the studies. Moreover, it calculates the variance of each of the effects.
#Effects <- calculateESpath(objectMApath = objectMApathSim, measure = "limma")
#Effects[[1]]["Simulated_Pathway",]
#Effects[[2]]["Simulated_Pathway",]
#Session info
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Madrid
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] GSEMA_0.99.3 BiocStyle_2.32.1
##
## loaded via a namespace (and not attached):
## [1] mathjaxr_1.6-0 RColorBrewer_1.1-3
## [3] jsonlite_1.8.9 magrittr_2.0.3
## [5] magick_2.8.5 farver_2.1.2
## [7] rmarkdown_2.28 zlibbioc_1.50.0
## [9] vctrs_0.6.5 memoise_2.0.1
## [11] tinytex_0.53 htmltools_0.5.8.1
## [13] S4Arrays_1.4.1 progress_1.2.3
## [15] Rhdf5lib_1.26.0 SparseArray_1.4.8
## [17] rhdf5_2.48.0 sass_0.4.9
## [19] bslib_0.8.0 plyr_1.8.9
## [21] impute_1.78.0 cachem_1.1.0
## [23] lifecycle_1.0.4 iterators_1.0.14
## [25] pkgconfig_2.0.3 rsvd_1.0.5
## [27] Matrix_1.6-5 R6_2.5.1
## [29] fastmap_1.2.0 GenomeInfoDbData_1.2.12
## [31] rbibutils_2.2.16 MatrixGenerics_1.16.0
## [33] digest_0.6.37 numDeriv_2016.8-1.1
## [35] colorspace_2.1-1 AnnotationDbi_1.66.0
## [37] S4Vectors_0.42.1 singscore_1.24.0
## [39] irlba_2.3.5.1 GenomicRanges_1.56.1
## [41] RSQLite_2.3.7 beachmat_2.20.0
## [43] metadat_1.2-0 fansi_1.0.6
## [45] httr_1.4.7 abind_1.4-8
## [47] compiler_4.4.1 bit64_4.5.2
## [49] doParallel_1.0.17 metafor_4.6-0
## [51] BiocParallel_1.38.0 DBI_1.2.3
## [53] highr_0.11 HDF5Array_1.32.1
## [55] DelayedArray_0.30.1 rjson_0.2.23
## [57] tools_4.4.1 glue_1.8.0
## [59] nlme_3.1-165 rhdf5filters_1.16.0
## [61] grid_4.4.1 reshape2_1.4.4
## [63] generics_0.1.3 gtable_0.3.5
## [65] tidyr_1.3.1 hms_1.1.3
## [67] BiocSingular_1.20.0 ScaledMatrix_1.12.0
## [69] utf8_1.2.4 XVector_0.44.0
## [71] BiocGenerics_0.50.0 foreach_1.5.2
## [73] pillar_1.9.0 stringr_1.5.1
## [75] GSVA_1.52.3 limma_3.60.5
## [77] dplyr_1.1.4 lattice_0.22-5
## [79] bit_4.5.0 annotate_1.82.0
## [81] tidyselect_1.2.1 SingleCellExperiment_1.26.0
## [83] locfit_1.5-9.10 Biostrings_2.72.1
## [85] pbapply_1.7-2 knitr_1.48
## [87] bookdown_0.40 IRanges_2.38.1
## [89] edgeR_4.2.1 SummarizedExperiment_1.34.0
## [91] stats4_4.4.1 xfun_0.47
## [93] Biobase_2.64.0 statmod_1.5.0
## [95] matrixStats_1.4.1 pheatmap_1.0.12
## [97] stringi_1.8.4 UCSC.utils_1.0.0
## [99] yaml_2.3.10 evaluate_1.0.0
## [101] codetools_0.2-19 tibble_3.2.1
## [103] BiocManager_1.30.25 graph_1.82.0
## [105] cli_3.6.3 xtable_1.8-4
## [107] Rdpack_2.6.1 munsell_0.5.1
## [109] jquerylib_0.1.4 Rcpp_1.0.13
## [111] GenomeInfoDb_1.40.1 png_0.1-8
## [113] XML_3.99-0.17 parallel_4.4.1
## [115] ggplot2_3.5.1 blob_1.2.4
## [117] prettyunits_1.2.0 sparseMatrixStats_1.16.0
## [119] SpatialExperiment_1.14.0 GSEABase_1.66.0
## [121] scales_1.3.0 purrr_1.0.2
## [123] crayon_1.5.3 rlang_1.1.4
## [125] KEGGREST_1.44.1
Barbie, D. A., P. Tamayo, J. S. Boehm, S. Y. Kim, S. E. Moody, I. F. Dunn, and et al. 2009. “Systematic Rna Interference Reveals That Oncogenic Kras-Driven Cancers Require Tbk1.” Nature, 108–12. https://doi.org/10.1371/journal.pcbi.1000217.
Borenstein, M., L. V. Hedges, J. P. T. Higgins, and Rothstein. 2021. “Introduction to Meta-Analysis.” John Wiley and Sons.
Foroutan, M., D. D. Bhuva, R. Lyu, K. Horan, J. Cursons, and M. J. Davis. 2018. “Single Sample Scoring of Molecular Phenotypes.” BMC Bioinformatics, 404. https://doi.org/10.1186/s12859-018-2435-4.
Hänzelmann, S., R. Castelo, and J. Guinney. 2013. “GSVA: Gene Set Variation Analysis for Microarray and Rna-Seq Data.” BMC Bioinformatics, 7. https://doi.org/10.1186/1471-2105-14-7.
Hastie, T., R. Tibshirani, B. Narasimhan, and G. Chu. 2019. “Impute: Imputation for Microarray Data.”
Hedges, L. V. 1981. “Distribution Theory for Glass’s Estimator of Effect Size and Related Estimators.” Journal of Educational Statistics, 107–28. https://doi.org/10.2307/1164588.
Lazar, C., S. Meganck, J. Taminau, and et al. 2013. “Batch Effect Removal Methods for Microarray Gene Expression Data Integration: A Survey.” Briefings in Bioinformatics, 469–90. https://doi.org/10.1093/bib/bbs037.
Lee, E., H. Y. Chuang, J. W. Kim, T. Ideker, and D. Lee. 2008. “Inferring Pathway Activity Toward Precise Disease Classification.” PLOS Computational Biology, e1000217. https://doi.org/10.1371/journal.pcbi.1000217.
Smyth, G. K., M. Ritchie, N. Thorne, Hu Yifang, and et al. 2020. “Linear Models for Microarray and Rna-Seq Data User’s Guide.”
Toro-Domínguez, D., J. A. Villatoro-García J. A., J. Martorell-Marugán, and et al. 2020. “A Survey of Gene Expression Meta-Analysis:Methods and Applications.” Briefings in Bioinformatics, 1–12. https://doi.org/10.1093/bib/bbaa019.
Wickham, H., and D.Siedel. 2020. “Scale Functions for Visualization.”