ActivePathways is a tool for multivariate pathway enrichment analysis that identifies gene sets, such as pathways or Gene Ontology terms, that are over-represented in a list or matrix of genes. ActivePathways uses a data fusion method to combine multiple omics datasets, prioritizes genes based on the significance and direction of signals from the omics datasets, and performs pathway enrichment analysis of these prioritized genes. We can find pathways and genes supported by single or multiple omics datasets, as well as additional genes and pathways that are only apparent through data integration and remain undetected in any single dataset alone.
The new version of ActivePathways is described in our recent publication in Nature Communications (2024).
Mykhaylo Slobodyanyuk^, Alexander T. Bahcheli^, Zoe P. Klein, Masroor Bayati, Lisa J. Strug, Jüri Reimand. Directional integration and pathway enrichment analysis for multi-omics data. Nature Communications (2024) (^ - co-first authors) https://www.nature.com/articles/s41467-024-49986-4 https://pubmed.ncbi.nlm.nih.gov/38971800/
The first version of ActivePathways was published in Nature Communications with the PCAWG Pan-Cancer project.
Marta Paczkowska^, Jonathan Barenboim^, Nardnisa Sintupisut, Natalie S. Fox, Helen Zhu, Diala Abd-Rabbo, Miles W. Mee, Paul C. Boutros, PCAWG Drivers and Functional Interpretation Working Group, PCAWG Consortium, Juri Reimand. Integrative pathway enrichment analysis of multivariate omics data. Nature Communications 11 735 (2020) (^ - co-first authors) https://www.nature.com/articles/s41467-019-13983-9 https://pubmed.ncbi.nlm.nih.gov/32024846/
From a matrix of p-values, ActivePathways
creates a
ranked gene list where genes are prioritised based on their combined
significance. The combined significance of each gene is determined by
performing statistical data fusion on a series of omics datasets
provided in the input matrix. The ranked gene list includes the most
significant genes first. ActivePathways
then performs a
ranked hypergeometric test to determine if a pathway (i.e., a gene set
with a common functional annotation) is enriched in the ranked gene
list, by performing a series of hypergeometric tests (also known as
Fisher’s exact tests). In each such test, a larger set of genes from the
top of the ranked gene list is considered. At the end of the series, the
ranked hypergeometric test returns the top, most significant p-value
from the series, corresponding to the point in the ranked gene list
where the pathway enrichment reached the greatest significance of
enrichment. This approach is useful when the genes in our ranked gene
list have varying signals of biological importance in the input omics
datasets, as the test identifies the top subset of genes that are the
most relevant to the enrichment of the pathway.
A basic example of using ActivePathways is shown below.
We will analyse cancer driver gene predictions for a collection of cancer genomes. Each dataset (i.e., column in the matrix) contains a statistical significance score (P-value) where genes with small P-values are considered stronger candidates of cancer drivers based on the distribution of mutations in the genes. For each gene (i.e., row in the matrix), we have several predictions representing genomic elements of the gene, such as coding sequences (CDS), untranslated regions (UTR), and core promoters (promCore).
To analyse these driver genes using existing knowledge of gene function, we will use gene sets corresponding to known molecular pathways from the Reactome database. These gene sets are commonly distributed in text files in the GMT format (Gene Matrix Transposed) file.
Let’s start by reading the data from the files embedded in the R
package. For the p-value matrix, ActivePathways
expects an
object of the matrix class so the table has to be cast to the correct
class after reading the file.
scores <- read.table(
system.file('extdata', 'Adenocarcinoma_scores_subset.tsv', package = 'ActivePathways'),
header = TRUE, sep = '\t', row.names = 'Gene')
scores <- as.matrix(scores)
scores
## X3UTR X5UTR CDS promCore
## A2M NA 3.339676e-01 9.051708e-01 4.499201e-01
## AAAS NA 4.250601e-01 7.047723e-01 7.257641e-01
## ABAT 0.9664125635 4.202735e-02 7.600985e-01 1.903789e-01
## ABCC1 0.9383431571 4.599443e-01 2.599319e-01 2.980455e-01
## ABCC5 0.8021028187 4.478902e-01 4.788846e-01 8.447995e-01
## ABCC8 NA 4.699356e-01 6.890250e-01 9.226617e-01
## ABHD6 0.4019418029 4.488881e-01 7.587176e-01 3.062514e-01
## ABI1 0.2894847305 1.977600e-01 4.815032e-01 3.486056e-01
## [ reached getOption("max.print") -- omitted 2406 rows ]
ActivePathways
does not allow missing (NA) values in the
matrix of P-values and these need to be removed. One conservative option
is to re-assign all missing values as ones, indicating our confidence
that the missing values are not indicative of cancer drivers.
Alternatively, one may consider removing genes with NA values.
The basic use of ActivePathways
requires only two input
parameters, the matrix of P-values with genes in rows and datasets in
columns, as prepared above, and the path to the GMT file in the file
system. Importantly, the gene IDs (symbols, accession numbers, etc) in
the P-value matrix need to match those in the GMT file.
Here we use a GMT file provided with the package. This GMT file is heavily filtered and outdated; thus users must provide their own GMT file when using the package. These GMT files can be acquired from multiple sources such as Gene Ontology, Reactome and others. For better accuracy and statistical power these pathway databases should be combined. Acquiring an up-to-date GMT file is essential to avoid using unreliable outdated annotations (see this paper).
library(ActivePathways)
gmt_file <- system.file('extdata', 'hsapiens_REAC_subset.gmt', package = 'ActivePathways')
ActivePathways(scores, gmt_file)
## term_id term_name adjusted_p_val term_size
## <char> <char> <num> <int>
## 1: REAC:2424491 DAP12 signaling 4.491268e-05 358
## 2: REAC:422475 Axon guidance 4.625918e-04 555
## overlap evidence
## <list> <list>
## 1: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
## Genes_X3UTR Genes_X5UTR
## <list> <list>
## 1: NA NA
## 2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,... NA
## Genes_CDS
## <list>
## 1: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## 2: NA
## Genes_promCore
## <list>
## 1: NA
## 2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
## [ reached getOption("max.print") -- omitted 32 rows ]
ActivePathways 2.0 extends our integrative pathway analysis framework significantly. Users can now provide directional assumptions of input omics datasets for more accurate analyses. This allows us to prioritise genes and pathways where certain directional assumptions are met and penalise those where the assumptions are violated.
For example, fold-change in protein expression would be expected to associate positively with mRNA fold-change of the corresponding gene, while negative associations would be unexpected and indicate more-complex situations or potential false positives. We can instruct the pathway analysis to prioritise positively-associated protein/mRNA pairs and penalise negative associations (or vice versa).
Two additional inputs are included in ActivePathways that allow diverse multi-omics analyses. These inputs are optional.
The scores_direction and constraints_vector parameters are provided in the merge_p_values() and ActivePathways() functions to incorporate this directional penalty into the data fusion and pathway enrichment analyses.
The parameter constraints_vector is a vector that allows the user to represent the expected relationship between the input omics datasets. The vector size is n_datasets. Values include +1, -1, and 0.
The parameter scores_direction is a matrix that reflects the directions that the genes/transcripts/protein show in the data. The matrix size is n_genes * n_datasets, that is the same size as the P-value matrix. This is a numeric matrix, but only the signs of the values are accounted for.
Load a dataset of P-values and fold-changes for mRNA and protein levels. This dataset is embedded in the package. Examine a few example genes.
fname_data_matrix <- system.file('extdata',
'Differential_expression_rna_protein.tsv',
package = 'ActivePathways')
pvals_FCs <- read.table(fname_data_matrix, header = TRUE, sep = '\t')
example_genes <- c('ACTN4','PIK3R4','PPIL1','NELFE','LUZP1','ITGB2')
pvals_FCs[pvals_FCs$gene %in% example_genes,]
## gene rna_pval rna_log2fc protein_pval protein_log2fc
## 73 PIK3R4 1.266285e-03 1.1557077 2.791135e-03 -0.8344799
## 74 PPIL1 1.276838e-03 -1.1694221 1.199303e-04 -1.1193605
## 606 NELFE 1.447553e-02 -0.9120687 1.615592e-05 -1.2630114
## 4048 LUZP1 3.253382e-05 1.5830796 4.129125e-02 0.5791377
## 4050 ITGB2 4.584450e-05 1.6472117 1.327997e-01 0.4221579
## 4052 ACTN4 5.725503e-05 1.5531533 8.238317e-07 1.4279158
Create a matrix of gene/protein P-values. Where the columns are different omics datasets (mRNA, protein) and the rows are genes. Examine a few genes in the P-value matrix, and convert missing values to P = 1.
pval_matrix <- data.frame(
row.names = pvals_FCs$gene,
rna = pvals_FCs$rna_pval,
protein = pvals_FCs$protein_pval)
pval_matrix <- as.matrix(pval_matrix)
pval_matrix[is.na(pval_matrix)] <- 1
pval_matrix[example_genes,]
## rna protein
## ACTN4 5.725503e-05 8.238317e-07
## PIK3R4 1.266285e-03 2.791135e-03
## PPIL1 1.276838e-03 1.199303e-04
## NELFE 1.447553e-02 1.615592e-05
## LUZP1 3.253382e-05 4.129125e-02
## ITGB2 4.584450e-05 1.327997e-01
Create a matrix of gene/protein directions similarly to the P-value matrix (i.e., scores_direction). ActivePathways only uses the signs of the direction values (ie +1 or -1). If directions are missing (NA), we recommend setting the values to zero. Lets examine a few genes in the direction matrix.
dir_matrix <- data.frame(
row.names = pvals_FCs$gene,
rna = pvals_FCs$rna_log2fc,
protein = pvals_FCs$protein_log2fc)
dir_matrix <- as.matrix(dir_matrix)
dir_matrix <- sign(dir_matrix)
dir_matrix[is.na(dir_matrix)] <- 0
dir_matrix[example_genes,]
## rna protein
## ACTN4 1 1
## PIK3R4 1 -1
## PPIL1 -1 -1
## NELFE -1 -1
## LUZP1 1 1
## ITGB2 1 1
This matrix has to be accompanied by a vector that provides the expected relationship between the different datasets. Here, mRNA levels and protein levels are expected to have consistent directions: either both positive or both negative (eg log fold-change). Alternatively, we can use another vector to prioritise genes/proteins where the directions are the opposite.
## [1] 1 1
Now we merge the P-values of the two datasets using directional assumptions and compare these with the plain non-directional merging. The top 5 scoring genes differ if we penalise genes where this directional logic is violated: While 4 of 5 genes retain significance, the gene PIK3R4 is penalised. Interestingly, as a consequence of penalizing PIK3R4, other genes such as ITGB2 move up in rank.
directional_merged_pvals <- merge_p_values(pval_matrix,
method = "DPM", dir_matrix, constraints_vector)
merged_pvals <- merge_p_values(pval_matrix, method = "Brown")
sort(merged_pvals)[1:5]
## ACTN4 PPIL1 NELFE LUZP1 PIK3R4
## 1.168708e-09 2.556067e-06 3.804646e-06 1.950607e-05 4.790125e-05
## ACTN4 PPIL1 NELFE LUZP1 ITGB2
## 1.168708e-09 2.556067e-06 3.804646e-06 1.950607e-05 7.920157e-05
PIK3R4 is penalised because the fold-changes of its mRNA and protein levels are significant and have the opposite signs:
## gene rna_pval rna_log2fc protein_pval protein_log2fc
## 73 PIK3R4 0.001266285 1.155708 0.002791135 -0.8344799
## rna protein
## 0.001266285 0.002791135
## rna protein
## 1 -1
## PIK3R4
## 4.790125e-05
## PIK3R4
## 0.8122527
To assess the impact of the directional penalty on gene merged P-value signals we create a plot showing directional results on the y axis and non-directional results on the x. Blue dots are prioritised hits, red dots are penalised.
lineplot_df <- data.frame(original = -log10(merged_pvals),
modified = -log10(directional_merged_pvals))
ggplot(lineplot_df) +
geom_point(size = 2.4, shape = 19,
aes(original, modified,
color = ifelse(original <= -log10(0.05),"gray",
ifelse(modified > -log10(0.05),"#1F449C","#F05039")))) +
labs(title = "",
x ="Merged -log10(P)",
y = "Directional Merged -log10(P)") +
geom_hline(yintercept = 1.301, linetype = "dashed",
col = 'black', size = 0.5) +
geom_vline(xintercept = 1.301, linetype = "dashed",
col = "black", size = 0.5) +
geom_abline(size = 0.5, slope = 1,intercept = 0) +
scale_color_identity()
The constraints_vector parameter is provided in the merge_p_values() and ActivePathways() functions to incorporate directional penalty into the data fusion and pathway enrichment analyses. The constraints_vector parameter is a vector of size n_datasets that allows the user to represent the expected relationship between the input omics datasets, and includes the values +1, -1, and 0.
The constraints_vector should reflect the expected relative directional relationship between datasets. For example, the two constraints_vector values shown below are functionally identical.
We use different constraints_vector values depending on the type of data we are analyzing and the specific question we are asking. The simplest directional use case and the default for the ActivePathways package is to assume that the datasets have no directional constraints. This is useful when merging datasets where the relative impact of gene perturbations is not clear. For example, gene-level mutational burden and chromatin-immunoprecipitation sequencing (ChIP-seq) experiments can provide gene-level information, however, we cannot know from these datatypes whether gene function is increased or decreased. Therefore, we would set the constraints_vector for gene mutational burden and ChIP-seq datatypes to 0.
When combining datasets that have directional information, we provide the expected relative directions between datasets in the constraints_vector. This is useful when measuring different data types or different conditions.
To investigate the transcriptomic differences following gene knockdown or overexpression, we would provide constraints_vector values that reflect the expected opposing relationship between gene knockdown and gene overexpression.
The constraints_vector is also useful when merging different data types such as gene and protein expression, promoter methylation, and chromatin accessibility. Importantly, the expected relative direction between each datatype must be closely considered given the experimental conditions. For example, when combining gene expression, protein expression, and promoter methylation data measured in the same biological condition, we would provide a constraints_vector to reflect the expected agreement between gene and protein expression and disagreement with promoter methylation.
The intuition for merging these datasets is that direction of change in gene expression and protein expression tend to associate with each other according to the central dogma of biology while the direction of change of promoter methylation has the opposite effect as methylation insulates genes from transcription.
To explore how changes on the individual gene level impact biological pathways, we can compare results before and after incorporating a directional penalty.
Use the example GMT file embedded in the package.
First perform integrative pathway enrichment analysis with no directionality. Then perform directional integration and pathway enrichment analysis. For this analysis the directional coefficients and constraints_vector are from the gene-based analysis described above.
enriched_pathways <- ActivePathways(
pval_matrix, gmt = fname_GMT2, cytoscape_file_tag = "Original_")
constraints_vector <- c(1,1)
constraints_vector
## [1] 1 1
## rna protein
## ACTN4 1 1
## PIK3R4 1 -1
## PPIL1 -1 -1
## NELFE -1 -1
## LUZP1 1 1
## ITGB2 1 1
enriched_pathways_directional <- ActivePathways(
pval_matrix, gmt = fname_GMT2, cytoscape_file_tag = "Directional_", merge_method = "DPM",
scores_direction = dir_matrix, constraints_vector = constraints_vector)
Examine the pathways that are lost when directional information is incorporated in the data integration.
pathways_lost_in_directional_integration =
setdiff(enriched_pathways$term_id, enriched_pathways_directional$term_id)
pathways_lost_in_directional_integration
## [1] "REAC:R-HSA-3858494" "REAC:R-HSA-69206" "REAC:R-HSA-69242"
## [4] "REAC:R-HSA-9013149"
## term_id term_name adjusted_p_val
## <char> <char> <num>
## 1: REAC:R-HSA-3858494 Beta-catenin independent WNT signaling 0.013437464
## 2: REAC:R-HSA-69206 G1/S Transition 0.026263457
## 3: REAC:R-HSA-69242 S Phase 0.009478766
## term_size overlap evidence
## <int> <list> <list>
## 1: 143 PSMA5,PSMB4,PSMC5,PSMD11,PSMA8,GNG13,... rna,protein
## 2: 130 PSMA5,PSMB4,CDK4,PSMC5,PSMD11,CCNB1,... protein
## 3: 162 PSMA5,PSMB4,RFC3,CDK4,PSMC5,PSMD11,... combined
## Genes_rna
## <list>
## 1: GNG13,PSMC1,PSMA5,PSMB4,ITPR3,DVL1,...
## 2: NA
## 3: NA
## Genes_protein
## <list>
## 1: PSMA8,PSMD11,PSMA5,PRKG1,PSMD10,PSMB4,...
## 2: PSMD11,PSMA5,PSMD10,PSMB4,CDK7,ORC2,...
## 3: NA
## [ reached getOption("max.print") -- omitted 1 row ]
An example of a lost pathway is Beta-catenin independent WNT signaling. Out of the 32 genes that contribute to this pathway enrichment, 10 genes are in directional conflict. The enrichment is no longer identified when these genes are penalised due to the conflicting log2 fold-change directions.
wnt_pathway_id <- "REAC:R-HSA-3858494"
enriched_pathway_genes <- unlist(
enriched_pathways[enriched_pathways$term_id == wnt_pathway_id,]$overlap)
enriched_pathway_genes
## [1] "PSMA5" "PSMB4" "PSMC5" "PSMD11" "PSMA8" "GNG13" "SMURF1" "PSMC1"
## [9] "PSMA4" "PLCB2" "PRKG1" "PSMD4" "PSMD1" "PSMD10" "PSMA6" "PSMA2"
## [17] "PSMA1" "PRKCA" "PSMC6" "RHOA" "PSMB3" "PSMB1" "PSME3" "ITPR3"
## [25] "AGO4" "DVL3" "PSMA3" "PPP3R1" "DVL1" "CLTA" "PSME2" "CALM1"
## [33] "PSMD6" "PSMB6"
Examine the pathway genes that have directional disagreement and contribute to the lack of pathway enrichment in the directional analysis
pathway_gene_pvals = pval_matrix[enriched_pathway_genes,]
pathway_gene_directions = dir_matrix[enriched_pathway_genes,]
directional_conflict_genes = names(which(
pathway_gene_directions[,1] != pathway_gene_directions[,2] &
pathway_gene_directions[,1] != 0 & pathway_gene_directions[,2] != 0))
pathway_gene_pvals[directional_conflict_genes,]
## rna protein
## PSMD11 0.34121101 0.002094310
## PSMA8 0.55510836 0.001415197
## SMURF1 0.03353629 0.042995333
## PSMD1 0.04650877 0.100178048
## RHOA 0.01786687 0.474628084
## PSME3 0.07148904 0.130184883
## ITPR3 0.01660850 0.589929787
## DVL3 0.46381447 0.022535743
## PSME2 0.03274707 0.514351089
## PSMB6 0.02863259 0.677224905
## rna protein
## PSMD11 1 -1
## PSMA8 1 -1
## SMURF1 1 -1
## PSMD1 1 -1
## RHOA -1 1
## PSME3 1 -1
## ITPR3 1 -1
## DVL3 1 -1
## PSME2 -1 1
## PSMB6 -1 1
## [1] 10
To visualise differences in biological pathways between ActivePathways analyses with or without a directional penalty, we combine both outputs into a single enrichment map for plotting.
A pathway is considered to be significantly enriched if it has
adjusted_p_val <= significant
. The parameter
significant
represents the maximum adjusted P-value for a
resulting pathway to be considered statistically significant. Only the
significant pathways are returned. P-values from pathway enrichment
analysis are adjusted for multiple testing correction to provide a more
conservative analysis (see below).
## [1] 33
## [1] 36
In the most common use case, a GMT file is downloaded from a database
and provided directly to ActivePathways
as a location in
the file system. In some cases, it may be useful to load a GMT file
separately for preprocessing. The ActivePathways package includes an
interface for working with GMT objects. The GMT object can be read from
a file using the read.GMT
function. The GMT is structured
as a list of terms (e.g., molecular pathways, biological processes,
etc.). In the GMT object, each term is a list containing an id, a name,
and the list of genes associated with this term.
## [1] "id" "name" "genes"
## REAC:1630316 - Glycosaminoglycan metabolism
## ST3GAL1, CHP1, B4GALT5, ST3GAL4, SLC9A1, CHST11, B3GAT3, SLC26A1, ARSB, ABCC5, LUM, PAPSS1, SDC4, NAGLU, AC022400.6, HYAL1, NDST3, SLC35B2, B3GAT1, CHST2, DCN, CEMIP, IDS, IDUA, HS3ST3B1, B3GNT7, CHST12, CHST14, BCAN, B4GALT3, HS3ST1, B4GALT1, CSPG4, CHPF2, HEXB, UST, XYLT1, NDST2, AL050331.3, NAT6, CHSY3, NCAN, FMOD, B3GNT2, HPSE, ACAN, LYVE1, GPC2, GPC1, B4GALT2, SLC35B3, KERA, GPC4, GUSB, HYAL3, HS6ST3, HAS3, CHPF, OMD, SDC2, B3GNT3, CSGALNACT2, CSPG5, BGN, CHST6, HS6ST2, SDC3, HS2ST1, CD44, AGRN, B3GALT6, PRELP, GLB1, GPC6, B3GNT4, PAPSS2, HS3ST4, CHST9, GNS, CHST1, CSGALNACT1, HS6ST1, CHST7, AL590542.1, HYAL2, HAS1, CHSY1, OGN, B4GAT1, B4GALT4, DSEL, EXT1, SDC1, SLC35D2, EXT2, ST3GAL2, ST3GAL6, CHST3, CHST5, HSPG2, SLC26A2, HEXA, XYLT2, HS3ST6, HS3ST5, GLCE, AC244197.3, B4GALT7, HAS2, NDST4, CHST15, B4GALT6, B3GAT2, STAB2, HMMR, ST3GAL3, VCAN, NDST1, HS3ST2, GPC5, HS3ST3A1, HPSE2, GLB1L, SGSH, GPC3, CHST13, DSE
##
## REAC:5633007 - Regulation of TP53 Activity
## PPP1R13L, RFC3, SGK1, HDAC1, SETD9, RPA3, CSNK2A2, EHMT1, PDPK1, CCNG1, TAF6, ING5, TP53INP1, BRCA1, TAF1, TAF9, EHMT2, RBBP8, CHD4, USP2, BRPF1, BANP, PPP2R5C, SSRP1, TAF4, TMEM55B, PPP1R13B, TAF10, PRR5, ATM, ZNF385A, MRE11, PIP4K2C, SMYD2, BARD1, MBD3, TP53BP2, AURKA, RPA1, KAT6A, CHD3, ATR, EXO1, PIN1, RMI2, PRKAB2, AKT2, PPP2CA, TAF13, RAD9B, PPP2CB, MDM4, POU4F1, TAF4B, MDM2, BRD1, CDK2, CSNK2B, PIP4K2A, RAD50, PML, AKT3, RAD1, USP7, PRDM1, RFC4, MTOR, POU4F2, RFC2, KMT5A, DNA2, NUAK1, PIP4K2B, MAPKAP1, PPP2R1A, RBBP7, MTA2, TP53RK, TAF1L, GATAD2B, JMY, TAF12, CDK5R1, RICTOR, PHF20, AC134772.2, STK11, TAF3, BLM, ATRIP, TAF7L, TP73, MAP2K6, RPA2, PRKAG1, TOPBP1, PRKAB1, TOP3A, CCNA1, RBBP4, PPP2R1B, DAXX, TTC5, CHEK2, CDK5, RMI1, TAF2, NOC2L, TP63, CSNK2A1, UBB, PRKAG2, ING2, RNF34, MLST8, HIPK2, TBP, L3MBTL1, EP300, MAPK11, BRIP1, PRMT5, PRKAG3, RFC5, NBN, TPX2, HDAC2, CDK1, MAPKAPK5, CHEK1, TAF5, BRPF3, RPS27A, AKT1, GATAD2A, RHNO1, CCNA2, HUS1, MEAF6, PRKAA1, CDKN2A, RAD17, KAT5, SUPT16H, TAF7, PLK3, UBA52, WRN, MAPK14, TAF9B, UBC, HIPK1, TP53, DYRK2, RFFL, TAF11, BRD7, RAD9A, AURKB, PRKAA2
##
## REAC:5358346 - Hedgehog ligand biogenesis
## PSMB11, PSMD8, PSMB6, PSMB8, PSMD13, PSMA2, PSMB3, PSMB7, PSME2, PSMD1, PSMA8, PSMA1, PSMD2, PSMD14, PSMC1, IHH, SEL1L, SCUBE2, PSMB10, PSMD10, PSMA5, PSMD3, PSMC2, PSMC6, PSME1, PSME4, PSMC4, AC010132.3, PSMD5, UBB, GPC5, PSMA7, PSMB2, PSMD9, SEM1, PSMA3, DISP2, PSMB9, PSMD6, OS9, PSMC3, PSMA6, RPS27A, ADAM17, PSMB1, DHH, PSMF1, HHAT, UBA52, UBC, DERL2, PSMD4, VCP, ERLEC1, PSME3, PSMB4, P4HB, PSMA4, PSMD7, PSMD11, PSMC5, NOTUM, PSMD12, PSMB5, SYVN1, SHH
## [1] "ST3GAL1" "CHP1" "B4GALT5" "ST3GAL4" "SLC9A1"
## [6] "CHST11" "B3GAT3" "SLC26A1" "ARSB" "ABCC5"
## [11] "LUM" "PAPSS1" "SDC4" "NAGLU" "AC022400.6"
## [16] "HYAL1" "NDST3" "SLC35B2" "B3GAT1" "CHST2"
## [21] "DCN" "CEMIP" "IDS" "IDUA" "HS3ST3B1"
## [26] "B3GNT7" "CHST12" "CHST14" "BCAN" "B4GALT3"
## [31] "HS3ST1" "B4GALT1" "CSPG4" "CHPF2" "HEXB"
## [ reached getOption("max.print") -- omitted 92 entries ]
## [1] "DAP12 signaling"
The most common processing step for GMT files is the removal of gene sets that are too large or small. Here we remove pathways (gene sets) that have less than 10 or more than 500 annotated genes.
gmt <- Filter(function(term) length(term$genes) >= 10, gmt)
gmt <- Filter(function(term) length(term$genes) <= 500, gmt)
The new GMT object can now be used for analysis with
ActivePathways
## term_id term_name adjusted_p_val term_size
## <char> <char> <num> <int>
## 1: REAC:2424491 DAP12 signaling 0.0002890847 358
## 2: REAC:177929 Signaling by EGFR 0.0015162274 366
## overlap evidence
## <list> <list>
## 1: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## 2: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## Genes_X3UTR Genes_X5UTR
## <list> <list>
## 1: NA NA
## 2: NA NA
## Genes_CDS Genes_promCore
## <list> <list>
## 1: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## 2: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## [ reached getOption("max.print") -- omitted 29 rows ]
This filtering can also be done automatically using the
geneset_filter
option to the ActivePathways
function. By default, ActivePathways
removes gene sets with
less than five or more than a thousand genes from the GMT prior to
analysis. In general, gene sets that are too large are likely not
specific and less useful in interpreting the data and may also cause
statistical inflation of enrichment scores in the analysis. Gene sets
that are too small are likely too specific for most analyses and make
the multiple testing corrections more stringent, potentially causing
deflation of results. A stricter filter can be applied by running
ActivePathways
with the parameter
geneset_filter = c(10, 500)
.
## term_id term_name adjusted_p_val term_size
## <char> <char> <num> <int>
## 1: REAC:2424491 DAP12 signaling 3.660807e-05 358
## 2: REAC:177929 Signaling by EGFR 5.039994e-04 366
## overlap evidence
## <list> <list>
## 1: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## 2: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## Genes_X3UTR Genes_X5UTR
## <list> <list>
## 1: NA NA
## 2: NA NA
## Genes_CDS Genes_promCore
## <list> <list>
## 1: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## 2: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## [ reached getOption("max.print") -- omitted 29 rows ]
This GMT object can be saved to a file
To perform pathway enrichment analysis, a global set of genes needs to be defined as a statistical background set. This represents the universe of all genes in the organism that the analysis can potentially consider. By default, this background gene set includes every gene that is found in the GMT file in any of the biological processes and pathways. Another option is to provide the full set of all protein-coding genes, however, this may cause statistical inflation of the results since a sizable fraction of all protein-coding genes still lack any known function.
Sometimes the statistical background set needs to be considerably narrower than the GMT file or the full set of genes. Genes need to be excluded from the background if the analysis or experiment specifically excluded these genes initially. An example would be a targeted screen or sequencing panel that only considered a specific class of genes or proteins (e.g., kinases). In analysing such data, all non-kinase genes need to be excluded from the background set to avoid statistical inflation of all gene sets related to kinase signalling, phosphorylation and similar functions.
To alter the background gene set in ActivePathways
, one
can provide a character vector of gene names that make up the
statistical background set. In this example, we start from the original
list of genes in the entire GMT and remove one gene, the tumor
suppressor TP53. The new background set is then used for the
ActivePathways analysis.
background <- makeBackground(gmt)
background <- background[background != 'TP53']
ActivePathways(scores, gmt_file, background = background)
## term_id term_name adjusted_p_val term_size
## <char> <char> <num> <int>
## 1: REAC:2424491 DAP12 signaling 0.0022333168 358
## 2: REAC:422475 Axon guidance 0.0001161168 555
## overlap evidence
## <list> <list>
## 1: PIK3CA,KRAS,PTEN,BRAF,NRAS,B2M,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR
## Genes_X3UTR Genes_X5UTR
## <list> <list>
## 1: NA NA
## 2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,... NA
## Genes_CDS Genes_promCore
## <list> <list>
## 1: PTEN,KRAS,PIK3CA,BRAF,NRAS,CDKN1A,... NA
## 2: NA NA
## [ reached getOption("max.print") -- omitted 29 rows ]
Note that only the genes found in the background set are used for
testing enrichment. Any genes in the input data that are not in the
background set will be automatically removed by
ActivePathways
.
A key feature of ActivePathways
is the integration of
multiple complementary omics datasets to prioritise genes for the
pathway analysis. In this approach, genes with significant scores in
multiple datasets will get the highest priority, and certain genes with
weak scores in multiple datasets may be ranked higher, highlighting
functions that would be missed when only single datasets were analysed.
ActivePathways
accomplishes this by merging the series of
p-values in the columns of the scores matrix for each gene into a single
combined P-value. The four methods to merge P-values are Fisher’s method
(the default), Brown’s method (extension of Fisher’s), Stouffer’s method
and Strube’s method (extension of Stouffer’s). Each of these methods
have been extended to account for the directional activity of genes
across multi-omics datasets with Fisher_directional, DPM,
Stouffer_directional, and Strube_directional methods. The Brown’s and
Strube’s methods are more conservative in the case when the input
datasets show some large-scale similarities (i.e., covariation), since
they will take that into account when prioritising genes across similar
datasets. The Brown’s or Strube’s method are recommended for most cases
since omics datasets are often not statistically independent of each
other and genes with high scores in one dataset are more likely to have
high scores in another dataset just by chance.
The following example compares the merged P-values of the first few genes between the four methods. Fisher’s and Stouffer’s method are two alternative strategies to merge p-values and as a result the top scoring genes and p-values may differ. The genes with the top scores for Brown’s method are the same as Fisher’s, but their P-values are more conservative. This is the case for Strube’s method as well, in which the top scoring genes are the same as Stouffer’s method, but the P-values are more conservative.
## TP53 VHL PIK3CA KRAS PTEN
## 2.275933e-32 4.677780e-31 9.878802e-30 5.169163e-29 5.813021e-29
## TP53 VHL PIK3CA KRAS PTEN
## 2.850936e-21 1.933490e-20 1.334809e-19 3.808817e-19 4.102937e-19
## TP53 VHL PIK3CA KRAS PTEN
## 5.889756e-18 1.228308e-15 3.067765e-15 1.396875e-13 2.275649e-13
## TP53 VHL PIK3CA KRAS PTEN
## 1.170932e-11 3.249661e-10 5.746455e-10 6.206504e-09 8.414282e-09
This function can be used to combine some of the data before the
analysis for any follow-up analysis or visualisation. For example, we
can merge the columns X5UTR
, X3UTR
, and
promCore
into a single non_coding
column
(these correspond to predictions of driver mutations in 5’UTRs, 3’UTRs,
and core promoters of genes, respectively). This will consider the three
non-coding regions as a single column, rather than giving them all equal
weight to the CDS
column.
scores2 <- cbind(scores[, 'CDS'], merge_p_values(scores[, c('X3UTR', 'X5UTR', 'promCore')], 'Brown'))
colnames(scores2) <- c('CDS', 'non_coding')
scores[c(2179, 1760),]
## X3UTR X5UTR CDS promCore
## TP53 0.8331532 0.005613463 1.842404e-33 0.0254239
## PTEN 0.1627596 0.310132725 2.180301e-32 0.6867016
## CDS non_coding
## TP53 1.842404e-33 0.0189964
## PTEN 2.180301e-32 0.3437851
## term_id term_name adjusted_p_val term_size
## <char> <char> <num> <int>
## 1: REAC:2424491 DAP12 signaling 4.491268e-05 358
## 2: REAC:422475 Axon guidance 4.625918e-04 555
## overlap evidence
## <list> <list>
## 1: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
## Genes_X3UTR Genes_X5UTR
## <list> <list>
## 1: NA NA
## 2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,... NA
## Genes_CDS
## <list>
## 1: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## 2: NA
## Genes_promCore
## <list>
## 1: NA
## 2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
## [ reached getOption("max.print") -- omitted 32 rows ]
## term_id term_name adjusted_p_val term_size
## <char> <char> <num> <int>
## 1: REAC:2424491 DAP12 signaling 0.0003694968 358
## 2: REAC:422475 Axon guidance 0.0051873493 555
## 3: REAC:177929 Signaling by EGFR 0.0014545468 366
## overlap evidence
## <list> <list>
## 1: TP53,PIK3CA,PTEN,KRAS,BRAF,NRAS,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,RPS6KA3,CALM2,... combined
## 3: TP53,PIK3CA,PTEN,KRAS,BRAF,NRAS,... CDS
## Genes_CDS Genes_non_coding
## <list> <list>
## 1: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## 2: NA NA
## 3: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,... NA
## [ reached getOption("max.print") -- omitted 27 rows ]
To perform pathway enrichment of the ranked gene list of merged
P-values, ActivePathways
defines a P-value cutoff to filter
genes that have little or no significance in the series of omics
datasets. This threshold represents the maximum p-value for a gene to be
considered of interest in our analysis. The threshold is
0.1
by default but can be changed using the
cutoff
option. The default option considers raw P-values
that have not been adjusted for multiple-testing correction. Therefore,
the default option provides a relatively lenient approach to filtering
the input data. This is useful for finding additional genes with weaker
signals that associate with well-annotated and strongly significant
genes in the pathway and functional context.
## [1] 33
## [1] 18
Multiple testing correction is essential in the analysis of omics
data since each analysis routinely considers thousands of hypotheses and
apparently significant P-values will occur by chance alone.
ActivePathways
uses multiple testing correction at the
level of pathways as P-values from the ranked hypergeometric test are
adjusted for multiple testing (note that the ranked gene list provided
to the ranked hypergeometric test remain unadjusted for multiple testing
by design).
The package uses the p.adjust
function of base R to run
multiple testing corrections and all methods in this function are
available. By default, ‘holm’ correction is used. The option
correction_method = 'none'
can be used to override P-value
adjustment (not recommended in most cases).
## [1] 33
## [1] 88
Consider the results object from the basic use case of
ActivePathways
## term_id term_name adjusted_p_val term_size
## <char> <char> <num> <int>
## 1: REAC:2424491 DAP12 signaling 4.491268e-05 358
## 2: REAC:422475 Axon guidance 4.625918e-04 555
## overlap evidence
## <list> <list>
## 1: TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,... CDS
## 2: PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
## Genes_X3UTR Genes_X5UTR
## <list> <list>
## 1: NA NA
## 2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,... NA
## Genes_CDS
## <list>
## 1: TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
## 2: NA
## Genes_promCore
## <list>
## 1: NA
## 2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
## [ reached getOption("max.print") -- omitted 32 rows ]
The columns term_id
, term_name
, and
term_size
give information about each pathway detected in
the enrichment analysis. The adjusted_p_val
column with the
adjusted P-value indicates the confidence that the pathway is enriched
after multiple testing correction.
The overlap
column provides the set of genes from the
integrated gene list that occur in the given enriched gene set (i.e.,
molecular pathway or biological process). These genes were quantified
across multiple input omics datasets and prioritized based on their
joint significance in the input data. Note that the genes with the
strongest scores across the multiple datasets are listed first.
## [[1]]
## [1] "TP53" "PIK3CA" "KRAS" "PTEN" "BRAF" "NRAS" "B2M" "CALM2"
## [9] "CDKN1A" "CDKN1B"
##
## [[2]]
## [1] "PIK3CA" "KRAS" "BRAF" "NRAS" "CALM2" "RPS6KA3" "ACTB"
## [8] "EFNA1" "SCN2B" "EPHA2" "GAP43" "COL4A1" "RASAL2" "CLTC"
## [15] "IQGAP1" "NF1" "FGF9" "ADAM10" "PTPRC" "ITGA10" "PDGFB"
## [22] "COL4A2" "RGMB" "RASA1" "FGF6" "CALM1" "PSMB7" "PPP2R5D"
## [29] "PPP2R1A" "RAP1A" "CNTNAP1" "GRIN2B" "GRB2" "DLG1" "MET"
## [ reached getOption("max.print") -- omitted 6 entries ]
##
## [[3]]
## [1] "TP53" "PIK3CA" "KRAS" "PTEN" "BRAF" "NRAS" "CALM2" "CDKN1A"
## [9] "CDKN1B"
This column is useful for further data analysis, allowing the researcher to go from the space of enriched pathways back to the space of individual genes and proteins involved in pathways and their input omics datasets.
The evidence
column provides insights to which of the
input omics datasets (i.e., columns in the scores matrix) contributed to
the discovery of this pathway or process in the integrated enrichment
analysis. To achieve this level of detail, ActivePathways
also analyses the gene lists ranked by the individual columns of the
input matrix to detect enriched pathways. The evidence
column lists the name of a given column of the input matrix if the given
pathway is detected both in the integrated analysis and the analysis of
the individual column. For example, in this analysis the majority of the
detected pathways have only ‘CDS’ as their evidence, since these
pathways were found to be enriched in data fusion through P-value
merging and also by analysing the gene scores in the column
CDS
(for reference, CDS corresponds to protein-coding
sequence where the majority of known driver mutations have been found).
As a counter-example, the record for the pathway
REAC:422475
in our results lists as evidence
list('X3UTR', 'promCore')
, meaning that the pathway was
found to be enriched when considering either the X3UTR
column, the promCore
column, or the combined omics
datasets.
## evidence1 evidence2
## "X3UTR" "promCore"
Finally, if a pathway is found to be enriched only with the combined data and not in any individual column, ‘combined’ will be listed as the evidence. This subset of results may be particularly interesting since it highlights complementary aspects of the analysis that would remain hidden in the analysis of any input omics dataset separately.
The following columns named as Genes_{column}
help
interpret how each pathway was detected in the multi-omics data
integration, as listed in the column evidence
. These
columns show the genes present in the pathway and any of the input omics
datasets. If the given pathway was not identified using the scores of
the given column of the input scores matrix, an NA
value is
shown. Again, the genes are ranked by the significance of their scores
in the input data, to facilitate identification of the most relevant
genes in the analysis.
The results are returned as a data.table
object due to
some additional data structures needed to store lists of gene IDs and
supporting evidence. The usual R functions write.table
and
write.csv
will struggle with exporting the data unless the
gene and evidence lists are manually transformed as strings.
Fortunately, the fwrite
function of data.table
can be used to write the file directly and the ActivePathways package
includes the function export_as_CSV
as a shortcut that uses
the vertical bar symbol to concatenate gene lists.
result_file <- paste('ActivePathways_results.csv', sep = '/')
export_as_CSV (res, result_file) # remove comment to run
read.csv(result_file, stringsAsFactors = F)[1:3,]
The fwrite
can be called directly for customised
output.
The Cytoscape software and the EnrichmentMap app provide powerful
tools to visualise the enriched pathways from
ActivePathways
as a network (i.e., an Enrichment Map). To
facilitate this visualisation step, ActivePathways
provides
the files needed for building enrichment maps. To create these files, a
file prefix must be supplied to ActivePathways
using the
argument cytoscape_file_tag
. The prefix can be a path to an
existing writable directory.
Four files are written using the prefix:
enrichmentMap__pathways.txt
contains the table of
significant terms (i.e. molecular pathways, biological processes, other
gene sets) and the associated adjusted P-values. Note that only terms
with adjusted_p_val <= significant
are written.
enrichmentMap__subgroups.txt
contains a matrix
indicating the columns of the input matrix of P-values that contributed
to the discovery of the corresponding pathways. These values correspond
to the evidence
evaluation of input omics datasets
discussed above, where a value of one indicates that the pathway was
also detectable using a specific input omics dataset. A value of zero
indicates otherwise. This file will not be generated if a single-column
matrix of scores corresponding to just one omics dataset is provided to
ActivePathways
.
enrichmentMap__pathways.gmt
contains a shortened
version of the supplied GMT file which consists of only the significant
pathways detected by ActivePathways
.
enrichmentMap__legend.pdf
is a pdf file that
displays a color legend of different omics datasets visualised in the
enrichment map that can be used as a reference.
The following sections will discuss how to create a pathway
enrichment map using the results from ActivePathways
. The
datasets analysed earlier in the vignette will be used. To follow the
steps, save the required files from ActivePathways
in an
accessible location.
ActivePathways
writes four files that are used to build
enrichment maps in Cytoscape.
files <- c(system.file('extdata', 'enrichmentMap__pathways.txt', package='ActivePathways'),
system.file('extdata', 'enrichmentMap__subgroups.txt', package='ActivePathways'),
system.file('extdata', 'enrichmentMap__pathways.gmt', package='ActivePathways'),
system.file('extdata', 'enrichmentMap__legend.pdf', package='ActivePathways'))
The following commands will perform the basic analysis again and
write output files required for generating enrichment maps into the
current working directory of the R session. All file names use the
prefix enrichmentMap__
. The generated files are also
available in the ActivePathways
R package as shown
above.
gmt_file <- system.file('extdata', 'hsapiens_REAC_subset.gmt', package = 'ActivePathways')
scores_file <- system.file('extdata', 'Adenocarcinoma_scores_subset.tsv', package = 'ActivePathways')
scores <- read.table(scores_file, header = TRUE, sep = '\t', row.names = 'Gene')
scores <- as.matrix(scores)
scores[is.na(scores)] <- 1
res <- ActivePathways(scores, gmt_file, cytoscape_file_tag = "enrichmentMap__")
The four files written are:
enrichmentMap__pathways.txt
, a table of significant
pathways and the associated adjusted P-values.
enrichmentMap__subgroups.txt
, a table of pathways
and corresponding omics datasets supporting the enrichment of those
pathways. This corresponds to the evidence
column of the
ActivePathways
result object discussed above.
enrichmentMap__pathways.gmt
, a shortened version of
the supplied GMT file which consists of only the significant pathways
detected by ActivePathways
.
enrichmentMap__legend.pdf
, a reference color legend
of different omics datasets visualised in the enrichment map.
The following code will examine a few lines of the files generated by
ActivePathways
.
## term_id term_name adjusted_p_val
## REAC:2424491 DAP12 signaling 4.49126833230489e-05
## REAC:422475 Axon guidance 0.00046259184602835
## REAC:177929 Signaling by EGFR 0.000619750411866923
## REAC:2559583 Cellular Senescence 6.59544666946083e-05
## term_id X3UTR X5UTR CDS promCore combined instruct
## REAC:2424491 0 0 1 0 0 piechart: attributelist="X3UTR,X5UTR,CDS,promCore,combined" colorlist="#FF0000,#CCFF00,#00FF66,#0066FF,#FFFFF0" showlabels=FALSE
## REAC:422475 1 0 0 1 0 piechart: attributelist="X3UTR,X5UTR,CDS,promCore,combined" colorlist="#FF0000,#CCFF00,#00FF66,#0066FF,#FFFFF0" showlabels=FALSE
## REAC:177929 0 0 1 0 0 piechart: attributelist="X3UTR,X5UTR,CDS,promCore,combined" colorlist="#FF0000,#CCFF00,#00FF66,#0066FF,#FFFFF0" showlabels=FALSE
## REAC:2559583 0 0 1 0 0 piechart: attributelist="X3UTR,X5UTR,CDS,promCore,combined" colorlist="#FF0000,#CCFF00,#00FF66,#0066FF,#FFFFF0" showlabels=FALSE
## REAC:1257604 PIP3 activates AKT signaling MTOR ERBB4 NR4A1 MET TSC2 AGO1 AKT3 PIP4K2A TNRC6B TRIB3 FOXO3 GSK3A MDM2 FGF1 PIP5K1A PHLPP1 PPP2CB KLB PPP2CA FGF8 CREB1 AKT2 CDKN1B BAD FGF9 PPP2R5E PIK3AP1 FOXO4 RPS6KB2 AKT1S1 AL358075.4 FGF2 ICOS BTC CDKN1A KITLG FGF17 PIK3CB PIK3R2 FGF4 FGF18 PIP4K2C FGF7 PDGFA PRR5 EREG PPP2R5B FGF19 FGFR1 VAV1 ERBB3 HBEGF FGF3 PPP2R5C PDGFRB PDGFB TNRC6C CD80 FGFR3 MIR26A2 TRAT1 FGF23 AGO3 PTPN11 PIP5K1B GAB1 MAPK1 LCK NRG1 TNRC6A MIR26A1 PDPK1 FGF22 CHUK FOXO1 NRG2 PIK3R1 FGF20 FGFR2 PPP2R5D SRC PIK3CA TP53 NRG4 MOV10 KIT PIP5K1C NRG3 PTEN CD19 INS HGF AGO2 PHLPP2 IRS2 PIK3CD FGFR4 PPP2R5A AKT1 AL672043.1 KL PIK3R3 EGFR FRS2 IER3 INSR MLST8 CD28 FYN AGO4 FGF16 THEM4 FGF5 IRS1 PPP2R1B FGF10 EGF MAPK3 GSK3B PDGFRA RICTOR ERBB2 CASP9 GRB2 PPP2R1A FGF6 MAPKAP1 PIP4K2B INS-IGF2 CD86
## REAC:212436 Generic Transcription Pathway ZNF776 ZNF736 ZNF700 CHM ZNF658B TP53AIP1 CASP10 RARG ZNF235 HUS1 ZNF200 ZNF17 ZNF664 RRM2B TFAP2C ZNF786 PSMB1 ZNF737 TAF5 NDUFA4 COX6B1 TOM1L1 ZNF493 ZNF649 NBN ZNF557 RFC5 POLR2I ZNF213 PRMT5 ZNF600 ZNF418 ZNF671 MAPK11 ZNF426 SRC TAF11 GTF2H4 BTG2 ZNF655 KIT ERCC2 ZNF677 ZNF709 ZNF234 ATAD2 SCO1 TAF9B ZNF28 ZNF691 MAPK14 ZNF584 ZNF317 CTGF ZNF445 ZNF492 ZNF771 KAT5 ZNF558 BCL6 TEAD3 TBL1X ZIM3 TAF7L ZNF157 ZNF136 ZNF415 ZNF792 BLM PSMD1 STK11 RICTOR NELFCD DDB2 ZNF446 SP1 ZNF727 NOTCH2 RUNX3 TRAPPC2 LAMTOR5 ZNF460 ZNF331 PSMB6 NKX2-5 ZKSCAN5 PPP2R1A MAPKAP1 CITED1 ZNF554 DNA2 MED1 SMAD3 ZNF681 CASP2 JUN ZNF20 ZNF839 ZNF14 CTNNB1 EP300 PSMC4 ZNF189 L3MBTL1 NDRG1 MED15 SPP1 ZNF567 RORB ZNF483 PRKAG2 KDM5B ZNF12 CCNT1 NOC2L CGA TNKS1BP1 NR2F1 RXRB RMI1 RABGGTB ZNF705E MT-CO3 CCNA1 ZNF697 GLS WWTR1 ESRRA BCL2L11 KRBOX4 PPP2CB CCND1 ZNF112 ZNF740 ZNF10 NR2E1 ZNF486 SMARCD3 PSMB9 PIN1 PSMF1 ELOC GTF2H5 NR2C2AP PSMC3 MYBL2 ZNF223 PSMA6 ZNF41 RABGGTA POLR2B ZNF225 CHD3 ZNF75A KAT6A MED26 HNF4A CNOT4 CENPJ RUNX2 DEK ZNF23 TCEA1 MBD3 ZNF540 BARD1 ZNF595 NR5A1 ZNF343 TXN PRDX1 ZNF33A WWOX MED24 BRD2 TSC2 POLR2D ZNF440 RAD1 GATA4 CTDP1 ZNF479 ZNF347 ZNF619 ZNF749 CSNK2B ZNF211 CDK13 CDK2 PPM1A ZNF256 FOXO3 ZNF433 SMURF2 ZNF37A COX11 ZNF227 TFAP2D ZNF33B ZNF133 PSMD14 BRPF1 ESRRB ZNF577 USP2 ITGAL PCNA UBE2I CITED2 YWHAE IGFBP3 ZNF224 ZNF772 BCL2L14 ZNF555 TAF1 BRCA1 NR3C2 NOP2 ZNF547 ZNF729 RBL2 TNRC6A ZNF706 CASP6 PLAGL1 CSNK2A2 SKIL HDAC1 RXRG CDKN2B PSME2 PSMB8 SMURF1 PSMA2 UBE2D3 PSMD13 AC010132.3 AC003002.1 CNOT11 ZFP1 PMS2 ATM ZNF385A ZNF304 ERCC3 ZNF519 TAF10 ZNF597 POLR2J PSME1 SUPT4H1 THRB TAF4 MNAT1 PARP1 ZNF160 ZNF230 PLK2 ZNF333 NR2E3 MAML3 ZNF684 BANP CNOT7 BNIP3L ZNF773 ZNF202 ZNF425 CDK9 TCF7L2 ZNF692 ZNF710 ZSCAN25 ZNF180 ZNF670 ZNF510 GATAD2A MT-CO1 ZNF552 COX14 RPS27A BRPF3 NRBP1 MAPKAPK5 HDAC2 CDK1 ZNF568 ZNF221 GTF2F1 COX5B BRIP1 AURKB ZNF208 TFDP1 ZNF667 COX7A2L RGCC RAD9A BRD7 ZNF717 PSMD12 ZNF571 POLR2F ZNF614 MYC UBA52 PTEN MED6 SUPT16H PSMD7 NR1H3 PRDX2 PSME3 CDKN2A PRKAA1 ZNF155 COX16 ATRIP SMAD2 ZFP14 NR5A2 ZNF468 ZNF607 GATAD2B RBBP7 PRDX5 RRAGA ZNF680 NUAK1 SMAD4 VEGFA ZNF471 NCOR1 ZNF79 TCF7 RXRA COX19 NPPA GTF2H3 RNF34 ZNF613 ING2 UBB AGO4 CSNK2A1 MED10 CITED4 ZNF778 ZNF282 PPARA CDK5 ZNF429 ZKSCAN1 PRKAB1 ELOA ZNF138 TMEM219 ZNF70 ZNF431 RPA2 ZNF26 SCO2 CCNH TRIM28 RARA ZKSCAN8 RUNX1 GLS2 ZNF735 PPP2CA ZNF543 RAD51D ZNF582 APAF1 ZNF720 ZNF782 EXO1 LAMTOR4 NR1H4 KCTD15 ZFP28 CNOT2 CCNE2 PSMA7 TGS1 RPA1 ZNF19 NR3C1 PSMA3 ZNF337 TP53BP2 ZNF586 ZSCAN32 SURF1 MTOR COX18 POLR2G SESN3 CNOT6 GTF2H1 PSMC5 ZNF714 NR4A1 AGO1 ZNF716 CNOT6L ZNF154 ZNF92 THRA ZNF775 RAD50 AR PSMD4 TBX5 BRD1 ZNF490 ZNF324B ZNF263 MAML2 ZNF212 PSMA4 E2F7 MED17 TAF4B MED27 POU4F1 ZNF254 PSMB4 ZNF18 ZIM2 ZIK1 ZNF790 ZNF641 TGIF1 ELOA3D ELOA3B CHD4 CRADD HELZ2 ZNF222 COX8A AGO3 RBBP8 ZNF561 ELL ZNF302 MDC1 MEN1 RORC HKR1 PDPK1 KAT2B PMAIP1 ZNF589 ZNF585A MED7 PRELID1 E2F5 NOTCH3 ZNF544 ZNF257 SMYD2 ZNF248 NR1H2 TFDP2 PIP4K2C ZNF267 ELOB TNFRSF10C PRMT1 ELOA2 CCNT2 ZNF253 YY1 TMEM55B ZNF573 ZNF559 TP53I3 ZNF724 NEDD4L ZFHX3 ZNF702P NRBF2 ZNF436 JAG1 ZNF443 ESRRG NR2F6 ZNF761 JUNB FANCC MEAF6 LEF1 ZNF514 NLRC4 CCNA2 ESR2 ZNF34 RBL1 ZNF606 CCNB1 BBC3 AKT1 TNFRSF10B POLR2E VDR ITGA4 ZNF668 MED13 SMAD7 ZNF682 ZNF774 CDK12 PSMD9 PSMB2 ZNF688 EGFR ZNF675 NR2C2 ZNF662 PIDD1 ZNF141 FOS ZNF777 PRKAA2 ZNF564 MED20 ZNF3 ZNF484 KRAS MED25 ZNF99 SUPT5H ZNF430 ZNF268 ZNF570 RFFL NR6A1 DYRK2 ZNF135 CASP1 UBC BAX WRN PLK3 LAMTOR3 TAF7 ZNF214 PITX2 ATP1B4 RAD17 ZNF708 CYCS NOTCH4 TAF3 ZNF705A COX20 POLR2A PHF20 CDK5R1 TAF12 ZNF383 PSMD10 ZNF703 ZNF705D ZKSCAN4 ZNF611 ZNF419 CGB8 ZNF197 HDAC4 TFAP2A CNOT8 POLR2L TRIM33 COX7C ZNF620 ZNF665 ZNF454 ZNF799 KMT5A AC002310.5 PSMB7 POU4F2 TGIF2 TBL1XR1 RPTOR TBP ZNF354C RARB CREBBP MLST8 ZNF77 ZNF566 RRAGD LAMTOR2 TACO1 TP63 FANCI CNOT3 ZNF626 PSMC2 COX6A1 CHEK2 CBFB ZFP37 ZNF287 ZNF718 DAXX RBBP4 CCNE1 NCOA1 TRIAP1 ZNF615 PRKAG1 PRDM7 NR1D1 GADD45A RAD9B TAF13 ZNF730 ZNF610 E2F4 ZNF721 CDC25C ZNF517 HSPD1 CDKN1B MED16 CCNK RHEB NCOA6 ATR ZNF300 ZNF770 MT-CO2 CGB5 ZNF696 MAMLD1 ZNF750 NR4A3 ZNF658 CDKN1A ZNF528 AURKA YAP1 MAML1 ZNF195 TEAD4 ZNF689 NOTCH1 CNOT1 RBPJ PERP ZNF599 ZNF556 NR1I3 ZNF205 ZNF439 CNOT10 AIFM2 ZNF506 ZNF669 PRDM1 ZNF266 NR0B2 ZNF100 COX4I1 YWHAZ TNRC6B ZNF746 YWHAH POLR2K ZNF624 ZNF184 ZNF548 ESR1 PRELID3A TCF7L1 ZFP69B PSMC1 GPI TNRC6C STEAP3 ZNF676 TSC1 MED14 EHMT2 YWHAQ ZNF114 ING5 TP53INP1 CCNG1 TIGAR ZNF530 GSR SETD9 ZNF804B ZNF747 ZNF354B ZNF678 PSMB3 YEATS4 PPP1R13L ZNF273 MRE11 ZNF587 G6PD ZNF585B PSMD5 NELFB ZNF75D NR1D2 ZNF101 ZNF699 ZNF473 TEAD1 ZNF660 ZNF767P PPP1R13B ZNF398 ZNF442 ZNF565 ZNF563 SERPINE1 PSMD3 CCNC SSRP1 ZNF350 COX7B ZNF711 ZNF285 PPP2R5C ZNF286A ZNF704 ZNF500 ZNF461 FANCD2 PPARG TNFRSF10D PSMD6 ZNF860 KCTD1 ZNF416 RHNO1 CARM1 SNW1 ZNF562 NR1I2 ZNF394 SESN1 ZNF583 CHEK1 LRPPRC ZNF264 CDK7 TPX2 ELOA3 NCOR2 PPARD PRKAG3 BID ZNF549 ZNF785 RRAGB KRBA1 TEAD2 MSH2 COX6C ZNF840P HES1 ZNF71 PSMB5 TP53 ZNF334 MOV10 HIPK1 ZNF529 ZNF354A TFAP2B ZNF485 MED4 HNF4G PSMD11 ZNF726 AGO2 ZNF43 MAP2K6 PSMA1 TP73 TXNRD1 LAMTOR1 AC134772.2 BIRC5 POLR2H COX5A ERBB2 JMY ZNF124 TAF1L TP53RK MTA2 ZNF791 MED30 NPM1 ZNF605 PIP4K2B ZFP30 ZNF496 ZNF226 RFC2 NELFE ZNF480 MED23 HIPK2 YWHAB MLH1 SLC38A9 PSMC6 TAF2 ZKSCAN7 ZNF30 TTC5 PCBP4 GTF2H2 PPP2R1B ZFP69 TOP3A TOPBP1 ZNF233 CDK8 USP9X ZNF546 ZNF701 CHD9 ZNF324 ZKSCAN3 AKT2 ZNF215 ARID3A PRKAB2 MED8 RMI2 ZNF793 ZNF732 ZNF74 ZNF738 MED31 SEM1 NELFA ZNF560 ZNF627 ZNF25 ZFP90 ZNF551 TNFRSF10A SUMO1 ZNF596 NR0B1 GPX2 ZNF311 APOE RFC4 NCOA2 ZNF705G E2F8 USP7 GTF2F2 ZNF250 AKT3 NR2C1 MED12 PML RRAGC ZNF713 PIP4K2A ZNF616 ZNF274 RORA ZNF569 CNOT9 RNF111 TGFB1 ZNF2 MDM2 ATF2 SFN MDM4 ZNF432 PSMD2 ZFP2 FAS MIR26A2 PSMB10 POLR2C E2F1 TAF9 ZNF621 TAF6 MIR26A1 CGB3 ZNF420 EHMT1 TGFA PSMD8 RPA3 ZNF625 PGR SGK1 KAT2A ZNF764 ZNF382 ZNF707 ZNF45 RFC3 TFAP2E SESN2 NR4A2 ZNF320 PRR5 ZNF470 ZNF140 ZNF441 SKI UBE2D1 ZNF175 DDIT4 ZNF679 PSMA5 YWHAG ZNF169 ZNF417 ZNF550
+
Add
Data Set from Files in the top left corner of the dialogue.enrichmentMap__pathways.txt
and
enrichmentMap__pathways.gmt
in the Enrichments and
GMT fields, respectively.To color nodes in the network (i.e., molecular pathways, biological
processes) according to the omics datasets supporting the enrichments,
the third file enrichmentMap__subgroups.txt
needs to be
imported to Cytoscape directly. To import the file, activate the menu
option File -> Import -> Table from File and select the
file enrichmentMap__subgroups.txt
. In the following
dialogue, select To a Network Collection in the dropdown menu
Where to Import Table Data. Click OK to proceed.
Next, Cytoscape needs to use the imported information to color nodes using a pie chart visualisation. To enable this, click the Style tab in the left control panel and select the Image/Chart1 Property in a series of dropdown menus (Properties -> Paint -> Custom Paint 1 -> Image/Chart 1).
The image/Chart 1 property now appears in the Style control panel. Click the triangle on the right, then set the Column to instruct and the Mapping Type to Passthrough Mapping.
This step colours the nodes corresponding to the enriched pathways
according to the supporting omics datasets, based on the scores matrix
initially analysed in ActivePathways
.
To allow better interpretation of the enrichment map,
ActivePathways
generates a color legend in the file
enrichmentMap__legend.pdf
that shows which colors
correspond to which omics datasets.
Note that one of the colors corresponds to a subset of enriched pathways with combined evidence that were only detected through data fusion and P-value merging and not when any of the input datasets were detected separately. This exemplifies the added value of integrative multi-omics pathway enrichment analysis.
From the drop-down Properties menu, select Border Line Type.
Set Column to directional impact and Mapping Type to Discrete Mapping. Now we can compare findings between a non-directional and a directional method. We highlight pathways that were shared (0), lost (1), and gained (2) between the approaches. Here, we have solid lines for the shared pathways, dots for the lost pathways, and vertical lines for the gained pathways. Border widths can be adjusted in the Border Width property, again with discrete mapping.
This step changes node borders in the aggregated enrichment map, depicting the additional information provided by directional impact.
For a more diverse range of colors, ActivePathways supports any color palette from RColorBrewer. The color_palette parameter must be provided.
res <- ActivePathways(scores, gmt_file, cytoscape_file_tag = "enrichmentMap__", color_palette = "Pastel1")
Instead, to manually input the color of each dataset the custom_colors parameter must be specified as a vector. This vector should contain the same number of colors as columns in the scores matrix.
res <- ActivePathways(scores, gmt_file, cytoscape_file_tag = "enrichmentMap__", custom_colors = c("violet","green","orange","red"))
To change the color of the combined contribution, a color must be provided to the color_integrated_only parameter.
Tip: if the coloring of nodes did not work in Cytoscape after setting the options in the Style panel, check that the EnhancedGraphics Cytoscape app is installed.
Mykhaylo Slobodyanyuk^, Alexander T. Bahcheli^, Zoe P. Klein, Masroor Bayati, Lisa J. Strug, Jüri Reimand. Directional integration and pathway enrichment analysis for multi-omics data. Nature Communications (2024) (^ - co-first authors) https://www.nature.com/articles/s41467-024-49986-4 https://pubmed.ncbi.nlm.nih.gov/38971800/.
Integrative Pathway Enrichment Analysis of Multivariate Omics Data. Paczkowska M, Barenboim J, Sintupisut N, Fox NS, Zhu H, Abd-Rabbo D, Mee MW, Boutros PC, PCAWG Drivers and Functional Interpretation Working Group; Reimand J, PCAWG Consortium. Nature Communications (2020) https://pubmed.ncbi.nlm.nih.gov/32024846/ https://doi.org/10.1038/s41467-019-13983-9.
Pathway Enrichment Analysis and Visualization of Omics Data Using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Reimand J, Isserlin R, Voisin V, Kucera M, Tannus-Lopes C, Rostamianfar A, Wadi L, Meyer M, Wong J, Xu C, Merico D, Bader GD. Nature Protocols (2019) https://pubmed.ncbi.nlm.nih.gov/30664679/ https://doi.org/10.1038/s41596-018-0103-9.