pARI
is an R package developed to a perform
permutation-based closed testing method. It computes a simultaneous
lower bound for the true discovery proportions of all possible subsets
of a hypothesis testing problem.
pARI
find the percentage of true discoveries for each
set of statistical tests while controlling the familywise error rate for
multiple testing and taking into account that the set was chosen in a
data-driven way.
Permutation theory adapts to the correlation structure, as a simultaneous method, it allows the decision of which hypotheses sets to analyze to be entirely flexible and post-hoc, that is, the user can choose it after seeing the data and revise the choice as often as he/she wants.
pARI
is entirely mild, flexible, and post-hoc. The
required input is the permutation p-values matrix, i.e., null p-values
distribution, that describes the p-values associated with each feature’s
statistical tests and permutation. pARI is valid if the
exchangeability assumption under the null hypothesis is satisfied for
the permutation procedure’s validity. If the permutation matrix is not
available, the user can directly insert the data specifying the type of
test to perform for each feature.
pARI
for each set of features, i.e., clusters, returns
the simultaneous lower confidence bound to the actual proportion of
significant features. The analysis can be carried out as many times as
the researcher wants; also, he/she can drill down into the cluster as
often as the user wants without making any selection error and ensuring
the family-wise error rate (FWER).
The pARI
package can be installed by
There are two main functions in the pARI
package.
The function pARIbrain
was developed for the fMRI
cluster analysis framework, while the function pARI
was
developed for every multiple-testing framework.
We perform a simple simulation using the function
simulateData
. \(1000\)
features are generated \(30\) times as
normally distributed with mean \(0\)
under the null hypothesis and mean under the alternative computed
considering the difference in means having the power of the one-sample
t-test equals \(0.8\). The proportion
of true null hypothesis equals \(\pi =
0.8\).
pARI
then computes the lower bound for the number of
true discoveries inside the set containing the first \(200\) features. The user must specify the
cluster in the ix
set, i.e., from \(1\) to \(200\) in this case. We apply the one-sample
t-test for each feature.
out <- pARI(X = datas, ix = c(1:200), test.type = "one_sample", seed = 123)
out$TDP
#> [,1]
#> [1,] 0.68
Therefore, we can say that at least 68 of features are truly
significant inside the ix
cluster.
However, the pARI
function can analyzed directly the
matrix of permuted p-values. We can compute it by signTest
function:
out <- signTest(X = datas, B = 1000, rand = F)
P <- cbind(out$pv, out$pv_H0)
pARI(pvalues = P, ix = c(1:200),test.type = "one_sample")$TDP
#> [,1]
#> [1,] 0.72
The set of features can also be expressed as a vector with length equals the number of features where different values indicate the different sets. For example, we can construct four random clusters as
ix <- sample(c(1:4), size = 1000, replace = T)
out <- pARI(pvalues = P, ix = ix,test.type = "one_sample", clusters = TRUE)$TDP
out
#> [1] 0.1298701 0.1150794 0.1198502 0.1000000
specifying in the pARI
function
cluster = TRUE
. Then, pARI
returns the lower
bound for the true discovery proportion for each set of features. We can
say that we have at least 12.987013 of truly active features in the
first cluster.
Let consider a simple example using Montgomery et al. (2010) and
Pickrell et. al (2010) data, i.e., an expression set that combines two
studies transcriptome genetics using second generation sequencing
HapMap. The data comprises \(52580\)
genes and \(129\) samples (i.e., \(60\) unrelated Caucasian individuals of
European descent and \(69\) unrelated
Nigerian individuals). After pre-processing steps, we perform a
two-sample t-test for each gene, and we define the sets of interest the
ones computed by the hclust
function.
if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
if (!requireNamespace("dynamicTreeCut", quietly = TRUE)){
install.packages("dynamicTreeCut")
}
#BiocManager::install(c("Biobase","genefilter", "EnrichmentBrowser"))
library(Biobase)
library(genefilter)
library(dynamicTreeCut)
load(file=url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData"))
pdata<- pData(montpick.eset)
edata <- as.matrix(exprs(montpick.eset))
fdata <- fData(montpick.eset)
edata <- log2(as.matrix(edata) + 1)
edata <- edata[rowMeans(edata) > 10, ]
my.dist <- dist(edata)
my.tree <- hclust(my.dist, method="ward.D2")
my.clusters <- unname(cutreeDynamic(my.tree, distM=as.matrix(my.dist),minClusterSize=10))
Having the data edata
with labels referring to the type
of population, we use pARI
considering as ix
the hierarchical cluster analysis output.
out <-pARI(X = edata,alpha = 0.05, test.type = "two_samples",
label = as.factor(pdata$population),
ix = my.clusters, family = "simes", clusters = TRUE)
out$TDP
For each cluster computed by hclust
, pARI
returns the lower bound for the true discoveries proportion.
Alternatively, you can use as clusters the pathways of genes from the
getGenesets
function:
pathways <- EnrichmentBrowser::getGenesets(org = "hsa", db = "kegg", gene.id.type = "ENSEMBL")
out <- c()
for(i in seq(pathways)){
ix <- which(rownames(edata) %in% pathways[[i]])
if(length(ix)!=0){
out[i] <-pARI(X = edata,alpha = 0.05, test.type = "two_samples",
label = as.factor(pdata$population),
ix = ix, family = "simes", clusters = TRUE)$TDP
}else{
out[i] <- NA
}
}
db <- data.frame(TDP = out, size = sapply(seq(pathways), function(x) length(which(rownames(edata) %in% pathways[[x]]))), name = names(pathways))
So, we represent the lower confidence bounds for the proportion of differential expressed genes (TDP) inside each pathway in the following plot:
library(tidyverse)
db <- db %>% dplyr::filter(!is.na(TDP) | TDP!=0) %>% dplyr::filter(size >10)
db$name = factor(db$name, levels=db[order(db$TDP), "name"])
db %>% dplyr::filter(!is.na(TDP) | TDP!=0) %>% dplyr::filter(size >10) %>% arrange(TDP)%>% ggplot(aes(x = TDP, y = name, size = size)) + geom_point() +
xlab(expression(bar(pi)(S[m]))) + ylab("Pathways") + labs(size = expression(paste("|", S, "|"))) + theme_classic()+
scale_size_continuous(breaks = c(15, 25, 35, 45))
In this case, we filter out the pathways having less than \(10\) genes to simply the plot.
pARI
is particularly useful in functional Magnetic
Resonance Imaging cluster analysis, where it is of interest to select a
cluster of voxels and to provide a confidence statement on the
percentage of truly activated voxels within that cluster, avoiding the
well-known spatial specificity paradox.
We analyzed the Auditory data collected by Pernet et al. (2015), i.e., people listening vocal and non-vocal sounds.
First, let download the data from the fMRIdata
package:
if (!requireNamespace("fMRIdata", quietly = TRUE)){
remotes::install_github("angeella/fMRIdata")
}
library(fMRIdata)
data(Auditory_clusterTH3_2)
data(Auditory_copes)
data(Auditory_mask)
We have three ingredients:
Auditory_copes
as a list of
niftiImage
objects, one for each subject. The copes
represent the \(3\) dimensional (\(91 \times 109 \times 91\)) contrast map.
Each element of the array describes the estimated parameter used in the
hypotheses. In this case, the copes represent the statistics maps
regarding the contrast that describes the difference of neural
activation during vocal and non-vocal stimuli for each participant,
computed by FSL. The one-sample t-test is computed for each voxel to
analyze the hypothesis of zero mean across the subjects, i.e.,\[ H_0 : \mu_i = 0 \]
where \(\mu_i = \sum_{j = 1}^{J} copes_{ji}/J\), where \(J\) is the total number of subjects.
The cluster map Auditory_clusterTH3_2
is used as a
set of features in pARIbrain
. While our method allows any
method for forming clusters, we started from a map computed using Random
Field Theory (RFT) with a cluster-forming-threshold equalling \(3.2\).
The brain mask Auditory_mask
. In this case, we
extract it from the group-level analysis by FSL.
Finally, pARIbrain
can be used.
auditory_out <- pARIbrain(copes = Auditory_copes, clusters = Auditory_clusterTH3_2, mask = Auditory_mask, alpha = 0.05, silent = TRUE)
auditory_out$out
For each cluster, pARIbrain
returns the lower bounds of
the proportion of active voxels, the cluster’s size, the coordinates,
and the maximum statistical test value inside the cluster.
Finally, you can also produce the True Discovey Proportion brain map
using the map_TDP
function.
If you use the pARI
package, please cite the following
paper:
Andreella, A., Hemerik, J., Finos, L., Weeda, W., & Goeman, J. (2023). Permutation-based true discovery proportions for functional magnetic resonance imaging cluster analysis. Statistics in Medicine, 42(14), 2311-2340.