Below are some POSSA sequential design power calculation examples that cannot be done with other common software.
library('POSSA')
This is a relatively small modification of the first t-test examples given in the introduction: here it is Welch's test, expecting unequal variances. The effect sizes and total samples are always the same as in the intro examples, for easier comparison.
First, a function that returns some added information along with the Welch's test p value.
wTestPlus = function(x, y) {
    m_diff = mean(x) - mean(y)
    sdx = sd(x)
    sdy = sd(y)
    n1 = length(x)
    n2 = length(y)
    sd_p = sqrt(((n1 - 1) * (sdx ** 2) + (n2 - 1) * (sdy ** 2)) / (n1 + n2 - 2))
    return(
        list(
            pval = stats::t.test(x, y, 'less')$p.value,
            mean_diff = m_diff,
            sd1 = sdx,
            sd2 = sdy,
            smd = m_diff / sd_p
        )
    )
}
Then a function that generates the samples with two different SD settings depending on the SDs parameter: one with equal SDs, one with unequal ones.
sampsUnequal = function(sample1, sample2_h, SDs) {
    if (SDs == 'EqualSDs') {
        sd1 = 10
        sd2 = 10
    } else {
        sd1 = 5.57
        sd2 = 13
    }
    list(
        sample1 = rnorm(sample1, mean = 0, sd = sd1),
        sample2_h0 = rnorm(sample2_h, mean = 0, sd = sd2),
        sample2_h1 = rnorm(sample2_h, mean = 5, sd = sd2)
    )
}
(The effect sizes are 0.5 for both SD variations, as can be verified via neatStats::t_neat(bayestestR::distribution_normal(5000, 5, 10), bayestestR::distribution_normal(5000, 0, 10))$stats and neatStats::t_neat(bayestestR::distribution_normal(5000, 5, 13), bayestestR::distribution_normal(5000, 0, 5.57))$stats.)
Then the testing function with some extra information returned (per H0 and H1: mean difference, the two SDs, and the SMD).
wTestForUnequal = function(sample1, sample2_h0, sample2_h1) {
    t0 = wTestPlus(sample1, sample2_h0)
    t1 = wTestPlus(sample1, sample2_h1)
    return(
        c(
            p_h0 = t0$pval,
            H0mDiff = t0$mean_diff,
            H0sd1 = t0$sd1,
            H0sd2 = t0$sd2,
            H0smd = t0$smd,
            p_h1 = t1$pval,
            H1mDiff = t1$mean_diff,
            H1sd1 = t1$sd1,
            H1sd2 = t1$sd2,
            H1smd = t1$smd
        )
    )
}
(Check via do.call(wTestForUnequal, sampsUnequal(30, 60, 'EqualSDs')).)
Now the sampling and testing functions are passed to the sim() function.
dfPvalsWelch = sim(
    fun_obs = list(sampsUnequal, SDs = c('EqualSDs', 'UnequalSDs')),
    n_obs = c(27, 54, 81),
    fun_test = wTestForUnequal
)
(The descriptives can be quickly checked via neatStats::peek_neat(dfPvalsWelch, c("H0sd1", "H0sd2", "H1sd1", "H1sd2"), group_by = c('SDs')), neatStats::peek_neat(dfPvalsWelch, c("H0smd", "H1smd"), group_by = c('SDs')), and neatStats::peek_neat(dfPvalsWelch, c("H0mDiff", "H1mDiff"), group_by = c('SDs')); all correspond to the intended factors.)
Finally, the pow() function.
pow(dfPvalsWelch, alpha_locals = NA)
#> # POSSA pow() results #
#> GROUP: pow_EqualSDs
#> N(average-total) = 158.6 (if H0 true) or 98.8 (if H1 true)
#> (p) Type I error: .05000; Power: .89931
#> Local alphas: (1) .02296; (2) .02296; (3) .02296
#> Likelihoods of significance if H0 true: (1) .02291; (2) .01633; (3) .01076
#> Likelihoods of significance if H1 true: (1) .42313; (2) .32331; (3) .15287
#> GROUP: pow_UnequalSDs
#> N(average-total) = 158.7 (if H0 true) or 100.2 (if H1 true)
#> (p) Type I error: .05000; Power: .89276
#> Local alphas: (1) .02204; (2) .02204; (3) .02204
#> Likelihoods of significance if H0 true: (1) .02329; (2) .01502; (3) .01169
#> Likelihoods of significance if H1 true: (1) .41013; (2) .32371; (3) .15891
The results are pretty similar to those with var.equal = TRUE: almost the same with equal variances (as expected), and just a little different with unequal variances.
But let’s now see with unequal sample sizes as well.
dfPvalsUnequal = sim(
    fun_obs = list(sampsUnequal, SDs = c('EqualSDs', 'UnequalSDs')),
    n_obs = list(sample1 = (c(27, 54, 81) - 15),
                 sample2_h = (c(27, 54, 81) + 15)),
    fun_test = wTestForUnequal
)
pow(dfPvalsUnequal, alpha_locals = NA)
#> # POSSA pow() results #
#> GROUP: pow_EqualSDs
#> N(average-total) = 158.7 (if H0 true) or 109.3 (if H1 true)
#> (p) Type I error: .05000; Power: .88098
#> Local alphas: (1) .02092; (2) .02092; (3) .02092
#> Likelihoods of significance if H0 true: (1) .02162; (2) .01736; (3) .01102
#> Likelihoods of significance if H1 true: (1) .27931; (2) .41667; (3) .18500
#> GROUP: pow_UnequalSDs
#> N(average-total) = 158.7 (if H0 true) or 94.1 (if H1 true)
#> (p) Type I error: .04998; Power: .92658
#> Local alphas: (1) .02334; (2) .02334; (3) .02334
#> Likelihoods of significance if H0 true: (1) .02344; (2) .01513; (3) .01140
#> Likelihoods of significance if H1 true: (1) .46102; (2) .33549; (3) .13007
Now the results depend notably on SD equality: with equal SDs, the power is lower than with equal samples per group, but with unequal SDs it is larger (again in line with the expected statistical relations). Along with the power, as is again logical (due to more early stops), the expected average sample size changes substantially as well (in case of H1).
(See also e.g. O'Brien-Fleming bounds: pow(dfPvalsUnequal, alpha_locals = c(0.0015, 0.0181, 0.0437), alpha_global = 0.05, adjust = FALSE), etc.)
To simulate ranked data, such as the answers to a single Likert scale, one elegant way is to first simulate the normally distributed underlying latent variable, then convert that to the desired interval data. See the function below.
createOrdinal = function(s, m) {
    findInterval(rnorm(s, mean = m, sd = 1.3), vec = c(-Inf, 1, 2, 3, 4, Inf))
}
The output can be checked e.g. as
neatStats::peek_neat(data.frame(x = createOrdinal(1000, 2)), 'x', f_plot = neatStats::plot_neat)
.
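As an even simpler check that does not require extra packages (not part of the original example, just an assumed alternative), a plain base R frequency table and bar plot show the resulting ordinal distribution:
table(createOrdinal(1000, 2)) # frequencies of the ordinal values 1-5
barplot(table(createOrdinal(1000, 2))) # quick visual check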
Then the sampling function for two independent samples can be as follows.
sampleRank = function(sample_size) {
    list(
        sample1 = createOrdinal(sample_size, 2),
        sample2_h0 = createOrdinal(sample_size, 2),
        sample2_h1 = createOrdinal(sample_size, 3)
    )
}
(For paired samples, one could first generate normally distributed
continuous values via faux::rnorm_multi,
then again convert them to interval with
findInterval()
.)
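For illustration, such a paired-samples sampling function could be sketched as below. This is just an assumed example, not part of the original text: the function name sampleRankPaired and the latent correlation of .5 are illustrative choices (the latent means and SD mirror the unpaired example above), and the corresponding testing function would then use wilcox.test(..., paired = TRUE).
sampleRankPaired = function(sample_size) {
    # latent (continuous) variables: baseline, an H0 pair, and an H1 pair;
    # the correlation of .5 is merely assumed for illustration
    latent = faux::rnorm_multi(
        n = sample_size,
        vars = 3,
        mu = c(2, 2, 3),
        sd = 1.3,
        r = 0.5
    )
    # convert each latent variable to the 1-5 ordinal scale
    list(
        sample1 = findInterval(latent$X1, vec = c(-Inf, 1, 2, 3, 4, Inf)),
        sample2_h0 = findInterval(latent$X2, vec = c(-Inf, 1, 2, 3, 4, Inf)),
        sample2_h1 = findInterval(latent$X3, vec = c(-Inf, 1, 2, 3, 4, Inf))
    )
}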
Then the testing function with the Wilcoxon rank-sum test.
testWilc = function(sample1, sample2_h0, sample2_h1) {
    c(
        p_h0 = wilcox.test(sample1, sample2_h0, alternative = 'less')$p.value,
        p_h1 = wilcox.test(sample1, sample2_h1, alternative = 'less')$p.value
    )
}
Then sim()
and pow()
as usual.
dfPvalsRank = sim(fun_obs = sampleRank,
                  n_obs = c(15, 30, 45),
                  fun_test = testWilc)
pow(dfPvalsRank)
#> # POSSA pow() results #
#> N(average-total) = 90.0 (if H0 true) or 90.0 (if H1 true)
#> (p) Type I error: .05113; Power: .96142
#> Local alphas: (1) none; (2) none; (3) .05000
#> Likelihoods of significance if H0 true: (1) 0; (2) 0; (3) .05113
#> Likelihoods of significance if H1 true: (1) 0; (2) 0; (3) .96142
These results for fixed design can be verified via another power
simulation package, MKpower
, as follows.
f1 = function(n) {
    createOrdinal(n, 2)
}
f2 = function(n) {
    createOrdinal(n, 3)
}
MKpower::sim.ssize.wilcox.test(
    f1,
    f2,
    power = 0.9,
    n.min = 45,
    n.max = 47,
    step.size = 1,
    alternative = 'less'
)
#> Wilcoxon rank sum test
#> n = 45
#> emp.power = 0.9629
But of course only POSSA
allows sequential design;
pow(dfPvalsRank, alpha_locals = NA)
, etc.
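For example, besides letting pow() find equal local alphas via alpha_locals = NA, user-given bounds can also be checked; here is a sketch (merely an assumed illustration) reusing the O'Brien-Fleming-style values shown earlier in this document:
pow(dfPvalsRank,
    alpha_locals = c(0.0015, 0.0181, 0.0437),
    alpha_global = 0.05,
    adjust = FALSE)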
The ANOVA power calculation here follows an example of the Superpower R package (its Example #1), for comparability.
Writing the code for ANOVA in base R is a fuss due to the weird input (and output) mode of the dedicated function. There are various extension packages available to make this easier via wrapper functions, but for simulations it is better to use the base function to make the process as fast as possible. Even so, packages like faux make simulated factorial designs less burdensome – although in the present case a special complication is that, due to the subsampling process for sequential designs, each variable has to be passed from the sampling function to the testing function separately.
In any case, here is how a 2 by 2 mixed ANOVA can be implemented.
From this one example, it should be clear how to do any other between or
within factor combination. The details of the example scenario can be
read at Superpower’s example #1 description –
here, just the following minimal necessary information is highlighted:
There are two factors for a continuous rating variable:
Voice (robot
vs. human) and Emotion
(sad vs. cheerful).
Voice is a between-subject factor.
Emotion is a within-subject factor with a correlation
of .8
between sad and
cheerful ratings. (The robot and human groups are of equal size.) The null
hypothesis (H0) is that all variable means are equal (a common rating mean of
1
), while the alternative hypothesis (H1) is that the
variable means have the specific differences as included in the code
below (for example, a rating mean of 1.03
for
“human cheerful voice”, but a rating mean of
1.41
for “human sad voice”).
Hence, for H0 as well as H1, two pairs of correlated (“rating”)
variables should be generated, one for robot, and one
for human. Within each, one will be
sad, the other cheerful (with a
correlation of .8
). Here is how this looks. (Note the
grouping notation: grp_1
for human and
grp_2
for robot.)
sampleAOV = function(samp_size) {
    dat_h0_human = faux::rnorm_multi(
        n = samp_size,
        vars = 2,
        mu = 1,
        sd = 1.03,
        r = 0.8
    )
    dat_h0_robot = faux::rnorm_multi(
        n = samp_size,
        vars = 2,
        mu = 1,
        sd = 1.03,
        r = 0.8
    )
    dat_h1_human = faux::rnorm_multi(
        n = samp_size,
        vars = 2,
        mu = c(1.03, 1.41),
        sd = 1.03,
        r = 0.8
    )
    dat_h1_robot = faux::rnorm_multi(
        n = samp_size,
        vars = 2,
        mu = c(0.98, 1.01),
        sd = 1.03,
        r = 0.8
    )
    list(
        # human_cheerful
        grp_1_human_cheerful_h0 = dat_h0_human$X1,
        grp_1_human_cheerful_h1 = dat_h1_human$X1,
        # human_sad
        grp_1_human_sad_h0 = dat_h0_human$X2,
        grp_1_human_sad_h1 = dat_h1_human$X2,
        # robot_cheerful
        grp_2_robot_cheerful_h0 = dat_h0_robot$X1,
        grp_2_robot_cheerful_h1 = dat_h1_robot$X1,
        # robot_sad
        grp_2_robot_sad_h0 = dat_h0_robot$X2,
        grp_2_robot_sad_h1 = dat_h1_robot$X2
    )
}
Now the testing function will require a data frame to be constructed
from the passed variables, to be appropriate for the aov()
function. (Again, some wrapper function such as ez::ezANOVA
could be used for readability, but this would make the simulation
somewhat slower. However, if speed is not of much importance, those
wrapper functions work just as well.)
The way of extracting p values from the output (e.g.,
aov_sum[[1]][[1]][['Pr(>F)']][1]
, where
aov_sum = summary(aov(...))
) also looks quite extraordinary,
but this is just how aov()
stores
this information. You can also use the POSSA::get_p()
function to get more readable code (e.g.,
aov_ps = get_p(aov(...))
, then
aov_ps$`Error:id`$voice
); though over many iterations
this too adds a little bit of extra processing time. In either case,
make sure you are extracting the correct p values (e.g., compare the
values with the printed results of summary(aov(...))
).
Apart from the ANOVA, the function below also includes some specific pair-wise comparisons, namely, comparing the robot and human ratings separately in the cases of sad and cheerful voice.
testAOV = function(grp_1_human_cheerful_h0,
                   grp_1_human_cheerful_h1,
                   grp_1_human_sad_h0,
                   grp_1_human_sad_h1,
                   grp_2_robot_cheerful_h0,
                   grp_2_robot_cheerful_h1,
                   grp_2_robot_sad_h0,
                   grp_2_robot_sad_h1) {
    len_grp1 = length(grp_1_human_cheerful_h0)
    len_grp2 = length(grp_2_robot_cheerful_h0)
    raw_data = data.frame(
        obs = c(
            grp_1_human_cheerful_h0,
            grp_1_human_sad_h0,
            grp_2_robot_cheerful_h0,
            grp_2_robot_sad_h0
        ),
        id = c(paste0('g1_', c(1:len_grp1, 1:len_grp1)),
               paste0('g2_', c(1:len_grp2, 1:len_grp2))),
        voice = c(rep('human', len_grp1 * 2), rep('robot', len_grp2 * 2)),
        emotion = c(
            rep('cheerful', len_grp1),
            rep('sad', len_grp1),
            rep('cheerful', len_grp2),
            rep('sad', len_grp2)
        )
    )
    aov_h0 = summary(aov(obs ~ voice * emotion + Error(id / emotion),
                         data = raw_data))
    raw_data$obs = c(
        grp_1_human_cheerful_h1,
        grp_1_human_sad_h1,
        grp_2_robot_cheerful_h1,
        grp_2_robot_sad_h1
    )
    aov_h1 = summary(aov(obs ~ voice * emotion + Error(id / emotion),
                         data = raw_data))
    return(
        c(
            p_voice_h0 = aov_h0[[1]][[1]][['Pr(>F)']][1],
            p_voice_h1 = aov_h1[[1]][[1]][['Pr(>F)']][1],
            p_emo_h0 = aov_h0[[2]][[1]][['Pr(>F)']][1],
            p_emo_h1 = aov_h1[[2]][[1]][['Pr(>F)']][1],
            p_interact_h0 = aov_h0[[2]][[1]][['Pr(>F)']][2],
            p_interact_h1 = aov_h1[[2]][[1]][['Pr(>F)']][2],
            p_sad_rob_vs_hum_h0 = t.test(grp_1_human_sad_h0,
                                         grp_2_robot_sad_h0,
                                         var.equal = TRUE)$p.value,
            p_sad_rob_vs_hum_h1 = t.test(grp_1_human_sad_h1,
                                         grp_2_robot_sad_h1,
                                         var.equal = TRUE)$p.value,
            p_cheer_rob_vs_hum_h0 = t.test(
                grp_1_human_cheerful_h0,
                grp_2_robot_cheerful_h0,
                var.equal = TRUE
            )$p.value,
            p_cheer_rob_vs_hum_h1 = t.test(
                grp_1_human_cheerful_h1,
                grp_2_robot_cheerful_h1,
                var.equal = TRUE
            )$p.value
        )
    )
}
It is really worth verifying this lengthy function pair via do.call(testAOV, sampleAOV(100)).
The sim() and pow() functions are then straightforward. (However, the simulation can take a while. For quick testing, lower the number of iterations, e.g. n_iter = 1000 – the same as Superpower::ANOVA_power's default.)
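For instance, a quick trial run could be sketched as follows (n_iter is the iteration-count argument mentioned above; dfPvalsAovQuick is just a placeholder name, and the estimates from such a run are correspondingly less precise). The full run below uses the default number of iterations.
dfPvalsAovQuick = sim(fun_obs = sampleAOV,
                      n_obs = 40,
                      fun_test = testAOV,
                      n_iter = 1000) # quick but imprecise trial run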
dfPvalsAOV = sim(fun_obs = sampleAOV,
                 n_obs = 40,
                 fun_test = testAOV)
#> Note: Observation numbers groupped as "grp_1" for grp_1_human_cheerful_h0, grp_1_human_cheerful_h1, grp_1_human_sad_h0, grp_1_human_sad_h1.
#> Note: Observation numbers groupped as "grp_2" for grp_2_robot_cheerful_h0, grp_2_robot_cheerful_h1, grp_2_robot_sad_h0, grp_2_robot_sad_h1.
pow(dfPvalsAOV)
#> # POSSA pow() results #
#> N(average-total) = 80.0 (if H0 true) or 80.0 (if H1 true)
#> (p_voice) Type I error: .05096; Power: .17422
#> Local alphas: (1) .05000
#> (p_emo) Type I error: .04933; Power: .79371
#> Local alphas: (1) .05000
#> (p_interact) Type I error: .05020; Power: .65884
#> Local alphas: (1) .05000
#> (p_sad_rob_vs_hum) Type I error: .05047; Power: .40084
#> Local alphas: (1) .05000
#> (p_cheer_rob_vs_hum) Type I error: .05024; Power: .05398
#> Local alphas: (1) .05000
#> Global ("combined significance") type I error: .16453 (included: p_voice, p_emo, p_interact, p_sad_rob_vs_hum, p_cheer_rob_vs_hum; power for reaching the "combined significance": .94742)
The same in Superpower below. (Caution: with 45000 iterations, this takes a very long time.)
designResult <- Superpower::ANOVA_design(
    design = "2b*2w",
    n = 40,
    mu = c(1.03, 1.41, 0.98, 1.01),
    sd = 1.03,
    r = 0.8,
    labelnames = c("voice", "human", "robot", "emotion", "cheerful", "sad"),
    plot = TRUE
)
superPower <- Superpower::ANOVA_power(
    designResult,
    alpha = 0.05,
    nsims = 45000,
    seed = 1234,
    emm = FALSE
)
#> Power and Effect sizes for ANOVA tests
#> power effect_size
#> anova_voice 17.74 0.02567
#> anova_emotion 79.58 0.10074
#> anova_voice:emotion 65.85 0.07811
#> Power and Effect sizes for pairwise comparisons (t-tests)
#> power effect_size
#> p_voice_human_emotion_cheerful_voice_robot_emotion_cheerful 5.507 -0.05149
#> p_voice_human_emotion_sad_voice_robot_emotion_sad 40.667 -0.39404
Importantly, of course, with POSSA
there is the option
for sequential testing. Here, it is particularly important to pay
attention to which specific one(s) of the included tests (p values) should
have stopping alphas (or futility bounds).
For instance, as often happens, let's say that the interaction is the only really important test in the entire ANOVA analysis. In that case, that test alone could be given stopping alphas and futility bounds. The rest may be included with non-stopping alphas to get the related power and T1ER. This can be done e.g. as below.
dfPvalsAovSeq = sim(fun_obs = sampleAOV,
                    n_obs = c(15, 30, 45),
                    fun_test = testAOV)
#> Note: Observation numbers groupped as "grp_1" for grp_1_human_cheerful_h0, grp_1_human_cheerful_h1, grp_1_human_sad_h0, grp_1_human_sad_h1.
#> Note: Observation numbers groupped as "grp_2" for grp_2_robot_cheerful_h0, grp_2_robot_cheerful_h1, grp_2_robot_sad_h0, grp_2_robot_sad_h1.
pow(
    dfPvalsAovSeq,
    alpha_locals = list(p_interact = c(0.002, 0.018, 0.044)),
    fut_locals = list(p_interact = c(0.6, 0.4)),
    alpha_loc_nonstop = list(
        p_voice = 0.5,
        p_emo = 0.5,
        p_sad_rob_vs_hum = 0.5,
        p_cheer_rob_vs_hum = 0.5
    )
)
#> # POSSA pow() results #
#> N(average-total) = 56.4 (if H0 true) or 66.5 (if H1 true)
#> (p_interact) Type I error: .05000; Power: .65662
#> Local alphas: (1) .00226; (2) .02030; (3) .04963
#> Likelihoods of significance if H0 true: (1) .00202; (2) .01920; (3) .02878
#> Likelihoods of significance if H1 true: (1) .04331; (2) .34111; (3) .27220
#> Futility bounds: (1) .60000; (2) .40000
#> Likelihoods of exceeding futility alpha if H0 true: (1) .07787; (2) .23536
#> Likelihoods of exceeding futility alpha if H1 true: (1) .15102; (2) .05380
#> (non-stopper: p_voice) Type I error: .50089; Power: .65498
#> Local alphas (secondary): (1) .50000; (2) .50000; (3) .50000
#> Likelihoods of significance if H0 true: (1) .19976; (2) .15896; (3) .14218
#> Likelihoods of significance if H1 true: (1) .11129; (2) .25760; (3) .28609
#> Futility bounds: none
#> (non-stopper: p_emo) Type I error: .49547; Power: .95351
#> Local alphas (secondary): (1) .50000; (2) .50000; (3) .50000
#> Likelihoods of significance if H0 true: (1) .19758; (2) .15911; (3) .13878
#> Likelihoods of significance if H1 true: (1) .16751; (2) .37978; (3) .40622
#> Futility bounds: none
#> (non-stopper: p_sad_rob_vs_hum) Type I error: .49291; Power: .81924
#> Local alphas (secondary): (1) .50000; (2) .50000; (3) .50000
#> Likelihoods of significance if H0 true: (1) .19140; (2) .15609; (3) .14542
#> Likelihoods of significance if H1 true: (1) .12289; (2) .33691; (3) .35944
#> Futility bounds: none
#> (non-stopper: p_cheer_rob_vs_hum) Type I error: .49513; Power: .51316
#> Local alphas (secondary): (1) .50000; (2) .50000; (3) .50000
#> Likelihoods of significance if H0 true: (1) .19322; (2) .15704; (3) .14487
#> Likelihoods of significance if H1 true: (1) .10598; (2) .19667; (3) .21051
#> Futility bounds: none
Again, any different sample sizes could also be specified
for each of the two groups via n_obs
(e.g. n_obs = list(grp_1 = c(10, 15, 30), grp_2 = c(20, 45, 60))
).
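A sketch of such a call, simply reusing the example values from the note above (dfPvalsAovUneq is just a placeholder name):
dfPvalsAovUneq = sim(fun_obs = sampleAOV,
                     n_obs = list(grp_1 = c(10, 15, 30),
                                  grp_2 = c(20, 45, 60)),
                     fun_test = testAOV)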
This example is from a recent planning of a real experiment, where
the diagnostic efficiencies were compared between polygraph (autonomic response, AR)-based and response time (RT)-based versions of a lie
detection method. What’s interesting here is that, while the RT
predictor variable is expected to be normally distributed in case of
both liars and truthtellers, the AR predictor variable (which is based
on a certain integrated p value) is expected to follow a
beta-distribution in case of liars and uniform distribution in case of
truthtellers. A typical way to measure diagnostic efficiency (here:
distinguishing liars from truthtellers) is area under the receiver
operating characteristic curve (AUC), which in this case can be compared
with a nonparametric DeLong’s test. However, there is no software to
calculate power for DeLong’s test for independent samples, let alone for
sequential analysis. Below is an example of how this could be done with
POSSA
.
Here, the H0 is that the two AUCs are the same, and the H1 is that the RT version yields a larger AUC. (The expectations for the AUC sizes and the distributions are based on previous research. However, since the AUC size is not straightforward to calculate from the given distributions, in order to have equal AUCs for each condition for H0 simulation, the specific distributions below were arrived at by trial-and-error adjustments of the key characteristics of the distributions.)
sampleAUC = function(sampSize) {
    list(
        # AR
        ARtruth = -runif(n = sampSize, 0, 1),
        ARliar = sort(-rbeta(n = sampSize, shape1 = 0.2, 1)),
        # RT
        RTtruth = rnorm(sampSize, mean = 0, sd = 30),
        RTliar_h0 = rnorm(sampSize, mean = 41, sd = 30),
        RTliar_h1 = rnorm(sampSize, mean = 60, sd = 30)
    )
}
Below is the testing function. To denote the predictor conditions,
1
represents lies (“positive” cases), and 0
represents truth (“negative” cases, no lies).
testAUC = function(ARtruth,
                   ARliar,
                   RTtruth,
                   RTliar_h0,
                   RTliar_h1) {
    predictors = data.frame(
        guilt = c(rep(0, length(ARtruth)), rep(1, length(ARliar))),
        pred_ans = c(ARtruth, ARliar),
        pred_RTh0 = c(RTtruth, RTliar_h0),
        pred_RTh1 = c(RTtruth, RTliar_h1)
    )
    AucAR = pROC::roc(
        response = predictors$guilt,
        predictor = predictors$pred_ans,
        levels = c(0, 1),
        direction = "<" # second expected larger
    )
    AucRtH0 = pROC::roc(
        response = predictors$guilt,
        predictor = predictors$pred_RTh0,
        levels = c(0, 1),
        direction = "<"
    )
    AucRtH1 = pROC::roc(
        response = predictors$guilt,
        predictor = predictors$pred_RTh1,
        levels = c(0, 1),
        direction = "<"
    )
    return(
        c(
            p_h0 = pROC::roc.test(
                AucAR,
                AucRtH0,
                pair = TRUE,
                alternative = "less"
            )$p.value,
            p_h1 = pROC::roc.test(
                AucAR,
                AucRtH1,
                pair = TRUE,
                alternative = "less"
            )$p.value,
            AucAR = as.numeric(pROC::auc(AucAR)),
            AucRtH0 = as.numeric(pROC::auc(AucRtH0)),
            AucRtH1 = as.numeric(pROC::auc(AucRtH1))
        )
    )
}
Quick check via do.call(testAUC, sampleAUC(100000))
, in
particular to see (with this very large sample, which thereby gives
fairly accurate estimates) that the AUCs are very close in case of
H0.
Then just the usual sim()
and pow()
.
dfPvalsAUC = sim(fun_obs = sampleAUC,
                 n_obs = c(30, 60, 90),
                 fun_test = testAUC)
print(dfPvalsAUC)
#> POSSA sim() results (p values)
#> Sample:
#> .iter .look .n_total ARtruth ARliar RTtruth RTliar_h p_h0 p_h1 AucAR AucRtH0 AucRtH1
#> 1: 1 1 120 30 30 30 30 0.7724812 1.820265e-01 0.9100000 0.8666667 0.9544444
#> 2: 1 2 240 60 60 60 60 0.3407161 3.844820e-03 0.8291667 0.8527778 0.9513889
#> 3: 1 3 360 90 90 90 90 0.1707108 1.842903e-05 0.8129630 0.8544444 0.9574074
#> Descriptives:
#> AucAR:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.5878 0.8083 0.8353 0.8334 0.8606 0.9844
#> AucRtH0:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.5533 0.8088 0.8353 0.8332 0.8600 0.9844
#> AucRtH1:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.7044 0.9058 0.9233 0.9214 0.9389 1.0000
pow(dfPvalsAUC)
#> # POSSA pow() results #
#> N(average-total) = 360.0 (if H0 true) or 360.0 (if H1 true)
#> (p) Type I error: .04689; Power: .79536
#> Local alphas: (1) none; (2) none; (3) .05000
#> Likelihoods of significance if H0 true: (1) 0; (2) 0; (3) .04689
#> Likelihoods of significance if H1 true: (1) 0; (2) 0; (3) .79536
pow(
    dfPvalsAUC,
    alpha_locals = c(0.002, 0.018, 0.044),
    fut_locals = c(0.8, 0.4)
)
#> # POSSA pow() results #
#> N(average-total) = 259.3 (if H0 true) or 292.2 (if H1 true)
#> (p) Type I error: .05000; Power: .78798
#> Local alphas: (1) .00211; (2) .01900; (3) .04645
#> Likelihoods of significance if H0 true: (1) .00187; (2) .01684; (3) .03129
#> Likelihoods of significance if H1 true: (1) .04847; (2) .41316; (3) .32636
#> Futility bounds: (1) .80000; (2) .40000
#> Likelihoods of exceeding futility alpha if H0 true: (1) .01180; (2) .17262
#> Likelihoods of exceeding futility alpha if H1 true: (1) .01216; (2) .03084
Instead of the huge sample size of 90*4 = 360
with fixed
design for any decent SESOI (here: an AUC improvement of about 9%), with
sequential design the average required sample is reduced to 259.3 for H0
and 292.2 for H1, with only a practically negligible .007
reduction in power (.787
instead of .795
).
That’s all for now.