In this vignette, we will generate and plot simulations that will be used throughout this package and in the corresponding manuscript(s) for detecting and correction for group-level/batch effects in multi-dimensional scientific data. These simulations will focus on examples with \(n=200\) samples in \(d=2\) dimensions.
We will start by generating some code which will produce exploratory plots for us:
# a function for plotting a scatter plot of the data
plot.sim <- function(Ys, Ts, Xs, title="",
xlabel="Covariate",
ylabel="Outcome (1st dimension)") {
data = data.frame(Y1=Ys[,1], Y2=Ys[,2],
Group=factor(Ts, levels=c(0, 1), ordered=TRUE),
Covariates=Xs)
data %>%
ggplot(aes(x=Covariates, y=Y1, color=Group)) +
geom_point() +
labs(title=title, x=xlabel, y=ylabel) +
scale_x_continuous(limits = c(-1, 1)) +
scale_color_manual(values=c(`0`="#bb0000", `1`="#0000bb"),
name="Group/Batch") +
theme_bw()
}
# a function for plotting a scatter plot of the data, along with
# lines denoting the expected outcome (per covariate level)
plot.sim.expected <- function(Ys, Ts, Xs, Ytrue, Ttrue, Xtrue, title="",
xlabel="Covariate",
ylabel="Outcome (1st dimension)") {
data = data.frame(Y1=Ys[,1], Y2=Ys[,2],
Group=factor(Ts, levels=c(0, 1), ordered=TRUE),
Covariates=Xs)
data.true = data.frame(Y1=Ytrue[,1], Y2=Ytrue[,2],
Group=factor(Ttrue, levels=c(0, 1), ordered=TRUE),
Covariates=Xtrue)
data %>%
ggplot(aes(x=Covariates, y=Y1, color=Group)) +
geom_point(alpha=0.5) +
labs(title=title, x=xlabel, y=ylabel) +
geom_line(data=data.true, aes(group=Group), linewidth=1.2) +
scale_x_continuous(limits = c(-1, 1)) +
scale_color_manual(values=c(`0`="#bb0000", `1`="#0000bb"),
name="Group/Batch") +
theme_bw()
}
In this simulation, we will generate a sigmoidal simulation with
\(n=200\) samples, where the samples
have an equal probability of being from each group/batch. The
n
argument controls the number of samples.
The default is for the level of covariate overlap, the area under the curve upper-bounded by the minimum density across both groups/batches, is \(1\):
Notice that the outcomes for group/batch \(0\) appear to generally be greater than the outcomes for group/batch \(1\). This is confirmed by looking at the expected signal for each dimension, shown below as a line:
plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
sim$Ytrue, sim$Ttrue, sim$Xtrue,
title="Sigmoidal Simulation")
When the covariates are balanced, the relationship is rather obvious
and easy to differentiate without needing to look at the expected
outcomes at all. However, when the covariates become imbalanced, it
becomes less obvious. This is controlled by the
unbalancedness
parameter, and the covariate distributions
are the same across groups/batches only when
unbalancedness
is \(1\).
For instance, when we set it to \(3\):
sim = cb.sims.sim_sigmoid(n=n, unbalancedness = 3)
plot.sim(sim$Ys, sim$Ts, sim$Xs,
title="Sigmoidal Simulation (Imbalanced Covariates)")
In this case, using only the observed samples, it is much harder to tell that group/batch \(0\) tends to have higher outcomes than group/batch \(1\), despite the fact that the expected signals (for a given covariate level) are the same as the preceding simulation:
plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
sim$Ytrue, sim$Ttrue, sim$Xtrue,
title="Sigmoidal Simulation")
Note also that the level of covariate overlap has decreased:
If we shift unbalancedness
below \(1\), this pattern is reversed: group/batch
\(0\) will tend to be to the right of
the origin, and group/batch \(1\) to
the left of the origin:
In this simulation, we will generate a linear simulation with \(n=200\) samples, where the samples have an equal probability of being from each group/batch.
sim = cb.sims.sim_linear(n=n)
plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
sim$Ytrue, sim$Ttrue, sim$Xtrue,
title="Linear Simulation")
We can manipulate the ratio of samples in each group, via the \(\pi\) argument, where the probability a sample is from group/batch \(1\) is \(\pi\), and the probability a sample is from group/batch \(0\) is \(1 - \pi\). Notice that decreasing \(\pi\) to \(0.25\) yields more samples in batch \(0\) than batch \(1\):
sim = cb.sims.sim_linear(n=n, pi = 0.25)
plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
sim$Ytrue, sim$Ttrue, sim$Xtrue,
title="Linear Simulation (Imbalanced Groups/Batches)")
and there are relatively few numbers of blue points in group/batch \(1\).
In this simulation, we will generate an impulse simulation with \(n=200\) samples, where the samples have an equal probability of being from each group/batch.
sim = cb.sims.sim_impulse(n=n)
plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
sim$Ytrue, sim$Ttrue, sim$Xtrue,
title="Impulse Simulation")
We can use the nbreaks
argument to get finer
(nbreaks
is larger) or more coarse (nbreaks
is
smaller) granularity for the expected signal at a given covariate level.
For instance, decreasing nbreaks
to \(5\) breakpoints gives:
In this simulation, we will generate an impulse simulation with \(n=200\) samples. In the preceding examples,
notice that as the level of covariate overlap (controlled by
unbalancedness
) decreases, the covariate distributions
shift (symmetrically) to opposite sides of the origin. For this
simulation, that is not the case; for gorup \(0\), the covariate distribution is
symmetric about the origin, but for group \(1\), the covariate distribution is
asymmetric about the origin (and controlled by
unbalancedness
). Let’s see first when
unbalancedness
is \(1\),
and the covariate distributions are the same:
sim = cb.sims.sim_impulse_asycov(n=n, unbalancedness = 1)
plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
sim$Ytrue, sim$Ttrue, sim$Xtrue,
title="Impulse Simulation")
When we shift the unbalancedness
higher, notice that the
covariate distribution for group/batch \(1\) shifts to the right, but for
group/batch \(0\) remains in the same
(symmetric about the origin) place:
sim = cb.sims.sim_impulse_asycov(n=n, unbalancedness = 4)
plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
sim$Ytrue, sim$Ttrue, sim$Xtrue,
title="Impulse Simulation (Asymmetric Covariates)")
This pattern/observation is again reversed when
unbalancedness
is below 1
:
sim = cb.sims.sim_impulse_asycov(n=n, unbalancedness = 1/5)
plot.sim.expected(sim$Ys, sim$Ts, sim$Xs,
sim$Ytrue, sim$Ttrue, sim$Xtrue,
title="Impulse Simulation (Asymmetric Covariates)")
To fully appreciate the breadth/options for the simulations, be sure
to check their individual documentation pages using the
help()
menu.