The two examples below implement Bayesian linear and logistic regression models using the horseshoe prior over parameters to encourage a sparse model. Models are fitted with the hsstan R package, which performs full Bayesian inference through a Stan implementation. In Bayesian inference, model meta-parameters such as the amount of shrinkage are also given prior distributions and are thus learned directly from the data during sampling. This bypasses the need to cross-validate over a grid of meta-parameter values, as would be required to find the optimal lambda in a lasso or elastic net model.
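
To illustrate the contrast, here is a minimal sketch on simulated data (glmnet is assumed to be installed and is used only for comparison): the lasso tunes lambda over a grid by internal cross-validation, whereas hsstan samples the shrinkage scales along with the coefficients in a single fit.

library(glmnet)
library(hsstan)

set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10,
            dimnames = list(NULL, paste0("m", 1:10)))
y <- 2 * x[, 1] + rnorm(100)

# lasso: the shrinkage parameter lambda is tuned over a grid by CV
cv.fit <- cv.glmnet(x, y, alpha = 1)
cv.fit$lambda.min

# horseshoe: the shrinkage scales have their own priors and are sampled
# together with the coefficients, so no tuning grid is needed
df <- data.frame(y = y, x)
hs.fit <- hsstan(df, y ~ 1, penalized = paste0("m", 1:10),
                 chains = 2, iter = 500)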

However, Bayesian inference is computationally intensive. In high-dimensional settings, e.g. with more than 10,000 biomarkers, pre-filtering of inputs based on univariate measures of association with the outcome may be beneficial. If pre-filtering is used, a cross-validation procedure is needed to ensure that the data points used for pre-filtering and model fitting differ from those used to quantify model performance. The outercv() function performs univariate pre-filtering and cross-validates model performance in this setting.
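
As a standalone sketch of the univariate filtering step on simulated data (within outercv() the filter is applied automatically inside each training fold):

library(nestedcv)

set.seed(1)
x <- matrix(rnorm(50 * 20), 50, 20,
            dimnames = list(NULL, paste0("marker", 1:20)))
y <- x[, "marker1"] + rnorm(50)

# keep the 5 markers most associated with y by univariate linear regression
lm_filter(y, x, nfilter = 5, p_cutoff = NULL, type = "names")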

Parallelisation

CAUTION should be used when setting the number of cores available for parallelisation. The default setting in hsstan is to use 4 cores to parallelise the Markov chains in the Bayesian inference procedure. This can be switched off by setting options(mc.cores = 1).

The cv.cores argument in outercv() controls parallelisation over the outer CV folds. On unix/mac, setting cv.cores > 1 induces nested parallelisation, which will generate an error unless parallelisation of the chains is disabled by setting options(mc.cores = 1).

Nested parallelisation is feasible if cv.cores > 1 and multicore_fork = FALSE is set, as this uses cluster-based parallelisation instead of mclapply. Beware that large numbers of processes may be spawned: the code will use the product of cv.cores and mc.cores, so performing 10-fold cross-validation with 4 chains and cv.cores = 10 invokes 40 processes simultaneously. The safe configurations are sketched below.
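
Schematically (the remaining outercv() arguments are elided):

# chains in parallel, outer CV folds run serially (hsstan default)
options(mc.cores = 4)
# res <- outercv(..., cv.cores = 1)

# outer CV folds in parallel on unix/mac, chains run serially
options(mc.cores = 1)
# res <- outercv(..., cv.cores = 10)

# nested parallelisation: 10 x 4 = 40 processes spawned simultaneously
options(mc.cores = 4)
# res <- outercv(..., cv.cores = 10, multicore_fork = FALSE)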

Linear regression with a Bayesian shrinkage model (continuous outcomes)

We use cross-validation, applying univariate filtering of predictors and model fitting in one part of the data (the training fold), followed by evaluation of model performance on the left-out data (the testing fold), repeated for each fold.

Only one cross-validation split is needed (function outercv()) as the Bayesian model does not require cross-validation for meta-parameters.

Note that in the examples below the length of sampling and the number of chains are curtailed for speed. In practice we recommend 4 chains, warmup = 1000 and iter = 2000.

library(nestedcv)
library(hsstan)

# number of cores for parallelising hsstan sampling;
# store the original options so they can be restored at the end
oldopt <- options(mc.cores = 2)

# load iris dataset and simulate a continuous outcome
data(iris)
dt <- iris[, 1:4]
colnames(dt) <- c("marker1", "marker2", "marker3", "marker4")
dt <- as.data.frame(apply(dt, 2, scale))
dt$outcome.cont <- -3 + 0.5 * dt$marker1 + 2 * dt$marker2 + rnorm(nrow(dt), 0, 2)

# unpenalised covariates: always retain in the prediction model
uvars <- "marker1"
# penalised covariates: coefficients are drawn from hierarchical shrinkage prior
pvars <- c("marker2", "marker3", "marker4") # penalised covariates
# run cross-validation with univariate filter and hsstan
res.cv.hsstan <- outercv(y = dt$outcome.cont, x = dt[, c(uvars, pvars)],
                         model = "model.hsstan",
                         filterFUN = lm_filter,
                         filter_options = list(force_vars = uvars,
                                               nfilter = 2,
                                               p_cutoff = NULL,
                                               rsq_cutoff = 0.9),
                         chains = 2,  # chains parallelised via options(mc.cores)
                         n_outer_folds = 3, cv.cores = 1,  # CV folds not parallelised
                         unpenalized = uvars, warmup = 100, iter = 200)

We can then view prediction performance based on the testing folds and examine the Bayesian model using the hsstan package.

summary(res.cv.hsstan)
#> Single cross-validation to measure performance
#> Outer loop:  3-fold CV
#> No inner loop
#> 150 observations, 4 predictors
#> 
#> Model:  model.hsstan 
#> Filter:  lm_filter 
#>         n.filter
#> Fold 1         3
#> Fold 2         3
#> Fold 3         3
#> 
#> Final fit:
#>              mean   sd  2.5% 97.5% n_eff Rhat
#> (Intercept) -3.17 0.14 -3.40 -2.89   221 0.99
#> marker1      0.37 0.32 -0.36  0.90   209 1.00
#> marker2      2.07 0.22  1.70  2.47   196 1.00
#> marker3      0.11 0.31 -0.43  0.92   104 1.00
#> 
#> Result:
#>     RMSE   Rsquared        MAE   
#>   2.1226     0.4772     1.6971

sampler.stats(res.cv.hsstan$final_fit)
#>         accept.stat stepsize divergences treedepth gradients warmup sample
#> chain:1      0.9843   0.0241           0         8     14844   0.25   0.12
#> chain:2      0.9722   0.0316           0         8     11356   0.12   0.09
#> all          0.9783   0.0279           0         8     26200   0.37   0.21
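
Since sampler.stats() returns the matrix shown above, basic convergence checks can be scripted, for example:

ss <- sampler.stats(res.cv.hsstan$final_fit)
ss["all", "divergences"] == 0  # TRUE: no divergent transitions
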
print(projsel(res.cv.hsstan$final_fit), digits = 4)  # projection predictive forward selection
#>                                                      Model        KL        ELPD
#>                                             Intercept only   0.32508  -374.99055
#>                                           Initial submodel   0.32031  -374.72416
#>                                                    marker2   0.00151  -325.82289
#>                                                    marker3   0.00000  -325.88824
#>                var       kl rel.kl.null rel.kl   elpd delta.elpd
#> 1   Intercept only 0.325077     0.00000     NA -375.0  -49.10230
#> 2 Initial submodel 0.320306     0.01468 0.0000 -374.7  -48.83591
#> 3          marker2 0.001506     0.99537 0.9953 -325.8    0.06535
#> 4          marker3 0.000000     1.00000 1.0000 -325.9    0.00000

Here adding marker2 improves the model fit, shown by the substantial decrease in KL divergence between the submodel and the full model. Adding marker3 does not improve the fit further, as there is no meaningful further decrease in KL divergence.
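
Because projsel() returns the data frame printed above, the selection can also be inspected programmatically, for example extracting the terms by which most of the KL divergence relative to the intercept-only model is recovered:

sel <- projsel(res.cv.hsstan$final_fit)
sel$var[sel$rel.kl.null > 0.95]  # here: "marker2" and "marker3"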

Logistic regression with a Bayesian shrinkage model (binary outcomes)

We use cross-validation, applying univariate filtering of predictors and model fitting in one part of the data (the training fold), followed by evaluation of model performance on the left-out data (the testing fold), repeated for each fold.

Only one cross-validation split is needed (function outercv()) as the Bayesian model does not require cross-validation for meta-parameters.

# sigmoid function
sigmoid <- function(x) {1 / (1 + exp(-x))}

# load iris dataset and create a binary outcome
set.seed(267)
data(iris)
dt <- iris[, 1:4]
colnames(dt) <- c("marker1", "marker2", "marker3", "marker4")
dt <- as.data.frame(apply(dt, 2, scale))
rownames(dt) <- paste0("sample", 1:nrow(dt))
dt$outcome.bin <- sigmoid(0.5 * dt$marker1 + 2 * dt$marker2) > runif(nrow(dt))
dt$outcome.bin <- factor(dt$outcome.bin)

# unpenalised covariates: always retain in the prediction model
uvars <- "marker1"
# penalised covariates: coefficients are drawn from hierarchical shrinkage prior
pvars <- c("marker2", "marker3", "marker4") # penalised covariates
# run cross-validation with univariate filter and hsstan
res.cv.hsstan <- outercv(y = dt$outcome.bin,
                         x = as.matrix(dt[, c(uvars, pvars)]),
                         model = "model.hsstan",
                         filterFUN = ttest_filter,
                         filter_options = list(force_vars = uvars,
                                               nfilter = 2,
                                               p_cutoff = NULL,
                                               rsq_cutoff = 0.9),
                         chains = 2,  # parallelise over chains
                         n_outer_folds = 3, cv.cores = 1,
                         unpenalized = uvars, warmup = 100, iter = 200)

We view prediction performance based on testing folds and examine the model.

summary(res.cv.hsstan)
#> Single cross-validation to measure performance
#> Outer loop:  3-fold CV
#> No inner loop
#> 150 observations, 4 predictors
#> FALSE  TRUE 
#>    78    72 
#> 
#> Model:  model.hsstan 
#> Filter:  ttest_filter 
#>         n.filter
#> Fold 1         3
#> Fold 2         3
#> Fold 3         3
#> 
#> Final fit:
#>              mean   sd  2.5% 97.5% n_eff Rhat
#> (Intercept) -0.12 0.23 -0.56  0.27   200    1
#> marker1      0.50 0.30 -0.10  1.16   208    1
#> marker2      1.91 0.36  1.26  2.66   263    1
#> marker3      0.01 0.24 -0.55  0.69   171    1
#> 
#> Result:
#>          Reference
#> Predicted FALSE TRUE
#>     FALSE    56   23
#>     TRUE     22   49
#> 
#>               AUC            Accuracy   Balanced accuracy   
#>            0.8284              0.7000              0.6993
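
Balanced accuracy can be verified by hand from the confusion matrix above, as the mean of sensitivity and specificity:

sens <- 49 / (49 + 23)  # true positives / all reference TRUE
spec <- 56 / (56 + 22)  # true negatives / all reference FALSE
mean(c(sens, spec))     # 0.6993, matching the summary output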

# examine the Bayesian model
print(projsel(res.cv.hsstan$final_fit), digits = 4)  # projection predictive forward selection
#>                                                      Model        KL        ELPD
#>                                             Intercept only   0.20643  -104.26957
#>                                           Initial submodel   0.20118  -104.17411
#>                                                    marker2   0.00060  -73.84232
#>                                                    marker3   0.00000  -73.93210
#>                var        kl rel.kl.null rel.kl    elpd delta.elpd
#> 1   Intercept only 2.064e-01     0.00000     NA -104.27  -30.33748
#> 2 Initial submodel 2.012e-01     0.02543  0.000 -104.17  -30.24201
#> 3          marker2 5.964e-04     0.99711  0.997  -73.84    0.08977
#> 4          marker3 9.133e-18     1.00000  1.000  -73.93    0.00000
options(oldopt)  # reset options

Here adding marker2 improves the model fit, shown by the substantial decrease in KL divergence between the submodel and the full model. Adding marker3 does not improve the fit further, as there is no meaningful further decrease in KL divergence.

Note

At the time of writing, there appears to be a bug in rstan (used by hsstan) which causes it to ignore the pass-through argument cores and instead spawn multiple processes as specified by the chains argument. This behaviour can be limited by setting options(mc.cores = ..).
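
A defensive workaround, using only base R and the parallel package, is to cap the option explicitly before sampling:

# cap chain-level parallelisation regardless of package defaults
options(mc.cores = min(2L, parallel::detectCores()))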

Troubleshooting

A key problem with parallelisation in R is that errors, warnings and user input have to be suppressed during multicore processing. If a nestedcv call is not working, we recommend trying it first with cv.cores = 1 to check that it starts up without error messages.

Citation

If you use this package, please cite as:

Lewis MJ, Spiliopoulou A, Goldmann K, Pitzalis C, McKeigue P, Barnes MR (2023). nestedcv: an R package for fast implementation of nested cross-validation with embedded feature selection designed for transcriptomics and high dimensional data. Bioinformatics Advances. https://doi.org/10.1093/bioadv/vbad048

References

Carpenter, B., et al. Stan: A Probabilistic Programming Language. Journal of Statistical Software 2017;76(1):1-32.

Piironen, J. and Vehtari, A. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics 2017;11(2):5018-5051.