---
title: "Validating predictInterval() against brms"
author: "Jared Knowles"
date: "Written 2026-05-29 (last updated 2026-05-29)"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Validating predictInterval() against brms}
  %\VignetteEncoding{UTF-8}
---

<style type="text/css">pre, pre code { white-space: pre-wrap !important; word-break: break-word; overflow-wrap: anywhere; }</style>



## Why compare to brms?

`merTools::predictInterval()` was written to put prediction intervals on
mixed-effect models that are too large or too slow to bootstrap. It does this by
simulating from the estimated distribution of the fixed and random effects --
fast, but an approximation. Since then, [brms](https://paulbuerkner.com/brms/)
has become a standard for *fully* Bayesian multilevel modeling, and its
posterior predictive distribution is about as close to a gold standard as we
have. This vignette asks a simple question: **how close are
`predictInterval()`'s intervals to the ones brms produces?**

We look at three things: a Gaussian random-slopes model, what happens for an
entirely unobserved group, and a binomial GLMM. This vignette is *precompiled* --
reproducing it requires the `brms` package and a working Stan toolchain.

## A random-slopes model

We use the bundled `hsb` data (7,185 students in 160 schools) and model math
achievement from student SES, school-mean SES, and a school-varying SES slope.
We hold out 20% of students for testing, plus six whole schools to stand in for
"new groups" later.


``` r
data(hsb)
hsb$schid <- factor(hsb$schid)
new_schools <- sample(levels(hsb$schid), 6)
hsb$.set <- "train"
hsb$.set[hsb$schid %in% new_schools] <- "test_new"
seen <- which(hsb$.set == "train")
hsb$.set[sample(seen, round(0.2 * length(seen)))] <- "test_seen"
train     <- droplevels(hsb[hsb$.set == "train", ])
test_seen <- hsb[hsb$.set == "test_seen" & hsb$schid %in% levels(train$schid), ]
test_new  <- hsb[hsb$.set == "test_new", ]

f1 <- mathach ~ ses + meanses + (ses | schid)
m_lme <- lmer(f1, data = train)
brms_fit_seconds <- system.time(
  m_brm <- fit_brm(f1, train, gaussian(), "brms_hsb_meanses"))[["elapsed"]]
```

We draw a predictive distribution for each held-out student from each method --
`predictInterval(returnSims = TRUE)` and `brms::posterior_predict()` -- and
compute everything (point estimates, intervals, coverage) from those draws.


``` r
mt_seen <- mt_draws(m_lme, test_seen)
pp_seen <- posterior_predict(m_brm, newdata = test_seen, allow_new_levels = TRUE)
```

### Point estimates are essentially identical

Both methods condition on the same estimated fixed effects and school BLUPs, so
the conditional-mean predictions should agree almost exactly -- and they do.


``` r
lme_mean <- predict(m_lme, newdata = test_seen)
brm_mean <- colMeans(posterior_epred(m_brm, newdata = test_seen, allow_new_levels = TRUE))
c(correlation = cor(lme_mean, brm_mean),
  mean_abs_diff = mean(abs(lme_mean - brm_mean)),
  response_sd = sd(test_seen$mathach))
```

```
#>   correlation mean_abs_diff   response_sd 
#>    0.99984045    0.03670134    6.71745425
```

``` r
ggplot(data.frame(merTools = lme_mean, brms = brm_mean), aes(merTools, brms)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "grey50") +
  geom_point(alpha = 0.3, size = 0.8, color = "#1B9E77") + coord_equal() +
  labs(x = "lme4 predict()", y = "brms posterior_epred()") +
  theme_minimal(base_size = 12)
```

<div class="figure" style="text-align: center">
<img src="brms-point-1.png" alt=""  />
</div>

### The prediction intervals agree


``` r
b_mt <- qband(mt_seen, 0.90); b_brm <- qband(pp_seen, 0.90)
c(width_merTools = median(b_mt$upr - b_mt$lwr),
  width_brms = median(b_brm$upr - b_brm$lwr),
  cor_lower = cor(b_mt$lwr, b_brm$lwr),
  cor_upper = cor(b_mt$upr, b_brm$upr))
```

```
#> width_merTools     width_brms      cor_lower      cor_upper 
#>     20.1789145     20.2344075      0.9901934      0.9899334
```

``` r
endpts <- rbind(data.frame(bound = "lower", merTools = b_mt$lwr, brms = b_brm$lwr),
                data.frame(bound = "upper", merTools = b_mt$upr, brms = b_brm$upr))
ggplot(endpts, aes(merTools, brms, color = bound)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "grey50") +
  geom_point(alpha = 0.3, size = 0.8) + coord_equal() +
  labs(title = "90% prediction-interval endpoints", x = "merTools", y = "brms") +
  theme_minimal(base_size = 12)
```

<div class="figure" style="text-align: center">
<img src="brms-intervals-1.png" alt=""  />
</div>

### And they are equally well calibrated

The real test is out-of-sample: do nominal intervals cover held-out scores at
the nominal rate? Both methods track the diagonal, and each other.


``` r
data.frame(nominal = NOMINAL,
           merTools = coverage(mt_seen, test_seen$mathach, NOMINAL),
           brms     = coverage(pp_seen, test_seen$mathach, NOMINAL))
```

```
#>   nominal  merTools      brms
#> 1    0.50 0.4569776 0.4598698
#> 2    0.80 0.7895879 0.7932032
#> 3    0.90 0.9168474 0.9168474
#> 4    0.95 0.9660159 0.9703543
```

### At a fraction of the cost


``` r
t_lme <- system.time(lmer(f1, data = train))[["elapsed"]]
t_mt  <- system.time(mt_draws(m_lme, test_seen))[["elapsed"]]
t_pp  <- system.time(posterior_predict(m_brm, newdata = test_seen,
                                       allow_new_levels = TRUE))[["elapsed"]]
data.frame(  # brms_fit_seconds was captured when the model was fit, above
  step = c("lme4 fit", "predictInterval", "brms fit (compile+sample)", "posterior_predict"),
  seconds = round(c(t_lme, t_mt, brms_fit_seconds, t_pp), 2))
```

```
#>                        step seconds
#> 1                  lme4 fit    0.05
#> 2           predictInterval    0.53
#> 3 brms fit (compile+sample)  398.97
#> 4         posterior_predict    1.52
```

For this model, the entire lme4 + `predictInterval()` path runs in about a
second, while a single brms fit takes several minutes -- a couple of orders of
magnitude difference, before the (much larger) models `predictInterval()` was
really built for.

## What about an entirely new group?

For a school that was never seen during fitting there is a genuine modeling
choice: do you *drop* the school's effect and predict from the fixed effects
only, or *sample* a plausible effect for it from the estimated school-level
distribution? Both packages support either, and the comparison is only fair when
the choices match:

| treatment of an unseen group | `predictInterval()` | brms |
|---|---|---|
| drop the effect (population prediction) | `new.levels = "zero"` *(default)* | `re_formula = NA` |
| sample a new effect | `new.levels = "draw"` | `allow_new_levels = TRUE` *(default)* |

Matched up, the two tools agree on both the width and the coverage of the
new-school intervals:


``` r
nd <- test_new
pp_pop  <- posterior_predict(m_brm, newdata = nd, re_formula = NA)
pp_draw <- posterior_predict(m_brm, newdata = nd, allow_new_levels = TRUE)
mt_zero <- mt_draws(m_lme, nd, new.levels = "zero")
mt_draw <- mt_draws(m_lme, nd, new.levels = "draw")
w <- function(d) median(qband(d, 0.90)$upr - qband(d, 0.90)$lwr)
data.frame(
  unseen_group   = c("drop effect", "sample effect"),
  merTools_width = c(w(mt_zero), w(mt_draw)),
  brms_width     = c(w(pp_pop),  w(pp_draw)),
  merTools_cov90 = c(coverage(mt_zero, nd$mathach, 0.90), coverage(mt_draw, nd$mathach, 0.90)),
  brms_cov90     = c(coverage(pp_pop,  nd$mathach, 0.90), coverage(pp_draw, nd$mathach, 0.90)))
```

```
#>    unseen_group merTools_width brms_width merTools_cov90 brms_cov90
#> 1   drop effect       19.96801   20.02855      0.8708487  0.8745387
#> 2 sample effect       20.67910   20.76924      0.8892989  0.8856089
```

So the interval width for an unseen group is set by the *choice*, not the method.
Earlier I compared `predictInterval()`'s default against brms's default, which is
an apples-to-oranges pairing -- the apparent "gap" was the two packages making
different default assumptions, not a deficiency in either.

How much the choice matters depends on how much group variance is left after the
fixed effects. Here `meanses` already absorbs most of the between-school
variation, so dropping vs. sampling the school effect barely changes the width.
Remove `meanses` and the school SD roughly doubles, so the two options diverge:


``` r
m0 <- lmer(mathach ~ ses + (ses | schid), data = train)
data.frame(
  model = c("with meanses", "without meanses"),
  school_SD = c(attr(VarCorr(m_lme)$schid, "stddev")[["(Intercept)"]],
                attr(VarCorr(m0)$schid,    "stddev")[["(Intercept)"]]),
  width_zero = c(w(mt_draws(m_lme, nd, new.levels = "zero")),
                 w(mt_draws(m0,    nd, new.levels = "zero"))),
  width_draw = c(w(mt_draws(m_lme, nd, new.levels = "draw")),
                 w(mt_draws(m0,    nd, new.levels = "draw"))))
```

```
#>             model school_SD width_zero width_draw
#> 1    with meanses  1.598503   19.96801   20.67910
#> 2 without meanses  2.125770   19.94993   21.22576
```

The practical guidance: `predictInterval()`'s default (`new.levels = "zero"`)
answers "what is the expected outcome for a typical member of an unseen group,"
while `new.levels = "draw"` answers "what is the outcome for an unseen group
drawn from the population" -- the same quantity brms returns by default. Pick the
one that matches your question.

## A generalized linear mixed model

The story carries over to GLMMs. We fit a binomial model for whether grouse had
ticks and compare predicted probabilities -- `predictInterval(type =
"probability")` against `brms::posterior_epred()`.


``` r
data(grouseticks)
grouseticks$TICKS_BIN <- as.integer(grouseticks$TICKS >= 1)
grouseticks$cHEIGHT   <- as.numeric(scale(grouseticks$HEIGHT))
gk <- grouseticks
gk$.set <- ifelse(runif(nrow(gk)) < 0.2, "test", "train")
gk_tr <- droplevels(gk[gk$.set == "train", ])
gk_te <- gk[gk$.set == "test" & gk$BROOD %in% levels(gk_tr$BROOD) &
            gk$LOCATION %in% unique(gk_tr$LOCATION), ]

gf <- TICKS_BIN ~ YEAR + cHEIGHT + (1 | BROOD) + (1 | LOCATION)
g_lme <- glmer(gf, family = binomial, data = gk_tr,
               control = glmerControl(optimizer = "bobyqa"))
g_brm <- fit_brm(gf, gk_tr, bernoulli(), "brms_grouseticks")

mt_pr <- mt_draws(g_lme, gk_te, type = "probability", include.resid.var = FALSE)
if (max(mt_pr) > 1 || min(mt_pr) < 0) mt_pr <- plogis(mt_pr)
brm_pr <- posterior_epred(g_brm, newdata = gk_te, allow_new_levels = TRUE)
mt_p <- apply(mt_pr, 2, median); brm_p <- apply(brm_pr, 2, median)
c(correlation = cor(mt_p, brm_p), mean_abs_diff = mean(abs(mt_p - brm_p)))
```

```
#>   correlation mean_abs_diff 
#>    0.99488587    0.02704796
```

``` r
ggplot(data.frame(merTools = mt_p, brms = brm_p), aes(merTools, brms)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "grey50") +
  geom_point(alpha = 0.4, color = "#7570B3") + coord_equal(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(title = "Predicted P(tick): merTools vs brms",
       x = "predictInterval(type = 'probability')", y = "posterior_epred()") +
  theme_minimal(base_size = 12)
```

<div class="figure" style="text-align: center">
<img src="brms-glmm-1.png" alt=""  />
</div>

## Takeaways

- For **observed groups**, `predictInterval()` reproduces brms's point estimates
  and prediction intervals almost exactly, with the same out-of-sample coverage,
  at a tiny fraction of the computational cost.
- For **new groups**, you choose how to treat the unseen effect:
  `new.levels = "zero"` (the default) drops it and predicts from the fixed
  effects, while `new.levels = "draw"` samples it from the estimated covariance
  -- matching brms's `re_formula = NA` and `allow_new_levels = TRUE`
  respectively. Either way the two packages agree; pick the convention that fits
  your question.
- The same agreement holds for **GLMMs** on the probability scale.

In short, when bootstrapping or full MCMC is impractical, `predictInterval()` is
a fast, well-calibrated stand-in for the cases it was designed to handle.
