MCMC through the function `xdnuts`

This function is the core of the package; it employs DHMC with varying trajectory lengths. To use it, one must:

Write the formula for a probability distribution, such as a Bayesian posterior density, which does not necessarily need to be normalized: $\pi(\theta|y)$ .
Transform it into potential energy: $U(\theta) = -\log\pi(\theta|y)$ .
Derive the formula for its gradient with respect only to the continuous components. This can be done either analytically or numerically using an R function, such as numDeriv::grad. The first approach requires more manual effort but pays off in terms of execution time.

Remark 1: If the gradient derivation is done analytically, it is always a good idea to check whether it is correct by comparing it with its numerical counterpart!

Once these preliminary steps are completed, they must be translated into R.

Remark 2: The use of other programming languages, such as C++ via the Rcpp package, does not integrate well with the C++ functions compiled into the package. Therefore, if one considers speeding up computation by writing the function for the posterior distribution in a more efficient language, this is usually not a good idea!

This requires the creation of two R objects:

A function to be passed to the nlp argument of the xdnuts function. nlp refers to negative log probability (posterior). The name of this function and its arguments are not important, but it must have three arguments, which are as follows:
1. par, a vector of length $d$ representing a value for $\theta$ . This vector must be ordered such that the first $d-k$ components correspond to the continuous ones, for which the gradient must be evaluated, and the last $k$ components refer to the discontinuous ones, for which no derivation is required.
2. args, a list containing the arguments needed for the negative log posterior computation, namely the data and hyperparameters.
3. eval_nlp, a logical value; if TRUE, the function must return the negative log posterior, hence a scalar value. If FALSE, the function must return the gradient of only the continuous components.
A list of elements to pass to the args argument of the xdnuts function, which refers to the second argument of the function mentioned above.

Remark 3: The package currently cannot handle bounded probability spaces. If there are parameters with bounded support, this must be specified within the nlp function by assigning infinite negative log posterior probability to all values outside the support, or by applying a reparameterization. The latter is recommended as a more elegant solution which allows for easy transformation back to the original parameterization by applying the inverse transformation to the MCMC samples.

After this, we can finally perform MCMC!

Let N_chains be the number of chains we wish to obtain. We must first define a starting value for each chain. This should be placed in a list of N_chains vectors, each of length $d$ . For example, we can simulate it with a function like this:

theta0 <- lapply(seq_len(N_chains),myfunction)

If we wish to perform the sampling in parallel, we can specify parallel = TRUE (always ensure that enough cores are available!). In this case, if there are any external R objects or packages related to the nlp function or the args list, it is necessary to load them on each core. This can be done by providing the names of the objects in a character vector to be passed to the loadLibraries and loadRObject arguments of the xdnuts function.

The length of each chain is specified via the N argument, while thinning is specified by thin (the total number of samples is then N times thin, but only N are returned). K, in uppercase, specifies the number of recycled samples (Nishimura and Dunson (2020), by default, this occurs only during the warm-up phase for improving the estimates of the Mass Matrix $M$ ), while k, in lowercase, defines the number of discontinuous components.

The method argument specifies the type of algorithm; there are several possible choices:

method = 'HMC': This applies the classic termination criterion with a fixed step length $L$ . The value of $L$ must then be specified as another argument of the xdnuts function, for example with L = 10. It is suggested to perform some preliminary runs starting with moderately small values, inspecting the autocorrelation plots (for example, using plots(model,type = 6)),and then increasing it until the autocorrelation appears negligible. To mitigate the drawbacks of using a fixed value for L, it is possible to sample $L$ uniformly in a neighborhood around L. This can be regulated by the L_jitter argument passed to the set_parameters() function, which defines the control argument of the xdnuts function. The interval is constructed as follows: $[\max(1,L-L_{jitter}),L + L_{jitter}]$ .
method = 'XHMC': This applies the virial exhaustion criterion (Betancourt (2016b)) for identifying sufficient trajectory exploration. A threshold $\tau$ must be specified as an argument of the xdnuts function. For example, with tau = 0.1; it is suggested to perform some preliminary runs starting with moderately large values, inspecting the autocorrelation plots (for example, using plots(model,type = 6)), and then decreasing it until the autocorrelation appears negligible.
method = 'NUTS': This applies the No U-Turn Sampler criterion (Hoffman, Gelman, et al. (2014)) for identifying a collapsing trend in the trajectory. The advantage of this approach is that it is completely automatic, requiring no further calibration, but it is not guaranteed to perform efficient exploration.

One must keep in mind that, as shown in Betancourt (2016b), each of these criteria has its flaws. The virial-based method seems the most appealing, as it allows for both manual calibration and varying trajectory lengths dictated by the local geometry of the probability density. However, due to its simplicity and often effective results, the default choice for the xdnuts function is method = 'NUTS'.

Set algorithm’s features using the `set_parameters` function

The output of this function is given to the control argument of the xdnuts function, serving the purpose of regulating more subtle features of the algorithm.

N_init1: Defines the number of iterations for the first calibration of $\epsilon$ .
N_adapt: Defines the duration of the warm-up phase. After this phase, if M_type = 'diagonal' or 'dense', the samples collected are used to estimate the Mass Matrix $\hat{M} = \hat{\Sigma}^{-1}$ , where $\Sigma$ is the posterior covariance matrix. This step is crucial for facilitating exploration of the parameter space, as modifying the covariance matrix of the momentum distribution indirectly applies a rotation to the original probability space.
N_init2: Defines the number of iterations for the second calibration of $\epsilon$ . This is a mandatory step following the estimation of the Mass Matrix. However, even if M_type = 'identity', there are advantages to performing this step, as it can correct anomalies encountered during the first calibration due to an initial chain location far from areas with higher probability mass.
burn_adapt_ratio: A numeric scalar $\in (0,1]$ indicating the ratio of warm-up samples to discard when estimating the covariance matrix of the parameters. Typically, HMC quickly reaches the region with higher probability mass (also known as the Typical Set of the probability space), but if that’s not the case, not discarding enough initial samples can lead to an overestimation of $\Sigma$ (and consequently an underestimation of $M$ ), which can cause mixing issues and inefficiencies.
keep_warm_up: If set to TRUE, stores the warm-up phase for each chain. This can help diagnose insufficient burning of the warm-up samples (e.g., via plot(model, warm_up = TRUE)).
recycle_only_init: If set to TRUE, allows recycling even during the sampling phase.
delta: A vector containing the desired Metropolis acceptance rates, including both global and potential difference-related rates. The default values are 0.8 and 0.6, respectively, but other values may be considered based on chain mixing and the presence of divergent transitions. The latter occur when the precision of the symplectic integrator in certain areas of the probability space is insufficient. This can be addressed by increasing the nominal value (for example, control = set_parameters(delta = c(0.95,0.6))) or by reparametrizing the model, as these issues typically arise when the probability space is characterized by a density with extreme curvature, which can be alleviated through such ad hoc measures. If the second element of the vector is set to NA, the step size calibration is conducted solely through the global acceptance probabilities.
eps_jitter: A numeric scalar that regulates the amount of jittering used to perturb the value of the step size for each iteration of the chain after the warm-up phase. Increasing this value may mitigate the presence of divergent transitions, as it allows smaller values of $\epsilon$ to be used.
max_treedepth: An integer that regulates the maximum depth of the binary tree used to approximate Hamilton’s equations for exploring each energy level set. If many trajectories reach this upper limit, it may indicate the presence of flat areas in the probability density. Increasing this value could make the computational burden grow exponentially.
L_jitter: An integer scalar that regulates the amount of jittering used to perturb the value of the trajectory length if specified to be constant. This occurs when the classic Hamiltonian Monte Carlo algorithm is used with the method = "HMC" option in the xdnuts function. See above for more details.
M_type: A character value specifying the type of Mass Matrix to estimate:
- "identity", no Mass Matrix estimation is done.
- "diagonal", a diagonal Mass Matrix is estimated during the warm-up phase.
- "dense", a full dense Mass Matrix is estimated during the warm-up phase.
refresh: A numeric scalar bounded in $(0,1)$ that regulates the update frequency of the displayed sampling process state. The default value is 0.1, meaning updates occur every 10% of the total samples.
l_eps_init: A numeric scalar representing the logarithm of the initial step size value. This can be used, along with N_init1 = 0 and N_init2 = 0, to prevent an automatic adaptation of $\epsilon$ when its performance is poor.
different_stepsize: When set to TRUE, this option applies a parallel adaptation scheme to ensure that the nominal and empirical refraction rates of each discontinuous component remain closely aligned. To maintain the physical interpretation of the trajectory as solutions to Hamilton’s equations, the different step sizes are derived from the redefinition of the diagonal elements of the discontinuous mass matrix. If set to FALSE, only a global $\epsilon$ is adapted, which penalizes deviations from both the global nominal value and each refraction rate.
M_cont: Allows specification of the Mass Matrix for the continuous components. This may be useful if the model adaptation is done in more sequential steps.
M_disc: The same as above, but concerning the discontinuous components.

Model diagnostic through the `print`, `summary` and `plot` functions

`print`

After adapting the model, you can print the returned XDNUTS object to the console to view its features. The resulting output is self-explanatory, so no further comments are necessary. The main purpose of this function is to provide a quick overview of each chain, summarizing important features that aid in diagnosing potential issues.

`summary`

This function focuses more on the model outcomes rather than the characteristics of the algorithm. In fact, these two aspects are not necessarily related; an algorithm may produce good inference even in the presence of internal issues. Therefore, it’s better to have different functions to describe them. The function returns a list containing several posterior statistics, such as the mean, standard deviation, quantiles specified in the q.val argument, Effective Sample Size (ESS, Geyer (2011)), Potential Scale Reduction Factor (PSRF, Gelman and Rubin (1992)), and Bayesian Fraction of Missing Information (BFMI, Betancourt (2016a)) across all chains.

ESS is a raw measure of the information regarding the posterior distribution contained in the MCMC samples. For the $i$ -th parameter, it is computed as: $\frac{N}{1+2\sum_{m = 1}^{M} \rho_i(k)}$ where $N$ is the length of the chain, $M$ is a sufficiently large integer, and $\rho_i(m)$ is the empirical auto-correlation function of lag $m$ for the $i$ -th parameter. The computation is performed using the coda::effectiveSize function.
PSRF is a statistic that compares the between-chain and within-chain variance of each parameter’s empirical distribution across chains. The idea is that if the variability between chains ( $B$ ) is close to zero then all the chains variability is explained by the within chains one ( $W$ ), then the ratio $R = \frac{B + W}{W}$ should be approximately one. Larger values provide evidence against the null hypothesis of chain convergence. The empirical estimator is obtained using the coda::gelman.diag function, which returns both the statistic and its upper confidence interval. Typically, values greater than $1.1$ are considered indicative of poor convergence, which can often be addressed through reparameterization or by increasing the length of the chains.
BFMI is a summary statistic that assesses the adequacy of the momenta distribution specification. A more detailed explanation of this topic can be found in Betancourt (2016a). The concept is similar to the PSRF statistic discussed earlier: the variance of the energy level set explored by Hamiltonian Monte Carlo can be partitioned using the law of total variance: $\mathbb{V}(E) = \mathbb{V}_{\theta}[\mathbb{E}(E|\theta)] + \mathbb{E}[\mathbb{V}(E|\theta)]$ Thus the ratio $\frac{\mathbb{E}[\mathbb{V}(E|\theta)]}{\mathbb{V}(E)}$ summarizes the portion of energy variability determined by the momenta distribution, which corresponds to the energy probability law conditional on the $\theta$ values. Low values of this ratio suggest that the variability in energy is primarily induced by the distribution of $\theta$ , indicating that the momenta distribution may be sub-optimal. The package supports only Gaussian distributions (for continuous components) and/or Laplace distributions (for discontinuous components), with the covariance matrix as the sole degree of freedom. Denoting the energy level set by $E$ , the empirical estimator of this quantity is given by: $\widehat{BFMI} = \frac{\sum_{i=2}^N(E_i - E_{i-1})^2}{\sum_{i=1}^N (E_i - \bar{E})^2}$ This statistic often yields values greater than $1$ due to estimation error, with values below $0.2$ generally considered problematic.

One common feature concerning both the algorithm and the sample inference is the number of divergent transitions encountered by the trajectories during the sampling process. For this reason, this information is reported in both the print and summary outputs. Such issues indicate both an inefficacy of the algorithm to approximate Hamilton’s equations and potential inference bias due to density underestimation in the area where the symplectic integrator diverges, leading to the bouncing back of the fictitious particle. Possible solutions include:

Reparameterizing the model.
Increasing the algorithm’s precision by setting a higher value for the first element of delta in the set_parameters function.

`plot`

This function allows you to plot various characteristics of the algorithm and samples, which can be selected using the type argument:

type = 1, marginal chains, one for each desired dimension;
type = 2, univariate and bivariate density plots;
type = 3, time series plot of the energy level sets. Good for a quick diagnostic of big models;
type = 4, stickplot of the step-length of each iteration;
type = 5, histograms of the centered marginal energy in gray and of the first differences of energy in red;
type = 6, autoregressive function plot of the parameters;
type = 7, matplot of the empirical acceptance rate and refraction rates.

The which argument accepts either a numerical vector indicating the indices of the parameters of interest or a string:

which = 'continuous' for plotting the first $d-k$ parameters.
which = 'discontinuous' for plotting the last $k$ parameters.

If type = 7, it refers to the rates index instead. When which = 'continuous', only the global acceptance rate is displayed. In contrast, when which = 'discontinuous', the refraction rates are shown.

The which_chains argument selects the chains to be used for constructing the plot, while warm_up = TRUE specifies the use of warm-up samples. This is only applicable for type = 1, 2, 6, as the other necessary warm-up quantities are not stored by the sampler. The chain colors can be customized through the colors argument.

Additional graphical customization options include:

plot.new = TRUE opens a new graphical windows;
gg = FALSE uses classic R plotting instead of plots based on the grammar of graphics paradigm, which is preferable for more interactive customization.
scale = ..., helps make classic R plots more visually appealing by rescaling the axis labels and titles.

Posterior sampling post-processing through the `xdtransform`, `xdextract` functions

It is often useful to use the Monte Carlo samples to compute various posterior quantities, which may be necessary for converting back to the original bounded parameterization, as discussed above. Other possibilities include subsampling from each chain, which can help reduce memory allocation if the samples are highly autocorrelated or to discard initial samples that have not yet reached convergence.

The package provides two functions for this purpose:

`xdtransform`

This function applies a transformation to all or some of the parameter samples while preserving the structure of an XDNUTS object. As a result, diagnostics through the summary and plot functions remain straightforward (the output of the print function does not change, as it pertains to the algorithm’s features rather than the samples themselves).

The which argument selects which parameters the transformation specified in the FUN argument should be applied to. FUN can accept multiple arguments, but the first must correspond to the parameters of interest, while the others can be specified directly in the input of the xdtransform function. If which = NULL, the function is applied to all parameters. The new parameter names can be specified through the new.names argument.

For thinning or burning the initial samples, the thin and burn arguments can be used, respectively.

`xdextract`

This function extracts the raw posterior samples into a more manageable R object, specifically a matrix or an array. While this format facilitates post-computation of the MCMC samples, it does lose the structure of the XDNUTS object.

In this case, the which argument can take either a numerical vector indicating the indices of the parameters of interest or a string:

which = 'continuous', for plotting the first $d-k$ parameters.
which = 'discontinuous', for plotting the last $k$ parameters.

Additionally, the which_chains argument allows you to select specific chains to extract. This is useful if some chains become stuck in local modes with negligible probability mass, as discarding them helps prevent bias in the Monte Carlo estimates.

If collapse = TRUE, the MCMC output is stored in a matrix rather than an array, which loses the information about parallel chain identifiers but makes it easier to manage.

As before, the thin and burn arguments can be used for thinning or burning the initial samples, respectively.

Practical examples

Example with both continuous and discontinuous components

Consider the following probabilistic model $\begin{align*} X & \sim Negbin(r,p) \notag \\ p & \sim Beta(a_0,b_0) \notag \\ r & \sim Unif(\{1,\dotsc,X\}) \notag \end{align*}$ $X$ represents the number of trials necessary to obtain $r$ successes in a series of independent and identically distributed events with probability $p$ . We can generate an arbitrary value for $X$ and set the hyperparameters $a_0$ and $b_0$ as desired:

#observed data
X <- 50

#hyperparameteers
a0 <- 10
b0 <- 10

#list of arguments
arglist <- list(data = X, hyp =  c(a0,b0))

Since $p$ is a continuous parameter and $r$ is positive discrete, the classic Hamiltonian Monte Carlo algorithms is not defined, due to lack of differentiability. To address this, we can use the method described by Nishimura, Dunson, and Lu (2020) . First, we need to specify a continuous version of $r$ , which we will denote as $\tilde{r}$ . This continuous variable $\tilde{r}$ is defined on the open interval $(1,X+1]$ . Next, consider the trivial sequence $\{a_r\}_1^{X+1}: a_r = r$ . To transfer the mass probabilities of $\mathbb{P}(r) = \frac{1}{X} \,,r = 1,\dotsc,X$ to the continuous interval spanning from $a_{r}$ to $a_{r+1}$ we divide by the length of this interval. Using this approach and the indicator function, the probability density function of $\tilde{r}$ can be given by: $\pi_{\tilde{R}}(\tilde{r}) = \displaystyle\sum_{r = 1}^X \frac{\pi_R(r)}{a_{r+1} - a_r} \mathcal{I}(a_r < \tilde{r} \leq a_{r+1})$ Since the support of this parameter is bounded, it is preferable to apply a transformation that maps $\tilde{r}$ to $\mathbb{R}$ . Any bijective function can be used for this purpose; for example, the logistic function serves this need: $\hat{r} = \log\left(\frac{\tilde{r} - 1}{ (X+1) - \tilde{r}}\right)$ The new discontinuous probability density can be derived from the cumulative distribution function. Interestingly, this approach yields the same result as applying the Jacobian determinant of the inverse transformation. $\pi_{\hat{R}}(\hat{r}) = \displaystyle\sum_{r = 1}^X \frac{\pi_R(r)}{a_{r+1} - a_r} \mathcal{I}(a_r < \tilde{r} \leq a_{r+1}) \cdot ( (X+1) - 1) \cdot\left(\frac{1}{1+e^{-\hat{r}}}\right)\cdot\left(\frac{e^{-\hat{r}}}{1+e^{-\hat{r}}}\right)$ where $\tilde{r} = 1 + \frac{(X+1) - 1}{1 + e^{-\hat{r}}}$ and since $\pi_R(r) = \frac{1}{X}$ the final result gives: $\pi_{\hat{R}}(\hat{r}) = \displaystyle\sum_{r = 1}^X \frac{\mathcal{I}(a_r < \tilde{r} \leq a_{r+1})}{a_{r+1} - a_r} \cdot\left(\frac{1}{1+e^{-\hat{r}}}\right)\cdot\left(\frac{e^{-\hat{r}}}{1+e^{-\hat{r}}}\right)$ Since both the sequence $\{a_r\}$ and the inverse logistic function are monotonic, and the original probability mass has been canceled out, the summation has no effect because there are no values of $\hat{r}$ that map to more than one interval, and the probability mass within those intervals remains constant. Therefore, the final result is: $\pi_{\hat{R}}(\hat{r}) = \left(\frac{1}{1+e^{-\hat{r}}}\right)\cdot\left(\frac{e^{-\hat{r}}}{1+e^{-\hat{r}}}\right)$

For the same reason of disliking bounded probability spaces, we can apply a similar transformation to map the support of $p$ from $(0,1)$ to $\mathbb{R}$ , Define: $\omega = \log\left(\frac{p}{1-p}\right)$ which yields the following non-normalized prior distribution for $\omega$ : $\pi(\omega) = \left(\frac{1}{1+e^{-\omega}}\right)^{a_0}\cdot\left(\frac{e^{-\omega}}{1+e^{-\omega}}\right)^{b_0}$ The resulting posterior distribution in this parameterization is given by: $\pi(\omega,\hat{r}|X) \propto \binom{X-1}{r-1} \cdot \left(\frac{1}{1+e^{-\omega}}\right)^{r + a_0}\cdot\left(\frac{e^{-\omega}}{1+e^{-\omega}}\right)^{X - r + b_0 } \cdot\left(\frac{1}{1+e^{-\hat{r}}}\right)\cdot\left(\frac{e^{-\hat{r}}}{1+e^{-\hat{r}}}\right)$ Where $r = \left\lceil 1 + \frac{(X+1)-1}{1+e^{-\hat{r}}} \right\rceil - 1$ .

Ultimately, it can be easily shown that the negative logarithm of the posterior distribution and its derivative with respect to $\omega$ are as follows: $\begin{align*} -\log\pi(\omega,\hat{r}|X) \propto -\displaystyle\sum_{i = X - r + 1}^{X - 1} \log(i) &+ \displaystyle\sum_{i = 1}^{r-1}\log(i) + (X + a_0+b_0)\log\left(1+e^{-\omega}\right) \notag \\ &+ (X - r + b_0)\omega + \hat{r} + 2\log\left(1+e^{-\hat{r}}\right) \notag \end{align*}$ $\frac{\partial-\log\pi(\omega,\hat{r}|X)}{\partial\omega} = (X - r + b_0) - (X+a_0+b_0)\frac{e^{-\omega}}{1+e^{-\omega}}$ We can define an R function to evaluate these quantities. To be compatible with the package, the first argument should be the parameter values, the second should be the list of arguments necessary to evaluate the posterior, and the third should be a logical value: if TRUE, the function must return the negative logarithm of the posterior distribution; otherwise, it should return the gradient with respect to the continuous components, which, in this case, refers to the first element of the parameter vector.

#function for the negative log posterior and its gradient
#with respect to the continuous components
nlp <- function(par,args,eval_nlp = TRUE){
  
  if(eval_nlp){
    #only the negative log posterior
    
    #overflow
    if(any(par > 30)) return(Inf) 
    
    #conversion of the r value
    r <- ceiling(1 + args$data*plogis(par[2])) - 1
    
    #ensure that r is not zero
    if(r == 0){
      r <- 1
    }
    
    #output
    out <- sum(log(seq_len(r-1))) + 
      (args$data + args$hyp[1] + args$hyp[2])*log(1+exp(-par[1])) + 
      par[1]*(args$data - r + args$hyp[2]) + par[2] + 2*log(1+exp(-par[2]))
    if(r > 2) out <- out - sum(log(seq(args$data - r + 1,args$data - 1)))
    
    return(out)
    
  }else{
    #only the gradient
    
    #overflow
    if(any(par > 30)) return(Inf) 
    
    #conversion of the r value
    r <- ceiling(1 + args$data*plogis(par[2])) - 1
    
    #conversion of the r value
    r <- ceiling(1 + args$data*plogis(par[2])) - 1
    
    #output
    return( (args$data - r + args$hyp[2]) - 
              (args$data + args$hyp[1] + args$hyp[2])*(1-plogis(par[1])) )
  }
  
}

To run the algorithm, use the xdnuts function. For instance, if you want to run 4 chains, you need to provide a list of 4 initial vectors as the first argument.

#MCMC
set.seed(1)
chains <- xdnuts(theta0 = lapply(1:4,function(x) c(omega = rnorm(1),r_hat = rnorm(1))),
                 nlp = nlp,
                 args = arglist,
                 k = 1,
                 thin = 1,
                 parallel = FALSE,
                 method = "NUTS",
                 hide = TRUE)

After sampling, you can examine the model output with the print function.

chains

NUTS algorithm on 4 chains
Parameter dimension: 2
Number of discontinuous components: 1
Step size stochastic optimization procedure:
 - penalize deviations from nominal global acceptance rate
 - penalize deviations from nominal refraction rate 

Number of iteration: 1000
Number of recycled samples from each iteration: 0
Thinned samples: 1
Total sample from the posterior: 4000
Chains statistics:

 - E: energy of the system
 - dE: energy first differce
 - EBFMI = Var(E)/Var(dE): Empirical Bayesian Fraction of Missing Information
   A value lower than 0.2 it is a sign of sub optimal momenta distribution.
 - eps: step size of the algorithm
 - L: number of step done per iteration
 - alpha0: global acceptance rate
 - alpha1: refraction rate

        Me(E) Me(dE) EBFMI Me(eps) Me(L) alpha0 alpha1
chain1 20.351 -0.006 1.230   0.608     3  0.922  0.572
chain2 20.384  0.034 1.210   0.563     3  0.937  0.622
chain3 20.124  0.002 1.355   0.586     3  0.922  0.564
chain4 20.245  0.054 1.217   0.519     3  0.931  0.575

This will provide information about the algorithm used and some diagnostic metrics. Next, you can examine the results using the plot function:

plot(chains)

By default, it displays the trace plots for each marginal chain, which is useful for checking if the chains are mixing well. To view the marginal and bivariate densities, specify type = 2.

plot(chains, type = 2, gg = FALSE)

Examining the marginal densities can help diagnose poor convergence of the chains, though this does not seem to be an issue here.

Another useful plot can be generated with the type = 3 argument. This option overlays the energy level Markov chains. While it is less sensitive to convergence issues, it provides a summary of the global Markov process in a single chain, making it especially valuable for large models.

plot(chains, type = 3)

With type = 4, a stick plot of the trajectory lengths is displayed, with data from different chains overlaid. This plot serves as a good proxy for the computational cost of the procedure, as each step requires evaluating both the negative logarithmic posterior and its gradient multiple times.

plot(chains, type = 4)

With type = 5, the histogram of the marginal energy levels (in gray) is overlaid with the histogram of the corresponding proposal values (in red).

plot(chains, type = 5)

As with the classic Metropolis scheme, when the proposal distribution matches the target distribution well, there is effective exploration of the parameter space where the probability mass is highest. A poor match would indicate a suboptimal choice of the transition kernel, which in this case is determined solely by the choice of the momenta distribution. This would be reflected in a more leptokurtic red histogram compared to the gray histogram. For more details on this topic, see Betancourt (2016a).

With type = 6, the function plots the autocorrelation function for each chain and parameter. This is an effective way to diagnose slow exploration of the posterior distribution.

plot(chains, type = 6)

With type = 7, the function plots the empirical Metropolis acceptance rate and the refraction rate of the discontinuous component for each chain. This plot helps diagnose suboptimal adaptation of the step-size.

plot(chains, type = 7)

Last but not least, we can summarize the chain’s output using the summary function.

summary(chains)
       mean    sd     5%    25%   50%   75%   95%     ESS R_hat R_hat_upper_CI
omega 0.080 0.451 -0.661 -0.227 0.081 0.383 0.832 916.295 1.001          1.005
r_hat 0.074 0.530 -0.785 -0.276 0.072 0.412 0.979 886.159 1.002          1.005

Multivariate Gelman Test:  1.004
Estimated Bayesian Fraction of Missing Information:  0.803

Quantities of interest for each marginal distribution are returned, with the last two columns referring to the Potential Scale Reduction Factor statistic from Gelman and Rubin (1992). Values greater than 1.1 indicate non-convergence of the chains. Below the table, the multivariate version of this test is printed to the console, along with the Bayesian Fraction of Missing Information from Betancourt (2016a). If the latter is less than 0.2, it suggests slow exploration of the energy distribution. The latter is calculated by aggregating the energy from all chains. To see the value specific to each chain, use the print function, as showed above.

Example with only discontinuous components

It is possible to reuse the same models adopted previously while treating the first parameters as inducing a discontinuity as well. In this case, you should set k=2 in the xdnuts function. For teaching purposes, this time the algorithm with the exhaustion termination criterion from Betancourt (2016b) will be used, specified by setting method = "XHMC". In this scenario, you must specify a threshold value for this method. A reasonable but sub-optimal value for this threshold seems to be tau = 0.01.

#MCMC
set.seed(1)
chains <- xdnuts(theta0 = lapply(1:4,function(x) rnorm(2)),
                 nlp = nlp,
                 args = arglist,
                 k = 2,
                 thin = 1,
                 parallel = FALSE,
                 method = "XHMC",
                 hide = TRUE,
                 tau = 0.01)

This time, we transform the chains back to their original parameterization using the function xdtransform

#define the function to be applied to each sample
f <- function(x,XX) {
  c(
    plogis(x[1]), #inverse logistic for the probability
    ceiling(1 + XX*plogis(x[2])) - 1 #number of trials
  )
}
original_chains <- xdtransform(X = chains, which = NULL,
                               FUN = f,XX = arglist$data,
                               new.names = c("p","r"))

To avoid repetitions, not all plots are displayed.

plot(original_chains, type = 2, gg = FALSE)

The densities for each chain behave more appropriately, primarily due to reduced autocorrelation.

plot(original_chains, type = 4)

The improved performance observed earlier comes at the cost of an increased number of density evaluations.

plot(original_chains, type = 6)

As indicated by the first graph, the autocorrelations have significantly decreased, leading to faster convergence of the estimates.

summary(original_chains)
    mean    sd     5%    25%   50%    75%   95%      ESS R_hat R_hat_upper_CI
p  0.521 0.104  0.347  0.449  0.52  0.594  0.69 3557.410 1.002          1.005
r 26.542 6.158 16.000 22.000 27.00 31.000 37.00 3452.369 1.001          1.003

Multivariate Gelman Test:  1.002
Estimated Bayesian Fraction of Missing Information:  0.777

It is noteworthy that the Effective Sample Size (Geyer (2011)) has increased.

Example with only continuous components

For this example, we can use a Gaussian hierarchical model applied to the blood viscosity dataset, which is available in the package:

data(viscosity)
viscosity
  id Time.1 Time.2 Time.3 Time.4 Time.5 Time.6 Time.7
1  1     68     42     69     64     39     66     29
2  2     49     52     41     56     40     43     20
3  3     41     40     26     33     42     27     35
4  4     33     27     48     54     42     56     19
5  5     40     45     50     41     37     34     42
6  6     30     42     35     44     49     25     45

#create the list function
arglist <- list(data =  as.matrix(viscosity[,-1]),
                hyp = c(0, #mean a priori for mu
                        1000, #variance a priori for mu
                        0.5,1,0.5,1 #inverse gamma iperparameters
                        )
                )

This dataset contains 7 blood viscosity measurements for each of 6 different subjects. The model considered is as follows:

$\begin{align*} y_{ij} &\sim N(a_j,\sigma^2) \quad, i = 1,\dotsc,I\,,j = 1,\dotsc, J \notag \\ a_j &\sim N(\mu,\sigma_a^2) \notag \\ \mu &\sim N(m_0,s_0^2) \notag \\ \sigma^2 &\sim InvGamma(a_0,b_0) \notag \\ \sigma_a^2 &\sim InvGamma(a_1,b_1) \notag \end{align*}$

with $I = 7$ , $J = 6$ and $n = I\times J$ .

The variance parameters are transformed using a logarithmic scale to facilitate the algorithm’s exploration of the parameter space: $\omega = \log\sigma^2$ and $\omega_a = \log\sigma^2_a$ . In this case, only the final results are presented, as the intermediate calculations are not necessary.

$\begin{align*} -\log\pi(\theta|y) \propto &\quad\omega(n + a_0) + 0.5e^{-\omega}\left[2b_0 + \sum_{i,j} (y_{i,j} - a_j)^2 \right] \notag \\ &+ \omega_a(J + a_1) + 0.5e^{-\omega_a}\left[2b_1 + \sum_{j=1}^J(a_j - \mu)^2\right] + \frac{\mu^2 - 2\mu m_0}{2s_0^2} \notag \end{align*}$

$\begin{align*} \frac{\partial-\log\pi(\theta|y)}{\partial \mu} &= \frac{\mu - m_0}{s_0^2} - e^{-\omega_a}\displaystyle\sum_{j=1}^J (a_j - \mu)^2 \notag \\ \frac{\partial-\log\pi(\theta|y)}{\partial \omega} &= (n+a_0) - 0.5e^{-\omega}\left[2b_0 + \sum_{i,j} (y_{i,j} - a_j)^2\right] \notag \\ \frac{\partial-\log\pi(\theta|y)}{\partial \omega_a} &= (J+a_1) - 0.5e^{-\omega_a}\left[ 2b_1 + \displaystyle\sum_{j=1}^J (a_j - \mu)^2 \right] \end{align*}$

These computations are summarized in the following R function:

nlp <- function(par,args,eval_nlp = TRUE){
    
    if(eval_nlp){
      #only the negative log posterior

      #overflow
      if(any(abs(par[2:3]) > 30)) return(Inf)
      
      return(par[2] * ( prod(dim(args$data)) + args$hyp[3] ) + 
               (sum( (args$data - par[-(1:3)])^2 ) + 
                  2*args$hyp[4])*exp(-par[2])/2 +
               par[3] * (length(par[-(1:3)]) + 
                           args$hyp[5]) +  
               (sum( (par[-(1:3)] - par[1])^2 ) + 
                  2+args$hyp[6])*exp(-par[3])/2 +
               (par[1] - args$hyp[1])^2/2/args$hyp[2])
      
    }else{
      #only the gradient
      
      #overflow
      if(any(abs(par[2:3]) > 30)) return(rep(Inf,9))
      
      c(
        -sum( par[-(1:3)] - par[1] )*exp(-par[3]) + (
          par[1] - args$hyp[1])/args$hyp[2], #mu derivative
        
        prod(dim(args$data)) + args$hyp[3] - 
          (sum( (args$data - par[-(1:3)])^2 ) +
             2*args$hyp[4])*exp(-par[2])/2, #omega derivative
        
        length(par[-(1:3)]) + args$hyp[5] - 
          (sum( (par[-(1:3)] - par[1])^2 ) + 
             2+args$hyp[6])*exp(-par[3])/2, #omega_a derivative
        
        -apply(args$data - par[-(1:3)],1,sum)*exp(-par[2]) + 
          (par[-(1:3)] - par[1])*exp(-par[3]) #random effects gradient
      )
      
    }
  
}

The derivation of the gradient is recommended for computational efficiency; however, it is always possible to calculate it numerically using methods like the numDeriv::grad function.

Since this is the only algorithm not yet explored, we will use the standard Hamiltonian Monte Carlo with a fixed trajectory length L. In practice, the trajectory length L is sampled uniformly from the interval [max(1,L - L_jitter),L + L_jitter]. By default, L_jitter is set to 1, while L must be specified by the user. A sub-optimal but reasonable value for L is 31.

The current model is notably difficult to explore due to the centered parameterization (Gelfand, Sahu, and Carlin (1995)), so it seems reasonable to extend the warm-up phase to 1e3 for each chain. To determine if the default value of burn_adapt_ratio is reasonable, we save the warm-up samples and then inspect them.

#MCMC
set.seed(1)
chains <- xdnuts(theta0 = lapply(1:4,function(x) {
                      out <- rnorm(3 + 6)
                      names(out) <- c("mu","log_s2","log_s2_a",
                                      paste0("mu",1:6))
                      out}),
                 nlp = nlp,
                 args = arglist,
                 k = 0, #no discontinuous components
                 thin = 1,
                 parallel = FALSE,
                 method = "HMC",
                 hide = TRUE,
                 L = 31,
                 control = set_parameters(L_jitter = 5,
                                          N_adapt = 1e3,
                                          keep_warm_up = TRUE))

plot(chains,warm_up = TRUE)

As we can see, for the warm up phase, the default burn-in of $10\%$ seems reasonable.

We can transform the variance components back from their logarithmic parameterization using the xdtransform function. This time, only these components should be selected.

original_chains <- xdtransform(X = chains,which = 2:3,
                               FUN = exp,new.names = c("s2","s2_a"))

Since the parameter dimension is relatively high, we can examine the chain of energy level sets for a mixing diagnostic.

plot(original_chains, type = 3)

Although in a Bayesian context the only difference concerns the informativeness of the priors, we can plot the density plots of the fixed and random parameters separately.

plot(original_chains, type = 2, which = 1:3, gg = FALSE, scale = 0.5)#fixed

plot(original_chains, type = 2, which = 4:9, gg = FALSE, scale = 0.5)#random

Next, it is useful to check the autocorrelation plots for each chain:

plot(original_chains, type = 6)

As expected from theory, the centered parameterization of the model makes exploring the random effects variance components challenging, which is reflected in the strong autocorrelation of their chains.
Finally, the summary function:

summary(original_chains, q.val = c(0.05,0.5,0.95))
       mean     sd     5%    50%    95%      ESS R_hat R_hat_upper_CI
mu   41.824  1.339 39.612 41.867 44.016 4386.130 1.000          1.001
s2   71.180 11.373 54.595 70.178 91.548 3953.434 1.000          1.001
s2_a  0.868  1.177  0.233  0.549  2.467  856.378 1.084          1.103
mu1  42.704  1.691 40.099 42.623 45.466 2498.228 1.004          1.007
mu2  41.926  1.524 39.415 41.938 44.365 4349.091 1.001          1.002
mu3  41.296  1.601 38.606 41.383 43.821 3260.347 1.004          1.006
mu4  41.691  1.500 39.207 41.736 44.105 4831.208 1.004          1.009
mu5  41.786  1.507 39.306 41.810 44.229 3902.424 1.001          1.002
mu6  41.595  1.498 39.115 41.623 44.009 3973.342 0.999          0.999

Multivariate Gelman Test:  1.007
Estimated Bayesian Fraction of Missing Information:  1.444

By using the xdextract function, you can rearrange the chains into a more convenient format, such as an array or a matrix. This is useful for computing posterior quantities, including model predictions.

#extract samples into matrix
theta <- xdextract(original_chains,collapse = TRUE)

#compute prediction
y_hat <- sapply(1:6, function(i){
  rnorm(NROW(theta),theta[,3+i],sqrt(theta[,2]))
})

#plot prediction
op <- par(no.readonly = TRUE)
par(mar=c(5.1, 4.1, 2.1, 4.1), xpd=TRUE)
plot(NULL, xlim = c(1,6), ylim = c(15,85), xlab = "Subject",
     ylab =  "Viscosity", bty = "L")
for(i in 1:6){
  #data
  points(rep(i,7),viscosity[i,-1], pch = 16)
  
  #data 95% credible intervals for the prediction
  lines(rep(i,2),quantile(y_hat[,i],c(0.025,0.975)), col = 3, lwd = 3)
  
  #random effects 95% credible intervals
  lines(rep(i,2),quantile(theta[,3+i],c(0.025,0.975)), col = 4, lwd = 4)
}
legend("topright",inset=c(-0.2,-0.2), lty = 1, lwd = 2, col = c(3,4),
       title = "95% Credible Intervals",
       legend = c("prediction","random effects"),
       bty = "n", bg = "transparent", cex = 0.8)

par(op)

The investigation of the first subject, which appears to be an outlier, is beyond the scope of this analysis. However, the hierarchical model’s property of borrowing strength across subjects seems to be effective.

Bibliography

Betancourt, Michael. 2016a. “Diagnosing Suboptimal Cotangent Disintegrations in Hamiltonian Monte Carlo.” arXiv Preprint arXiv:1604.00695.

———. 2016b. “Identifying the Optimal Integration Time in Hamiltonian Monte Carlo.” arXiv Preprint arXiv:1601.00225.

Gelfand, Alan E, Sujit K Sahu, and Bradley P Carlin. 1995. “Efficient Parametrisations for Normal Linear Mixed Models.” Biometrika 82 (3): 479–88.

Gelman, Andrew, and Donald B Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7 (4): 457–72.

Geyer, Charles J. 2011. “Introduction to Markov Chain Monte Carlo.” Handbook of Markov Chain Monte Carlo 20116022 (45): 22.

Hoffman, Matthew D, Andrew Gelman, et al. 2014. “The No-u-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” J. Mach. Learn. Res. 15 (1): 1593–623.

Nishimura, Akihiko, and David Dunson. 2020. “Recycling Intermediate Steps to Improve Hamiltonian Monte Carlo.” Bayesian Analysis 15 (4). https://doi.org/10.1214/19-ba1171.

Nishimura, Akihiko, David B Dunson, and Jianfeng Lu. 2020. “Discontinuous Hamiltonian Monte Carlo for Discrete Parameters and Discontinuous Likelihoods.” Biometrika 107 (2): 365–80.

Tutorial for main functions

MCMC through the function xdnuts

Set algorithm’s features using the set_parameters function

Model diagnostic through the print, summary and plot functions

print

summary

plot

Posterior sampling post-processing through the xdtransform, xdextract functions

xdtransform

xdextract