Once you have OpenCL installed (Chapter 01) and understand how to assemble a program from annotated library shards (Chapter 02), the next step is to wire GPU computation into your R package in a maintainable way.
glmbayes provides the canonical reference implementation
of this pattern. It uses a two-layer architecture — a
kernel runner that manages the raw OpenCL API, and a kernel
wrapper that adapts R inputs and outputs — together with a “fail
gracefully” strategy that falls back to CPU execution whenever OpenCL is
unavailable or slow.
This chapter describes both layers and the fail-gracefully pattern so
you can replicate them in your own package. All source code referenced
here lives in glmbayes; see
?glmbayes::EnvelopeEval for entry points.
The two-layer design cleanly separates concerns:
| Layer | Name | File | Responsibility |
|---|---|---|---|
| Wrapper | f2_f3_opencl |
kernel_wrappers.cpp |
Flatten R inputs, select kernel, assemble program, call runner, reshape outputs |
| Runner | f2_f3_kernel_runner |
kernel_runners.cpp |
Platform/device setup, buffer management, kernel launch, result readback |
EnvelopeEval (R or C++)
|-- if use_opencl && pilot/check passes:
| f2_f3_opencl() -> f2_f3_kernel_runner()
\-- else:
f2_f3_non_opencl() -> CPU family functions
EnvelopeEval receives a coefficient grid, design matrix,
prior parameters, and family/link specification. It selects the GPU or
CPU path and dispatches accordingly. The GPU path calls
f2_f3_opencl, which assembles the OpenCL program and
invokes f2_f3_kernel_runner; the CPU path calls
f2_f3_non_opencl, which dispatches to the appropriate C++
function.
Every family/link has a dedicated .cl file that
implements the f2 (negative log posterior) and f3 (gradient)
computations for one model. Each kernel follows the same five-step
pattern:
One OpenCL work item handles one grid point j. The total
number of work items equals m1 (the size of the tangency
grid), so all grid points are evaluated in parallel.
// tmp[k] = [P * (B_j - mu)]_k
double tmp[MAX_L2];
for (int k = 0; k < l2; ++k) {
double acc = 0.0;
for (int ell = 0; ell < l2; ++ell)
acc += P[k*l2 + ell] * (B[j*l2 + ell] - mu[ell]);
tmp[k] = acc;
}
// Quadratic form 0.5 * (B_j - mu)' * P * (B_j - mu)
double res_acc = 0.0;
for (int k = 0; k < l2; ++k)
res_acc += (B[j*l2 + k] - mu[k]) * tmp[k];
res_acc *= 0.5;For each observation i, compute the linear predictor,
apply the link function, add the likelihood contribution to
res_acc, and accumulate the gradient. This step calls the
ported nmath functions (e.g. dnorm4, dgamma,
dbinom_raw, pnorm5):
for (int i = 0; i < l1; ++i) {
double dot = alpha[i];
for (int k = 0; k < l2; ++k)
dot += X[k*l1 + i] * B[j*l2 + k];
// Gaussian example: -log dnorm(y | mean=dot, sd=1/sqrt(wt))
double sd_i = 1.0 / sqrt(wt[i]);
double ll = dnorm4(y[i], dot, sd_i, 1); // give_log = 1
res_acc -= ll;
double resid = wt[i] * (dot - y[i]);
for (int k = 0; k < l2; ++k)
g_loc[k] += X[k*l1 + i] * resid;
}qf[j] holds the negative log posterior for grid point
j; grad holds the gradient in column-major
layout (dimension l2 × m1).
| Family | Link | Kernel file | nmath calls |
|---|---|---|---|
| binomial | logit | f2_f3_binomial_logit.cl |
dbinom_raw |
| binomial | probit | f2_f3_binomial_probit.cl |
dbinom_raw, dnorm4,
pnorm5 |
| binomial | cloglog | f2_f3_binomial_cloglog.cl |
dbinom_raw, expm1 |
| Poisson | log | f2_f3_poisson.cl |
(uses OpenCL builtin lgamma) |
| Gamma | inverse | f2_f3_gamma.cl |
dgamma |
| Gaussian | identity | f2_f3_gaussian.cl |
dnorm4 |
Each file is annotated with @calls_nmath,
@depends_nmath, and @all_depends_nmath (see
Chapter 02, §6) so the kernel runner loads only the library shards it
needs.
The wrapper (f2_f3_opencl in
kernel_wrappers.cpp) is an Rcpp-exported function. Its
responsibilities are:
Input flattening — convert R
NumericMatrix / NumericVector to contiguous
std::vector<double> in the memory layout the runner
expects (flattenMatrix, copyVector from
openclPort).
Kernel selection — map
(family, link) to a kernel name and kernel file path:
Program assembly — call
load_library_for_kernel() for the minimal nmath subset,
then concatenate:
Runner call — invoke
f2_f3_kernel_runner(all_src, kernel_name, ...).
Output reshaping — wrap the flat qf
and grad results into R NumericVector and
arma::mat for return to R.
The wrapper can use opencltools C++ utilities by adding
to the downstream package’s DESCRIPTION:
LinkingTo: opencltools, Rcpp, RcppArmadillo
This gives access to openclPort.h which declares
flattenMatrix, copyVector,
configureOpenCL, and
load_library_for_kernel.
The runner (f2_f3_kernel_runner in
kernel_runners.cpp) handles the low-level OpenCL API. Each
call is stateless — it sets up a complete OpenCL context, launches the
kernel, reads back results, and releases all resources:
1. clGetPlatformIDs / clGetDeviceIDs (CL_DEVICE_TYPE_DEFAULT)
2. clCreateContext / clCreateCommandQueueWithProperties
3. clCreateProgramWithSource(all_src) + clBuildProgram
4. clCreateKernel(kernel_name)
5. Create read/write buffers; copy input data to device
6. clEnqueueNDRangeKernel (global_size = m1, local_size = NULL)
7. clEnqueueReadBuffer for qf and grad
8. Sanity check: if both outputs are all-zero, throw (kernel build failure)
9. Release buffers, kernel, program, queue, context
The runner is GLM-specific (it knows the buffer layout for X, B, mu,
P, alpha, y, wt, qf, grad) but the OpenCL plumbing portion is fully
generic. The general utilities — kernel loading, device enumeration,
fp64 probing — live in openclPort and are
re-exported by opencltools for use by any downstream
package.
The fail-gracefully pattern is the key architectural decision that makes GPU acceleration optional in a downstream package. The idea is that every GPU-dispatching function has a CPU fallback of identical interface:
// In EnvelopeEval.cpp (glmbayes)
if (use_opencl && opencl_pilot_ok) {
result = f2_f3_opencl(X, B, mu, P, alpha, y, wt, family, link);
} else {
result = f2_f3_non_opencl(X, B, mu, P, alpha, y, wt, family, link);
}f2_f3_non_opencl is a plain C++ function that calls the
same family functions (f2_f3_binomial_logit,
f2_f3_gaussian, etc.) without any OpenCL involvement.
This pattern means:
has_opencl() returning
FALSE — the package still installs and works;
use_opencl is silently ignored.For large workloads (e.g. a tangency grid with > 50,000 points),
glmbayes runs a pilot before committing to
the full GPU evaluation. The pilot:
refined_est_total_sec = per_grid_sec × m1.refined_est_total_sec > 300:
This pattern is reusable in any package that dispatches large parallel workloads to a GPU. The 5-minute threshold and prompt wording can be tuned to the specific workload.
glmbayes calls configureOpenCL() during
package initialization to probe the OpenCL device for native
expm1 and log1p support. The probe compiles a
tiny test kernel and checks the result:
expm1(0) accurately → define
USE_OPENCL_EXPM1.log1p(0) accurately → define
USE_OPENCL_LOG1P.These flags are passed as -D options to
clBuildProgram and control whether the nmath shims
(expm1.cl, log1p.cl) use the native builtin or
the ported fallback.
// In your package's initialization (e.g., called from .onLoad)
OpenCLConfig cfg = configureOpenCL();
// cfg.build_options contains -DUSE_OPENCL_EXPM1 etc. as appropriate
// Pass cfg.build_options to clBuildProgramconfigureOpenCL is declared in openclPort.h
and is available to any package that lists opencltools
under LinkingTo.
has_opencl()load_library_for_kernel()?load_kernel_source,
?load_kernel_library?load_library_for_kernelglmbayes::EnvelopeEval,
glmbayes::EnvelopeBuild — reference
implementationNygren, K. N., & Nygren, L. M. (2006). Likelihood subgradient densities. Journal of the American Statistical Association, 101(475), 1144–1156. https://doi.org/10.1198/016214506000000357
The f2_f3_* kernels documented here evaluate negative
log-posterior and gradient terms that feed the likelihood-subgradient
envelope construction from this paper.
Nygren, K. N. (2026). opencltools: OpenCL Tools for R Package
Developers. R package. citation("opencltools").