The glmbayes package constructs envelopes for
accept–reject sampling by evaluating the negative log posterior (f2) and
its gradient (f3) at each point of a tangency grid. The
grid size grows with dimension (e.g., \(3^\ell\) points for \(\ell\) coefficients), and each grid point
can be evaluated independently. These evaluations are
embarrassingly parallel and are a natural target for
GPU acceleration.
The optional OpenCL implementation accelerates these
evaluations by running them on a GPU (or OpenCL-capable CPU). OpenCL is
vendor-neutral and works across NVIDIA, AMD, and Intel hardware. When
OpenCL is available and use_opencl = TRUE, the package uses
GPU-accelerated evaluation inside EnvelopeEval, which is
called by EnvelopeBuild and
EnvelopeDispersionBuild. If OpenCL is not available or
use_opencl = FALSE, the CPU path
(f2_f3_non_opencl) is used instead. The package runs
correctly in either case.
This chapter describes the OpenCL implementation: the two-layer design (kernel wrapper and kernel runner), how the OpenCL program is assembled from ported nmath/rmath functions, and how the pilot and safeguards integrate with the envelope build. For installation and availability checks, see Chapter 12.
The OpenCL path consists of two layers:
| Layer | Function | Location | Role |
|---|---|---|---|
| Wrapper | f2_f3_opencl |
kernel_wrappers.cpp |
Flatten R inputs, select kernel, load sources, call runner, reshape outputs |
| Runner | f2_f3_kernel_runner |
kernel_runners.cpp |
Platform/device setup, buffer creation, kernel launch, readback |
EnvelopeEval (EnvelopeEval.cpp)
|-- if use_opencl && pilot/check passes:
| f2_f3_opencl() -> f2_f3_kernel_runner()
\-- else:
f2_f3_non_opencl() -> CPU famfuncs (f2_f3_binomial_logit, etc.)
EnvelopeEval receives the grid G4
(coefficients at each tangency point), the design matrix, prior
parameters, and family/link. It chooses the GPU or CPU path based on
use_opencl and the outcome of the pilot (when applicable).
The GPU path calls f2_f3_opencl, which assembles the OpenCL
program and invokes f2_f3_kernel_runner; the CPU path calls
f2_f3_non_opencl, which dispatches to the appropriate C++
family function.
OpenCL kernels are not stored as single monolithic files. Instead, the package builds the program by concatenating several source components in a fixed order. This mirrors a C/C++ build where headers and libraries are included before the main source.
The full program string is:
all_src = OPENCL.cl
+ rmath (load_kernel_library("rmath"))
+ dpq (load_kernel_library("dpq"))
+ nmath (load_kernel_library("nmath"))
+ kernel file (load_kernel_source, e.g. "src/f2_f3_binomial_logit.cl")
cl_khr_fp64, cl_khr_printf), IEEE constants
(ML_NAN, ML_POSINF), feature detection for
expm1/log1p, and utility macros.R_D__0, R_DT_val, etc.) used in
give_log/lower_tail logic.f2_f3_binomial_logit.cl) that computes f2 and f3 for each
grid point.load_kernel_source(relative_path) –
Loads a single .cl file from inst/cl/ via
system.file("cl", relative_path, package = "glmbayes").load_kernel_library(subdir) – Loads
all .cl files in a subdirectory (e.g.,
"nmath", "rmath", "dpq"), parses
@provides and @depends annotations, performs a
dependency-aware topological sort, and concatenates the
files so that dependents appear after their dependencies.The @provides and @depends tags allow
modular, maintainable kernel code: each file declares what it provides
and what it needs. See ?load_kernel_library for
details.
The likelihood and gradient computations require statistical functions that match R’s behavior. Because OpenCL C does not include these, the package ports a core set from R’s nmath and rmath libraries to OpenCL C.
| Module | Provides | Purpose |
|---|---|---|
| nmath.cl | ML_, ME_, ISNAN, R_FINITE, ML_ERROR, etc. | Constants, error handling, validation macros |
| log1p.cl | Rlog1p | Log(1+x) for numerical stability |
| expm1.cl | expm1 | exp(x)-1 for numerical stability |
| bd0.cl | bd0, ebd0, etc. | Poisson/binomial deviance terms |
| stirlerr.cl | stirlerr | Stirling error for factorial terms |
| lgamma.cl | lgammafn | Log-gamma |
| gamma.cl | gammafn | Gamma function |
| dbinom.cl | dbinom, dbinom_raw | Binomial density |
| dpois.cl | dpois, dpois_raw | Poisson density |
| dgamma.cl | dgamma | Gamma density |
| dnorm.cl | dnorm4 | Normal density |
| pnorm.cl | pnorm5, pnorm_both | Normal CDF (probit) |
| pgamma.cl | pgamma | Gamma CDF (inverse link) |
| d1mach.cl | d1mach | Machine constants |
| chebyshev.cl | chebyshev_eval | Polynomial evaluation |
Rmath.cl provides additional constants (M_E, M_PI,
M_LN2, etc.) and distribution functions consistent with R’s
Rmath library.
dpq.cl and dpq_prelude.cl provide macros
for density and CDF handling with give_log and
lower_tail, matching R’s DPQ (density, probability,
quantile) conventions.
These ports ensure that the OpenCL kernels produce results numerically consistent with the CPU path and with R itself.
Each supported family/link has a dedicated kernel file that implements the f2/f3 logic for that model.
| Family | Link | Kernel file |
|---|---|---|
| binomial | logit | f2_f3_binomial_logit.cl |
| binomial | probit | f2_f3_binomial_probit.cl |
| binomial | cloglog | f2_f3_binomial_cloglog.cl |
| poisson | log | f2_f3_poisson.cl |
| Gamma | inverse | f2_f3_gamma.cl |
| gaussian | identity | f2_f3_gaussian.cl |
Each kernel follows the same pattern:
int j = get_global_id(0); one work item per grid point
\(j\).qf[j].dbinom_raw, dpois_raw, lgamma,
pnorm5, etc.) as needed.qf[j] = negative log
posterior; grad[k*m1 + j] = gradient (column-major
layout).The kernels use a fixed MAX_L2 for local arrays (e.g.,
64); this limits the number of coefficients when using OpenCL. See the
source for current limits.
The wrapper f2_f3_opencl in
kernel_wrappers.cpp performs:
(family, link)
to kernel_name and kernel_file (e.g., binomial
+ logit → f2_f3_binomial_logit,
src/f2_f3_binomial_logit.cl).all_src.f2_f3_kernel_runner(all_src, kernel_name, l1, l2, m1, ...)
with flattened inputs and output buffers.qf_flat into
an R vector; wrap grad_flat as an Armadillo matrix (m1 ×
l2) for return.The returned list has components qf (negative log
posterior per grid point) and grad (gradient matrix).
The runner in kernel_runners.cpp handles the low-level
OpenCL API:
clGetPlatformIDs, clGetDeviceIDs
(CL_DEVICE_TYPE_DEFAULT).clCreateContext,
clCreateCommandQueueWithProperties.clCreateProgramWithSource
with the concatenated string, clBuildProgram.clCreateKernel with the
kernel name.clEnqueueNDRangeKernel with
global = m1 (one work item per grid point).clEnqueueReadBuffer for qf
and grad.The runner is GLM-specific and lives in the
glmbayes::opencl namespace; the general OpenCL utilities
(kernel loading, device enumeration) are in openclPort.
For large grids (e.g., \(m_1 >
50{,}000\)), EnvelopeEval runs
f2_f3_opencl_pilot before the full evaluation to estimate
runtime.
f2_f3_opencl on a small
grid slice.refined_est_total_sec = per_grid_sec × m1.refined_est_total_sec > 300:
"Do you want to continue? [y/N]: ".This mirrors the parallel sampling pilots in Chapter A09. The envelope build phase cannot be interrupted once the full OpenCL run starts, so the pilot gives users a chance to decline long runs.
The pilot runs when m1_total > 50000 and
use_opencl is true. For smaller grids, the full evaluation
runs without a pilot.
OpenCL support is optional. The package compiles and
runs without it; in that case, use_opencl is ignored and
the CPU path is always used.
To enable OpenCL:
Use has_opencl() to check availability and
diagnose_glmbayes() for a full diagnostic report. See
Chapter 12 for platform-specific installation instructions (Windows,
Linux, macOS) and AMD ROCm notes.
has_opencl, diagnose_glmbayes?EnvelopeBuild,
?EnvelopeEval – use_opencl argument?load_kernel_library,
?load_kernel_source – Program
assembly?glmb,
?rglmb – use_opencl argument