Chapter A10: Accelerated EnvelopeBuild Implementation using OpenCL

1. Introduction

The glmbayes package constructs envelopes for accept–reject sampling by evaluating the negative log posterior (f2) and its gradient (f3) at each point of a tangency grid. The grid size grows with dimension (e.g., \(3^\ell\) points for \(\ell\) coefficients), and each grid point can be evaluated independently. These evaluations are embarrassingly parallel and are a natural target for GPU acceleration.
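To make the growth concrete, the sketch below counts the points of a grid with three levels per coefficient and decodes a flat grid index into per-coefficient levels. The helper names `grid_size` and `grid_levels` are hypothetical and not part of glmbayes; the three-levels-per-coefficient layout is taken from the \(3^\ell\) example above, and the base-3 index decoding is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of grid points with 3 levels per coefficient.
std::size_t grid_size(unsigned l) {
    std::size_t n = 1;
    for (unsigned k = 0; k < l; ++k) n *= 3;
    return n;
}

// Hypothetical helper: decode flat index j into one level (0, 1, or 2)
// per coefficient, reading j as a base-3 number (assumed ordering).
std::vector<int> grid_levels(std::size_t j, unsigned l) {
    assert(j < grid_size(l));
    std::vector<int> levels(l);
    for (unsigned k = 0; k < l; ++k) {
        levels[k] = static_cast<int>(j % 3);
        j /= 3;
    }
    return levels;
}
```

With 10 coefficients, `grid_size(10)` is 59049 points, each of which can be handed to a separate work item.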

The optional OpenCL implementation accelerates these evaluations by running them on a GPU (or OpenCL-capable CPU). OpenCL is vendor-neutral and works across NVIDIA, AMD, and Intel hardware. When OpenCL is available and use_opencl = TRUE, the package uses GPU-accelerated evaluation inside EnvelopeEval, which is called by EnvelopeBuild and EnvelopeDispersionBuild. If OpenCL is not available or use_opencl = FALSE, the CPU path (f2_f3_non_opencl) is used instead. The package runs correctly in either case.

This chapter describes the OpenCL implementation: the two-layer design (kernel wrapper and kernel runner), how the OpenCL program is assembled from ported nmath/rmath functions, and how the pilot and safeguards integrate with the envelope build. For installation and availability checks, see Chapter 16.

2. Architecture Overview

The OpenCL path consists of two layers:

Layer	Function	Location	Role
Wrapper	`f2_f3_opencl`	`kernel_wrappers.cpp`	Flatten R inputs, select kernel, load sources, call runner, reshape outputs
Runner	`f2_f3_kernel_runner`	`kernel_runners.cpp`	Platform/device setup, buffer creation, kernel launch, readback

2.1 Call Path

EnvelopeEval (EnvelopeEval.cpp)
    |-- if use_opencl && pilot/check passes:
    |       f2_f3_opencl() -> f2_f3_kernel_runner()
    \-- else:
            f2_f3_non_opencl() -> CPU famfuncs (f2_f3_binomial_logit, etc.)

EnvelopeEval receives the grid G4 (coefficients at each tangency point), the design matrix, prior parameters, and family/link. It chooses the GPU or CPU path based on use_opencl and the outcome of the pilot (when applicable). The GPU path calls f2_f3_opencl, which assembles the OpenCL program and invokes f2_f3_kernel_runner; the CPU path calls f2_f3_non_opencl, which dispatches to the appropriate C++ family function.

3. Program Construction

OpenCL kernels are not stored as single monolithic files. Instead, the package builds the program by concatenating several source components in a fixed order. This mirrors a C/C++ build where headers and libraries are included before the main source.

3.1 Assembly order (production)

Runtime GPU assembly uses load_likelihood_subgradient_program in src/kernel_loader.cpp (see inst/cl/README.md). The prelude, the shims, and a selected subset of nmath/ are loaded from nmathopencl; the entry kernels come from glmbayes (inst/cl/src/f2_f3_*.cl).

For interactive exploration, mirror that split with opencltools:

all_src = opencltools::load_kernel_source("OPENCL.cl", package = "nmathopencl")
        + opencltools::load_kernel_library("libR_shims", package = "nmathopencl")
        + … (R_ext_types, R_shims, R_ext_runtime, R_ext_internals, System)
        + opencltools::load_library_for_kernel(
              system.file("cl/src/f2_f3_binomial_logit.cl", package = "glmbayes"),
              system.file("cl/nmath", package = "nmathopencl"),
              depends_tag = "all_depends_nmath")
        + opencltools::load_kernel_source("src/f2_f3_binomial_logit.cl", package = "glmbayes")

OPENCL.cl and shim layers – loaded from nmathopencl.
nmath subset – resolved from the @all_depends_nmath tag on the entry kernel; the files come from nmathopencl inst/cl/nmath/.
Kernel file – the model-specific kernel from glmbayes inst/cl/src/.

3.2 opencltools loaders

opencltools::load_kernel_source(relative_path, package = …) – Loads a single .cl file via system.file("cl", …).
opencltools::load_kernel_library(subdir, package = …) – Loads all .cl files in a subdirectory with @provides / @depends topological sort.
opencltools::load_library_for_kernel() – Minimal nmath subset from @all_depends_nmath tags (see ?opencltools::load_library_for_kernel).

The @provides and @depends tags allow modular, maintainable kernel code. See ?opencltools::load_kernel_library for details.
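The ordering idea behind the tags can be sketched as a topological sort followed by concatenation. The code below is an illustration only: the `Module` struct and `assemble` function are hypothetical, the tags are assumed to be already parsed into symbol lists, and the actual resolution rules in opencltools may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory view of one .cl file and its parsed tags.
struct Module {
    std::string name;                   // e.g. "lgamma.cl"
    std::vector<std::string> provides;  // symbols listed under @provides
    std::vector<std::string> depends;   // symbols listed under @depends
    std::string source;                 // file contents
};

// Order modules so that each one follows the modules providing its
// dependencies (Kahn's algorithm), then concatenate the sources.
std::string assemble(const std::vector<Module>& mods) {
    std::map<std::string, std::size_t> provider;  // symbol -> module index
    for (std::size_t i = 0; i < mods.size(); ++i)
        for (const auto& s : mods[i].provides) provider[s] = i;

    std::vector<std::vector<std::size_t>> users(mods.size());
    std::vector<std::size_t> pending(mods.size(), 0);
    for (std::size_t i = 0; i < mods.size(); ++i)
        for (const auto& d : mods[i].depends) {
            auto it = provider.find(d);
            if (it == provider.end())
                throw std::runtime_error("unresolved dependency: " + d);
            if (it->second == i) continue;  // symbol provided by the same file
            users[it->second].push_back(i);
            ++pending[i];
        }

    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < mods.size(); ++i)
        if (pending[i] == 0) ready.push(i);

    std::string all_src;
    std::size_t emitted = 0;
    while (!ready.empty()) {
        std::size_t i = ready.front();
        ready.pop();
        all_src += mods[i].source + "\n";
        ++emitted;
        for (std::size_t u : users[i]) {
            assert(pending[u] > 0);
            if (--pending[u] == 0) ready.push(u);
        }
    }
    if (emitted != mods.size())
        throw std::runtime_error("dependency cycle among modules");
    return all_src;
}
```

A module that depends on a symbol no file provides, or a cycle between modules, is reported as an error in this sketch rather than silently producing a program that fails at clBuildProgram time.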

4. Ported Math Libraries (nmath / rmath / dpq)

The likelihood and gradient computations require statistical functions that match R’s behavior. Because OpenCL C does not include these, the package ports a core set from R’s nmath and rmath libraries to OpenCL C.

4.1 nmath

Module	Provides	Purpose
nmath.cl	ML_, ME_, ISNAN, R_FINITE, ML_ERROR, etc.	Constants, error handling, validation macros
log1p.cl	Rlog1p	log(1+x) for numerical stability
expm1.cl	expm1	exp(x)-1 for numerical stability
bd0.cl	bd0, ebd0, etc.	Poisson/binomial deviance terms
stirlerr.cl	stirlerr	Stirling error for factorial terms
lgamma.cl	lgammafn	Log-gamma
gamma.cl	gammafn	Gamma function
dbinom.cl	dbinom, dbinom_raw	Binomial density
dpois.cl	dpois, dpois_raw	Poisson density
dgamma.cl	dgamma	Gamma density
dnorm.cl	dnorm4	Normal density
pnorm.cl	pnorm5, pnorm_both	Normal CDF (probit)
pgamma.cl	pgamma	Gamma CDF (inverse link)
d1mach.cl	d1mach	Machine constants
chebyshev.cl	chebyshev_eval	Polynomial evaluation

4.2 rmath

Rmath.cl provides additional constants (M_E, M_PI, M_LN2, etc.) and distribution functions consistent with R’s Rmath library.

4.3 dpq

dpq.cl and dpq_prelude.cl provide macros for density and CDF handling with give_log and lower_tail, matching R’s DPQ (density, probability, quantile) conventions.

These ports ensure that the OpenCL kernels produce results numerically consistent with the CPU path and with R itself.

5. Family/Link Kernels

Each supported family/link has a dedicated kernel file that implements the f2/f3 logic for that model.

Family	Link	Kernel file
binomial	logit	`f2_f3_binomial_logit.cl`
binomial	probit	`f2_f3_binomial_probit.cl`
binomial	cloglog	`f2_f3_binomial_cloglog.cl`
poisson	log	`f2_f3_poisson.cl`
Gamma	inverse	`f2_f3_gamma.cl`
gaussian	identity	`f2_f3_gaussian.cl`

5.1 Kernel Structure

Each kernel follows the same pattern:

Work-item mapping – int j = get_global_id(0); one work item per grid point \(j\).
Prior term – Compute \(P(\beta_j - \mu)\) and the quadratic form \(\tfrac{1}{2}(\beta_j - \mu)' P (\beta_j - \mu)\), accumulate into qf[j].
Prior gradient – Initialize local gradient with \(P(\beta_j - \mu)\).
Data loop – For each observation \(i\), compute the linear predictor, the link transformation, the likelihood contribution, and the gradient contribution. Use ported functions (dbinom_raw, dpois_raw, lgammafn, pnorm5, etc.) as needed.
Write outputs – qf[j] = negative log posterior; grad[k*m1 + j] = gradient (column-major layout).

The kernels use a fixed MAX_L2 for local arrays (e.g., 64); this limits the number of coefficients when using OpenCL. See the source for current limits.
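The pattern above can be emulated on the CPU for a single work item. The sketch below is not the package's kernel: it assumes a binomial/logit model with y as a proportion, wt as a weight, and alpha as an offset, it drops terms that are constant in the coefficients (the real kernels use ported functions such as dbinom_raw, so values may differ by an additive constant), and it assumes column-major layouts for X (l1 × l2), B and grad (m1 × l2), and P (l2 × l2). The function name `work_item` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU emulation of one work item j under the assumptions stated above.
void work_item(std::size_t j, std::size_t l1, std::size_t l2, std::size_t m1,
               const std::vector<double>& X, const std::vector<double>& B,
               const std::vector<double>& mu, const std::vector<double>& P,
               const std::vector<double>& alpha, const std::vector<double>& y,
               const std::vector<double>& wt,
               std::vector<double>& qf, std::vector<double>& grad) {
    assert(j < m1);
    std::vector<double> g(l2, 0.0);  // local gradient (kernels use a fixed MAX_L2 array)
    double f = 0.0;

    // Prior: g = P (beta_j - mu), f = 0.5 (beta_j - mu)' P (beta_j - mu)
    for (std::size_t r = 0; r < l2; ++r) {
        for (std::size_t c = 0; c < l2; ++c)
            g[r] += P[r + c * l2] * (B[c * m1 + j] - mu[c]);
        f += 0.5 * (B[r * m1 + j] - mu[r]) * g[r];
    }

    // Data loop: logistic negative log-likelihood, softplus(eta) - y * eta
    for (std::size_t i = 0; i < l1; ++i) {
        double eta = alpha[i];
        for (std::size_t k = 0; k < l2; ++k) eta += X[i + k * l1] * B[k * m1 + j];
        double softplus = std::max(eta, 0.0) + std::log1p(std::exp(-std::fabs(eta)));
        double p = 1.0 / (1.0 + std::exp(-eta));
        f += wt[i] * (softplus - y[i] * eta);
        for (std::size_t k = 0; k < l2; ++k)
            g[k] += wt[i] * (p - y[i]) * X[i + k * l1];
    }

    // Outputs: one scalar and one gradient row per grid point
    qf[j] = f;
    for (std::size_t k = 0; k < l2; ++k) grad[k * m1 + j] = g[k];
}
```

Looping `work_item` over j = 0, ..., m1 - 1 reproduces what the NDRange launch does in parallel, which makes a sketch like this useful for checking a kernel's output on a small grid.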

6. f2_f3_opencl Flow

The wrapper f2_f3_opencl in kernel_wrappers.cpp performs:

Input flattening – Convert R matrices/vectors (x, b, mu, P, alpha, y, wt) to contiguous C++ vectors in the layout expected by the runner.
Kernel selection – Map (family, link) to kernel_name and kernel_file (e.g., binomial + logit → f2_f3_binomial_logit, src/f2_f3_binomial_logit.cl).
Program assembly – Load OPENCL.cl, rmath, nmath, dpq, and the kernel file; concatenate into all_src.
Runner call – Invoke f2_f3_kernel_runner(all_src, kernel_name, l1, l2, m1, ...) with flattened inputs and output buffers.
Output reshaping – Copy qf_flat into an R vector; wrap grad_flat as an Armadillo matrix (m1 × l2) for return.

The returned list has components qf (negative log posterior per grid point) and grad (gradient matrix).
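The kernel selection step can be pictured as a table lookup. The sketch below is hypothetical: `KernelChoice` and `select_kernel` are illustrative names, the entries are copied from the family/link table in Section 5, and the assumption that each kernel name equals its file stem is confirmed by the text only for binomial/logit.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

struct KernelChoice {
    std::string kernel_name;  // assumed to equal the file stem
    std::string kernel_file;  // path relative to inst/cl
};

// Hypothetical lookup from (family, link) to kernel name and file.
KernelChoice select_kernel(const std::string& family, const std::string& link) {
    static const std::map<std::pair<std::string, std::string>, std::string> stems = {
        {{"binomial", "logit"},    "f2_f3_binomial_logit"},
        {{"binomial", "probit"},   "f2_f3_binomial_probit"},
        {{"binomial", "cloglog"},  "f2_f3_binomial_cloglog"},
        {{"poisson",  "log"},      "f2_f3_poisson"},
        {{"Gamma",    "inverse"},  "f2_f3_gamma"},
        {{"gaussian", "identity"}, "f2_f3_gaussian"},
    };
    auto it = stems.find({family, link});
    if (it == stems.end())
        throw std::invalid_argument("no OpenCL kernel for " + family + "/" + link);
    return {it->second, "src/" + it->second + ".cl"};
}
```

Failing loudly on an unsupported pair is a design choice of this sketch; how the package itself reacts to an unsupported family/link on the OpenCL path is not specified here.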

7. f2_f3_kernel_runner

The runner in kernel_runners.cpp handles the low-level OpenCL API:

Platform and device – clGetPlatformIDs, clGetDeviceIDs (CL_DEVICE_TYPE_DEFAULT).
Context and queue – clCreateContext, clCreateCommandQueueWithProperties.
Program – clCreateProgramWithSource with the concatenated string, clBuildProgram.
Kernel – clCreateKernel with the kernel name.
Buffers – Create read-only buffers for X, B, mu, P, alpha, y, wt; write-only for qf and grad. Copy host data for inputs.
Launch – clEnqueueNDRangeKernel with global = m1 (one work item per grid point).
Readback – clEnqueueReadBuffer for qf and grad.
Sanity check – If both outputs are all zeros, throw (likely kernel failure).
Cleanup – Release buffers, kernel, program, queue, context.
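The sanity check step needs no OpenCL calls and can be sketched on its own. The function names `all_zero` and `check_outputs` and the error message are hypothetical; only the rule (throw when both outputs are entirely zero) comes from the list above.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <vector>

// True when every element is exactly zero.
bool all_zero(const std::vector<double>& v) {
    return std::all_of(v.begin(), v.end(), [](double x) { return x == 0.0; });
}

// Hypothetical post-readback check: if both outputs are entirely zero,
// treat it as a likely kernel failure and throw.
void check_outputs(const std::vector<double>& qf, const std::vector<double>& grad) {
    if (all_zero(qf) && all_zero(grad))
        throw std::runtime_error("OpenCL kernel returned all-zero qf and grad");
}
```

Requiring both buffers to be zero keeps legitimate cases, such as a zero gradient at an exact mode, from being flagged.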

The runner is GLM-specific and lives in the glmbayes::opencl namespace; the general OpenCL utilities (kernel loading, device enumeration) are in openclPort.

8. Pilot and Safeguards

For large grids (\(m_1 > 50{,}000\); see Section 8.2), EnvelopeEval runs f2_f3_opencl_pilot before the full evaluation to estimate runtime.

8.1 Pilot Logic

Warm-up – Run f2_f3_opencl on a small grid slice.
Calibration – Run on slices of ~1% and ~2% of the grid to estimate the fixed cost and the cost per grid point.
Refined estimate – Compute refined_est_total_sec = per_grid_sec × m1.
5-minute safeguard – If refined_est_total_sec > 300:
- Interactive session: Prompt "Do you want to continue? [y/N]: ".
- Non-interactive session: Proceed automatically (e.g., CI, batch).
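One way to separate the fixed cost from the per-grid cost is a straight-line fit through the two calibration timings. The sketch below assumes that approach; the package's actual calibration arithmetic is not shown in this chapter, and `PilotEstimate`, `calibrate`, and `exceeds_safeguard` are hypothetical names. The total follows the formula given above, per_grid_sec × m1.

```cpp
#include <cassert>
#include <cstddef>

struct PilotEstimate {
    double fixed_sec;              // estimated per-call overhead
    double per_grid_sec;           // estimated cost per grid point
    double refined_est_total_sec;  // per_grid_sec * m1
};

// Hypothetical two-point calibration from timings t1 and t2 (seconds)
// measured on slices of n1 < n2 grid points.
PilotEstimate calibrate(double t1, std::size_t n1, double t2, std::size_t n2,
                        std::size_t m1) {
    assert(n2 > n1);
    double per_grid = (t2 - t1) / static_cast<double>(n2 - n1);
    double fixed = t1 - per_grid * static_cast<double>(n1);
    return {fixed, per_grid, per_grid * static_cast<double>(m1)};
}

// 5-minute safeguard: true when an interactive user should be prompted.
bool exceeds_safeguard(const PilotEstimate& e) {
    return e.refined_est_total_sec > 300.0;
}
```

For example, with m1 = 100000, a 1000-point slice taking 4.5 s and a 2000-point slice taking 8.5 s give about 0.004 s per grid point, about 0.5 s of fixed cost, and an estimated total of about 400 s, which is above the 300 s threshold.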

This mirrors the parallel sampling pilots in Chapter A09. The envelope build phase cannot be interrupted once the full OpenCL run starts, so the pilot gives users a chance to decline long runs.

8.2 When the Pilot Runs

The pilot runs when m1_total > 50000 and use_opencl is true. For smaller grids, the full evaluation runs without a pilot.

9. Installation and Availability

OpenCL support is optional. The package compiles and runs without it; in that case, use_opencl is ignored and the CPU path is always used.

To enable OpenCL:

Install OpenCL headers and development libraries (see Chapter 16).
Build the package from source with OpenCL enabled (configure script detects OpenCL).
Ensure an OpenCL runtime (ICD loader, vendor driver) is available at runtime.

Use has_opencl() to check availability and diagnose_glmbayes() for a full diagnostic report. See Chapter 16 for platform-specific installation instructions (Windows, Linux, macOS) and AMD ROCm notes.

10. Cross-References

Chapter A05 – Simulation pipeline (optimization, standardization, envelope build, sampling)
Chapter A08 – Envelope-related functions
Chapter A09 – Parallel sampling (pilot pattern, 5-minute safeguard)
Chapter 16 – GPU acceleration: installation, has_opencl, diagnose_glmbayes
?EnvelopeBuild, ?EnvelopeEval – use_opencl argument
?opencltools::load_kernel_library, ?opencltools::load_kernel_source – Program assembly
?glmb, ?rglmb – use_opencl argument

Kjell Nygren

2026-07-19
