Chapter A10: Accelerated EnvelopeBuild Implementation using OpenCL

Kjell Nygren

2026-04-30

library(glmbayes)

Chapter A10: Accelerated EnvelopeBuild Implementation using OpenCL

1. Introduction

The glmbayes package constructs envelopes for accept–reject sampling by evaluating the negative log posterior (f2) and its gradient (f3) at each point of a tangency grid. The grid size grows with dimension (e.g., \(3^\ell\) points for \(\ell\) coefficients), and each grid point can be evaluated independently. These evaluations are embarrassingly parallel and are a natural target for GPU acceleration.

The optional OpenCL implementation accelerates these evaluations by running them on a GPU (or OpenCL-capable CPU). OpenCL is vendor-neutral and works across NVIDIA, AMD, and Intel hardware. When OpenCL is available and use_opencl = TRUE, the package uses GPU-accelerated evaluation inside EnvelopeEval, which is called by EnvelopeBuild and EnvelopeDispersionBuild. If OpenCL is not available or use_opencl = FALSE, the CPU path (f2_f3_non_opencl) is used instead. The package runs correctly in either case.

This chapter describes the OpenCL implementation: the two-layer design (kernel wrapper and kernel runner), how the OpenCL program is assembled from ported nmath/rmath functions, and how the pilot and safeguards integrate with the envelope build. For installation and availability checks, see Chapter 12.

2. Architecture Overview

The OpenCL path consists of two layers:

Layer Function Location Role
Wrapper f2_f3_opencl kernel_wrappers.cpp Flatten R inputs, select kernel, load sources, call runner, reshape outputs
Runner f2_f3_kernel_runner kernel_runners.cpp Platform/device setup, buffer creation, kernel launch, readback

2.1 Call Path

EnvelopeEval (EnvelopeEval.cpp)
    |-- if use_opencl && pilot/check passes:
    |       f2_f3_opencl() -> f2_f3_kernel_runner()
    \-- else:
            f2_f3_non_opencl() -> CPU famfuncs (f2_f3_binomial_logit, etc.)

EnvelopeEval receives the grid G4 (coefficients at each tangency point), the design matrix, prior parameters, and family/link. It chooses the GPU or CPU path based on use_opencl and the outcome of the pilot (when applicable). The GPU path calls f2_f3_opencl, which assembles the OpenCL program and invokes f2_f3_kernel_runner; the CPU path calls f2_f3_non_opencl, which dispatches to the appropriate C++ family function.

3. Program Construction

OpenCL kernels are not stored as single monolithic files. Instead, the package builds the program by concatenating several source components in a fixed order. This mirrors a C/C++ build where headers and libraries are included before the main source.

3.1 Assembly Order

The full program string is:

all_src = OPENCL.cl
        + rmath (load_kernel_library("rmath"))
        + dpq   (load_kernel_library("dpq"))
        + nmath (load_kernel_library("nmath"))
        + kernel file (load_kernel_source, e.g. "src/f2_f3_binomial_logit.cl")
  1. OPENCL.cl – Global configuration: extensions (cl_khr_fp64, cl_khr_printf), IEEE constants (ML_NAN, ML_POSINF), feature detection for expm1/log1p, and utility macros.
  2. rmath – Mathematical constants (M_E, M_PI, etc.) and distribution function declarations.
  3. dpq – R-style density/CDF macros (R_D__0, R_DT_val, etc.) used in give_log/lower_tail logic.
  4. nmath – Ported numerical routines (bd0, stirlerr, lgamma, dbinom, dpois, pnorm, etc.).
  5. Kernel file – The model-specific kernel (e.g., f2_f3_binomial_logit.cl) that computes f2 and f3 for each grid point.

3.2 load_kernel_source and load_kernel_library

  • load_kernel_source(relative_path) – Loads a single .cl file from inst/cl/ via system.file("cl", relative_path, package = "glmbayes").
  • load_kernel_library(subdir) – Loads all .cl files in a subdirectory (e.g., "nmath", "rmath", "dpq"), parses @provides and @depends annotations, performs a dependency-aware topological sort, and concatenates the files so that dependents appear after their dependencies.

The @provides and @depends tags allow modular, maintainable kernel code: each file declares what it provides and what it needs. See ?load_kernel_library for details.

4. Ported Math Libraries (nmath / rmath / dpq)

The likelihood and gradient computations require statistical functions that match R’s behavior. Because OpenCL C does not include these, the package ports a core set from R’s nmath and rmath libraries to OpenCL C.

4.1 nmath

Module Provides Purpose
nmath.cl ML_, ME_, ISNAN, R_FINITE, ML_ERROR, etc. Constants, error handling, validation macros
log1p.cl Rlog1p Log(1+x) for numerical stability
expm1.cl expm1 exp(x)-1 for numerical stability
bd0.cl bd0, ebd0, etc. Poisson/binomial deviance terms
stirlerr.cl stirlerr Stirling error for factorial terms
lgamma.cl lgammafn Log-gamma
gamma.cl gammafn Gamma function
dbinom.cl dbinom, dbinom_raw Binomial density
dpois.cl dpois, dpois_raw Poisson density
dgamma.cl dgamma Gamma density
dnorm.cl dnorm4 Normal density
pnorm.cl pnorm5, pnorm_both Normal CDF (probit)
pgamma.cl pgamma Gamma CDF (inverse link)
d1mach.cl d1mach Machine constants
chebyshev.cl chebyshev_eval Polynomial evaluation

4.2 rmath

Rmath.cl provides additional constants (M_E, M_PI, M_LN2, etc.) and distribution functions consistent with R’s Rmath library.

4.3 dpq

dpq.cl and dpq_prelude.cl provide macros for density and CDF handling with give_log and lower_tail, matching R’s DPQ (density, probability, quantile) conventions.

These ports ensure that the OpenCL kernels produce results numerically consistent with the CPU path and with R itself.

6. f2_f3_opencl Flow

The wrapper f2_f3_opencl in kernel_wrappers.cpp performs:

  1. Input flattening – Convert R matrices/vectors (x, b, mu, P, alpha, y, wt) to contiguous C++ vectors in the layout expected by the runner.
  2. Kernel selection – Map (family, link) to kernel_name and kernel_file (e.g., binomial + logit → f2_f3_binomial_logit, src/f2_f3_binomial_logit.cl).
  3. Program assembly – Load OPENCL.cl, rmath, nmath, dpq, and the kernel file; concatenate into all_src.
  4. Runner call – Invoke f2_f3_kernel_runner(all_src, kernel_name, l1, l2, m1, ...) with flattened inputs and output buffers.
  5. Output reshaping – Copy qf_flat into an R vector; wrap grad_flat as an Armadillo matrix (m1 × l2) for return.

The returned list has components qf (negative log posterior per grid point) and grad (gradient matrix).

7. f2_f3_kernel_runner

The runner in kernel_runners.cpp handles the low-level OpenCL API:

  1. Platform and deviceclGetPlatformIDs, clGetDeviceIDs (CL_DEVICE_TYPE_DEFAULT).
  2. Context and queueclCreateContext, clCreateCommandQueueWithProperties.
  3. ProgramclCreateProgramWithSource with the concatenated string, clBuildProgram.
  4. KernelclCreateKernel with the kernel name.
  5. Buffers – Create read-only buffers for X, B, mu, P, alpha, y, wt; write-only for qf and grad. Copy host data for inputs.
  6. LaunchclEnqueueNDRangeKernel with global = m1 (one work item per grid point).
  7. ReadbackclEnqueueReadBuffer for qf and grad.
  8. Sanity check – If both outputs are all zeros, throw (likely kernel failure).
  9. Cleanup – Release buffers, kernel, program, queue, context.

The runner is GLM-specific and lives in the glmbayes::opencl namespace; the general OpenCL utilities (kernel loading, device enumeration) are in openclPort.

8. Pilot and Safeguards

For large grids (e.g., \(m_1 > 50{,}000\)), EnvelopeEval runs f2_f3_opencl_pilot before the full evaluation to estimate runtime.

8.1 Pilot Logic

  • Warm-up – Run f2_f3_opencl on a small grid slice.
  • Calibration – Run on slices of ~1% and ~2% of the grid to estimate fixed cost and per-grid cost.
  • Refined estimate – Compute refined_est_total_sec = per_grid_sec × m1.
  • 5-minute safeguard – If refined_est_total_sec > 300:
    • Interactive session: Prompt "Do you want to continue? [y/N]: ".
    • Non-interactive session: Proceed automatically (e.g., CI, batch).

This mirrors the parallel sampling pilots in Chapter A09. The envelope build phase cannot be interrupted once the full OpenCL run starts, so the pilot gives users a chance to decline long runs.

8.2 When the Pilot Runs

The pilot runs when m1_total > 50000 and use_opencl is true. For smaller grids, the full evaluation runs without a pilot.

9. Installation and Availability

OpenCL support is optional. The package compiles and runs without it; in that case, use_opencl is ignored and the CPU path is always used.

To enable OpenCL:

Use has_opencl() to check availability and diagnose_glmbayes() for a full diagnostic report. See Chapter 12 for platform-specific installation instructions (Windows, Linux, macOS) and AMD ROCm notes.

10. Cross-References