--- title: "Chapter A09: Parallel Sampling Implementation using RcppParallel" author: "Kjell Nygren" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Chapter A09: Parallel Sampling Implementation using RcppParallel} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(glmbayes) ``` # Chapter A09: Parallel Sampling Implementation using RcppParallel ## 1. Introduction The `glmbayes` package uses **parallel sampling** for many envelope-based samplers because generating iid draws can be computationally expensive and acceptance rates may be low. By default, `use_parallel = TRUE`, so large fits often run on multiple cores via RcppParallel. However, RcppParallel-based sampling **cannot be interrupted**. Once the main parallel loop starts, the R session is blocked until it completes. Users may experience the sampler as "freezing" during long runs. To address this, the package runs **pilots** that estimate how long sampling will take and, in interactive sessions, asks users to opt in or out before the non-interruptible phase begins. This chapter describes the parallel sampling implementation, the pilot logic, and the interactive safeguards. Parallel **envelope construction** (e.g., in `EnvelopeDispersionBuild`) uses a similar pilot pattern and is noted briefly. GPU acceleration via OpenCL is covered in Chapter A10. ## 2. Where Parallel Sampling Is Used Parallel sampling is used in two main C++ paths: | Entry point | When | Notes | |-------------|------|-------| | `rNormalGLM_cpp` | Non-Gaussian GLMs (Poisson, binomial, Gamma, etc.) with `use_parallel = TRUE` and `n > 1` | Called by `rNormal_reg` | | `rIndepNormalGammaReg_cpp` | Gaussian with independent Normal–Gamma prior, `use_parallel = TRUE` and `n > 1` | Uses `rIndepNormalGammaReg_std_parallel_cpp` | **Serial fallback:** When `use_parallel = FALSE` or `n == 1`, the serial path is used instead. The serial path includes `Rcpp::checkUserInterrupt()` in its loop, so it can be interrupted (e.g., with Ctrl+C). ## 3. Why Parallel Sampling Cannot Be Interrupted RcppParallel uses Intel TBB (Threading Building Blocks) to distribute work across multiple threads. The R main thread is blocked while `RcppParallel::parallelFor()` runs. R's `R_CheckUserInterrupt()` is not safe to call from worker threads, and there is no supported way to cancel a running TBB task from R. As a result: - During the parallel phase, Ctrl+C and similar interrupts do not reliably stop the run. - The session may appear to hang until the run completes. The pilot phase mitigates this by estimating runtime *before* the full parallel run and giving users a chance to decline. ## 4. Pilot-Based Time Estimation and Opt-In The pilot flow consists of: 1. **Pilot run** – A small serial or parallel batch (e.g., 1 draw or a few draws) to measure per-draw or per-observation time. 2. **Calibration run** – A larger parallel batch (typically up to ~1% of `n`, capped so it takes at most ~5 minutes) to refine the time estimate. 3. **Time estimate** – Extrapolation to the full sample size `n`. 4. **Interactive safeguard** – If the estimate exceeds 5 minutes, the user is prompted in interactive sessions. 5. **Non-interactive behavior** – In non-interactive sessions (e.g., R CMD check, batch jobs), the run proceeds automatically without prompting. **Threshold:** 300 seconds (5 minutes). Above this, the user is asked whether to continue. **Prompt:** `"Estimated simulation exceeds 5 minutes. Continue? [y/N]: "` (or similar). **Responses:** `y`, `yes`, `1`, or `continue` → proceed; `n`, `no`, `2`, or empty → stop with an informative error. ## 5. rNormalGLM Pilot Logic (Non-Gaussian GLMs) For non-Gaussian GLMs (Poisson, binomial, Gamma, etc.), the `rNormalGLM` C++ path uses `run_rcppparallel_pilot()`: ### 5.1 Single-Draw Test A single draw is generated in **serial** mode to measure per-draw time and to detect problems early. ### 5.2 max_draws Cap and Zero Accepts If the pilot hits an internal `max_draws` cap with **zero** accepted draws, this indicates the envelope may be insufficiently tight. The code warns that: - The envelope may be too loose for the posterior. - The full run has no cap and cannot be interrupted. - If acceptance remains zero, the simulation may appear to hang indefinitely. Recommended actions include: - Set `use_opencl = TRUE` or increase the requested number of draws (which affects `n_envopt` and may tighten the envelope). - Try a different `Gridtype` to force a tighter envelope. - Strengthen the prior. An interactive prompt then asks: `"Enter 1 to continue full run, 2 to stop and return partial results: "`. If the user chooses 2, partial test results are returned and the full run is aborted. ### 5.3 Calibration Run and Refined Estimate The calibration batch size is: \[ m_{\text{stage}} = \min\left( \lceil 0.01 \cdot n \rceil,\; \lfloor 300000 / \text{est\_per\_draw\_ms} \rfloor \right), \] where 300000 ms $\approx$ 5 minutes. The calibration run uses `parallelFor(0, m_stage, worker)` to measure per-draw time empirically. From the calibration: - `per_candidate_sec` = calibration elapsed time / total candidates used - `est_per_draw_sec` = `per_candidate_sec` × `E_draws` (expected candidates per accepted draw) - `est_total_sec` = `est_per_draw_sec` × `n` ### 5.4 Five-Minute Safeguard If `est_total_sec > 300`: - **Interactive:** Prompt `"Do you want to continue? [y/N]: "` - **Non-interactive:** Proceed automatically with a note. ### 5.5 Diagnostics (verbose = TRUE) With `verbose = TRUE`, the pilot reports: - Estimated simulation time for `n` draws - Note that the phase uses RcppParallel and cannot be safely interrupted - Calibration elapsed time and number of draws - `avg_candidates_per_draw` (empirical) vs `E_draws` - `per_candidate_sec` and `est_per_draw_sec` ## 6. rIndepNormalGammaReg Pilot Logic For Gaussian regression with independent Normal–Gamma priors, the pilot logic is similar in spirit but tailored to joint \((\beta, \phi)\) sampling: ### 6.1 Single-Draw Test One observation is generated in serial mode to measure per-observation time. ### 6.2 Calibration Run \[ m_{\text{stage}} = \min\left( \lceil 0.01 \cdot n \rceil,\; \lfloor 300000 / \text{per\_obs\_ms\_serial} \rfloor \right). \] The calibration run is parallel: `parallelFor(0, m_stage, worker)`. ### 6.3 Time Estimate - `per_obs_sec` = calibration elapsed time / `m_stage` - `est_total_sec` = `per_obs_sec` × `n` ### 6.4 Five-Minute Safeguard Same as rNormalGLM: if `est_total_sec > 300`, prompt in interactive sessions. ## 7. EnvelopeDispersionBuild Pilot (Envelope Construction) The RSS and UB2 steps in `EnvelopeDispersionBuild` use closed-form bounds (`bound_rss_over_dispersion` and `bound_ub2_over_dispersion`) and no longer perform RSS/UB2 minimization or pilot timing blocks. ### 7.1 Five-Minute Safeguard Since minimization/pilots are disabled, the time-estimate safeguard is not used. ## 8. Choosing Serial vs Parallel | Scenario | Recommendation | |----------|----------------| | **Debugging or small runs** | `use_parallel = FALSE` – serial path is interruptible | | **Large production runs** | `use_parallel = TRUE` – faster; rely on the pilot and prompt | | **Scripts / non-interactive** | No prompt; runs proceed automatically | To disable parallel sampling: ```{r eval=FALSE} glmb(formula, data, family, pfamily, n = 5000, use_parallel = FALSE) rglmb(n = 5000, y, x, family, pfamily, use_parallel = FALSE) ``` Serial sampling is slower but can be interrupted. Parallel sampling is faster but non-interruptible once the main loop starts. ## 9. Technical Details ### 9.1 RcppParallel Worker Pattern The parallel samplers use the standard RcppParallel `Worker` pattern: - A struct inheriting from `RcppParallel::Worker` - `operator()(size_t begin, size_t end)` processes the range `[begin, end)` - Shared outputs use `RcppParallel::RVector` or `RcppParallel::RMatrix` for thread-safe writes - Each thread has its own RNG state to avoid contention ### 9.2 Thread Safety - RNG: Thread-local state; no sharing across workers - Callbacks: The negative log-posterior and gradient functions (`f2`, `f3`) are called from workers; a mutex protects R callback invocations where needed - Output: Write-only ranges per worker; no overlapping writes ### 9.3 Dependencies - **RcppParallel** – Provides TBB and the `parallelFor` interface - **TBB** – Bundled with RcppParallel; no separate installation required ## 10. Troubleshooting | Symptom | Likely cause | Suggestion | |---------|--------------|------------| | Sampler appears to hang | Long parallel run | Use `verbose = TRUE` to see pilot estimates before the main run | | Zero accepts in pilot | Envelope too loose | Try `use_opencl = TRUE`, larger `n`, different `Gridtype`, or stronger prior | | Prompt never appears | Non-interactive session | Expected; runs proceed automatically in batch/CI | | Want to interrupt a run | Parallel phase has started | Cannot interrupt; use `use_parallel = FALSE` for interruptible runs | ## 11. Cross-References - **Chapter A05** – Simulation pipeline (optimization, standardization, envelope, sampling) - **Chapter A08** – Envelope-related functions - **Chapter A10** – OpenCL acceleration for envelope construction - **`?glmb`**, **`?rglmb`**, **`?lmb`**, **`?rlmb`** – `use_parallel` argument - **NEWS.md** – Parallel sampling and pilot-related changes