--- title: "Benchmarking PLS1 Implementations" shorttitle: "Benchmarking PLS1 Implementations" author: - name: "Frédéric Bertrand" affiliation: - Cedric, Cnam, Paris email: frederic.bertrand@lecnam.net date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Benchmarking PLS1 Implementations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup_ops, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "figures/benchmarking-pls1-", fig.width = 7, fig.height = 5, dpi = 150, message = FALSE, warning = FALSE ) LOCAL <- identical(Sys.getenv("LOCAL"), "TRUE") set.seed(2025) ``` ```{r setup, message=FALSE} library(bigPLSR) library(bigmemory) library(bench) set.seed(123) ``` ## Overview The unified `pls_fit()` interface now drives both the dense and streaming implementations of single-response partial least squares regression. This vignette revisits the benchmarking workflow with the modern API and introduces two complementary perspectives: 1. **Internal comparisons** that contrast the dense (in-memory) and streaming (big-memory) backends of `pls_fit()`. 2. **External references** recorded against popular packages such as `pls` and `mixOmics`. These results are stored in the package to keep the vignette lightweight while still documenting performance relative to the wider ecosystem. The chunks tagged with `eval = LOCAL` are only executed when the environment variable `LOCAL` is set to `TRUE`, allowing CRAN checks to skip the more time-consuming benchmarks. ## Simulated data We create a synthetic regression problem with a modest latent structure and keep both dense and big-memory versions of the predictors and response so they can be reused in the benchmarking chunks. Here is an example with `n=4000` en `p=50` ```{r data-generation} n <- 1500 p <- 80 ncomp <- 6 X <- bigmemory::big.matrix(nrow = n, ncol = p, type = "double") X[,] <- matrix(rnorm(n * p), nrow = n) y_vec <- scale(X[,] %*% rnorm(p) + rnorm(n)) y <- bigmemory::big.matrix(nrow = n, ncol = 1, type = "double") y[,] <- y_vec X[1:6, 1:6] y[1:6,] ``` ## Internal benchmarks The following chunk compares dense vs. streaming fits for both SIMPLS and NIPALS. The dense backend receives base R matrices, while the streaming backend consumes the `big.matrix` objects directly. 
```{r internal-benchmark, eval=LOCAL, cache=TRUE}
internal_bench <- bench::mark(
  dense_simpls = pls_fit(as.matrix(X[]), y_vec, ncomp = ncomp,
                         backend = "arma", algorithm = "simpls"),
  streaming_simpls = pls_fit(X, y, ncomp = ncomp, backend = "bigmem",
                             algorithm = "simpls", chunk_size = 512L),
  dense_nipals = pls_fit(as.matrix(X[]), y_vec, ncomp = ncomp,
                         backend = "arma", algorithm = "nipals"),
  streaming_nipals = pls_fit(X, y, ncomp = ncomp, backend = "bigmem",
                             algorithm = "nipals", chunk_size = 512L),
  dense_kernelpls = pls_fit(as.matrix(X[]), y_vec, ncomp = ncomp,
                            backend = "arma", algorithm = "kernelpls"),
  streaming_kernelpls = pls_fit(X, y, ncomp = ncomp, backend = "bigmem",
                                algorithm = "kernelpls", chunk_size = 512L),
  dense_widekernelpls = pls_fit(as.matrix(X[]), y_vec, ncomp = ncomp,
                                backend = "arma", algorithm = "widekernelpls"),
  streaming_widekernelpls = pls_fit(X, y, ncomp = ncomp, backend = "bigmem",
                                    algorithm = "widekernelpls", chunk_size = 512L),
  iterations = 20,
  check = FALSE
)

internal_bench_res <- internal_bench[, 2:5]
internal_bench_res <- as.matrix(internal_bench_res)
rownames(internal_bench_res) <- names(internal_bench$expression)
```

```{r internal-benchmark-plot, eval=LOCAL, cache=TRUE}
dotchart(internal_bench_res[, 2], labels = rownames(internal_bench_res),
         xlab = "median_time_s")
dotchart(internal_bench_res[, 3], labels = rownames(internal_bench_res),
         xlab = "itr_per_sec")
dotchart(internal_bench_res[, 4], labels = rownames(internal_bench_res),
         xlab = "mem_alloc_bytes")
```

The results highlight the trade-off between throughput and memory usage: SIMPLS shines on dense matrices, whereas the streaming backend scales to larger-than-memory inputs thanks to block processing.

## External references

To avoid heavy dependencies at build time, we ship a pre-computed benchmark dataset that contrasts `bigPLSR` with implementations from the `pls` and `mixOmics` packages. The dataset was generated with the helper script stored in `inst/scripts/external_pls_benchmarks.R`.
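If you wish to regenerate the reference measurements, the helper script mentioned above can be located in an installed copy of the package (assuming `inst/scripts/` is installed as `scripts/`); the snippet below is shown for convenience and is not evaluated here.

```{r locate-benchmark-script, eval=FALSE}
# Locate the benchmark-generation script shipped with the package;
# system.file() returns an empty string if the file is not installed.
script_path <- system.file("scripts", "external_pls_benchmarks.R",
                           package = "bigPLSR")
script_path
```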
```{r external-benchmark}
data("external_pls_benchmarks", package = "bigPLSR")

sub_pls1 <- subset(external_pls_benchmarks,
                   task == "pls1" & algorithm != "widekernelpls")
sub_pls1$n <- factor(sub_pls1$n)
sub_pls1$p <- factor(sub_pls1$p)
sub_pls1$q <- factor(sub_pls1$q)
sub_pls1$ncomp <- factor(sub_pls1$ncomp)
replications(~ package + algorithm + task + n + p + ncomp, data = sub_pls1)

sub_pls1_wide <- subset(external_pls_benchmarks,
                        task == "pls1" & algorithm == "widekernelpls")
sub_pls1_wide$n <- factor(sub_pls1_wide$n)
sub_pls1_wide$p <- factor(sub_pls1_wide$p)
sub_pls1_wide$q <- factor(sub_pls1_wide$q)
sub_pls1_wide$ncomp <- factor(sub_pls1_wide$ncomp)
replications(~ package + algorithm + task + n + p + ncomp, data = sub_pls1_wide)

sub_pls2 <- subset(external_pls_benchmarks,
                   task == "pls2" & algorithm != "widekernelpls")
sub_pls2$n <- factor(sub_pls2$n)
sub_pls2$p <- factor(sub_pls2$p)
sub_pls2$q <- factor(sub_pls2$q)
sub_pls2$ncomp <- factor(sub_pls2$ncomp)
replications(~ package + algorithm + task + n + p + ncomp, data = sub_pls2)

sub_pls2_wide <- subset(external_pls_benchmarks,
                        task == "pls2" & algorithm == "widekernelpls")
sub_pls2_wide$n <- factor(sub_pls2_wide$n)
sub_pls2_wide$p <- factor(sub_pls2_wide$p)
sub_pls2_wide$q <- factor(sub_pls2_wide$q)
sub_pls2_wide$ncomp <- factor(sub_pls2_wide$ncomp)
replications(~ package + algorithm + task + n + p + ncomp, data = sub_pls2_wide)
```

```{r external-sample-result}
sub_pls1
```

The table reports the median execution time (in seconds), the number of iterations per second, and the memory allocated for a representative single-response scenario. The notes column indicates the additional packages required to reproduce those measurements.

## Takeaways

**Dense vs. streaming backends.** On small and medium data that fits in RAM, in-memory implementations (e.g., `pls`) are typically the fastest (median ≈ 0.36 s for SIMPLS in our runs). However, they materialize large cross-product or Gram matrices, so memory grows as O(p^2) (or O(n^2) in kernel formulations). In contrast, `bigPLSR`'s streaming big-memory backend keeps memory bounded through chunked BLAS operations and never forms those intermediates in full. In our PLS2 benchmark, streaming used roughly 7–8 times less RAM than `pls` (≈ 89 MB vs ≈ 732 MB median) at the price of a longer runtime (≈ 3.5 s vs ≈ 0.36 s). PLS1 shows the same pattern: streaming is often fast enough while dramatically reducing memory. As n or p grow, the streaming backend keeps scaling where dense approaches become memory-limited.
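To make the bounded-memory argument concrete, here is an illustrative sketch (not the package's internal code) of the block-wise processing a streaming backend relies on. It reuses the `big.matrix` `X` and the dimensions defined earlier and computes a score vector t = X w one block of rows at a time, so only about `chunk_size` rows are resident in RAM at any point; the weight vector `w` is arbitrary and stands in for whatever a PLS iteration would supply.

```{r streaming-sketch}
# Illustration only: chunked computation of t = X %*% w for a big.matrix X.
chunk_size <- 512L
w <- rnorm(p)                      # placeholder weight vector
t_scores <- numeric(n)
for (start in seq(1L, n, by = chunk_size)) {
  idx <- start:min(start + chunk_size - 1L, n)
  # Only this block of rows is pulled into memory.
  t_scores[idx] <- drop(X[idx, ] %*% w)
}
head(t_scores)
```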