--- title: "RKHS-based Algorithms in bigPLSR" shorttitle: "RKHS-based Algorithms" author: - name: "Frédéric Bertrand" affiliation: - Cedric, Cnam, Paris email: frederic.bertrand@lecnam.net date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{RKHS-based Algorithms in bigPLSR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup_ops, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "figures/rkhs-", fig.width = 7, fig.height = 5, dpi = 150, message = FALSE, warning = FALSE ) LOCAL <- identical(Sys.getenv("LOCAL"), "TRUE") set.seed(2025) ``` ## Overview bigPLSR implements two kernel-based partial least squares solvers: - `algorithm = "rkhs"` (Rosipal & Trejo style) projects only the predictor matrix \(X\) into an RKHS; - `algorithm = "rkhs_xy"` projects both \(X\) and the response matrix \(Y\) into (possibly different) RKHSs and couples the latent scores through a regularised cross-covariance operator. Both solvers are available for dense matrices and for `bigmemory::big.matrix` objects. The big-memory paths stream kernel blocks and persist centering statistics so predictions remain cheap. ## Dense example ```{r dense-example} library(bigPLSR) set.seed(42) n <- 120; p <- 8; m <- 2 X <- matrix(rnorm(n * p), n, p) Y <- cbind( sin(X[, 1]) + 0.3 * X[, 2]^2 + rnorm(n, sd = 0.1), cos(X[, 3]) - 0.2 * X[, 4] + rnorm(n, sd = 0.1) ) fit_rkhs <- pls_fit(X, Y, ncomp = 3, algorithm = "rkhs", kernel = "rbf", gamma = 1 / p, scores = "r") options(bigPLSR.rkhs_xy.lambda_x = 1e-6) options(bigPLSR.rkhs_xy.lambda_y = 1e-6) fit_rkhs_xy <- pls_fit(X, Y, ncomp = 3, algorithm = "rkhs_xy", kernel = "rbf", gamma = 1 / p, scores = "none") head(predict(fit_rkhs, X)) head(predict(fit_rkhs_xy, X)) ``` Both fits run in well under five seconds for this moderately sized example. The RKHS-XY variant stores kernel centering statistics for both sides so that `predict()` can re-use them without recomputing the entire Gram matrix. ## Streaming example ```{r, eval=FALSE} library(bigmemory) Xbm <- as.big.matrix(X) Ybm <- as.big.matrix(Y) fit_stream <- pls_fit(Xbm, Ybm, ncomp = 3, backend = "bigmem", algorithm = "rkhs", kernel = "rbf", gamma = 1 / p, chunk_size = 1024L, scores = "none") ``` The streaming call attaches training descriptors (`$X_ref`) and kernel centering summaries (`$kstats`) automatically. When `predict()` is invoked on new data with `Xtrain = fit_stream$X_ref`, the package streams the cross-kernel blocks and avoids materialising the full \(n_\text{new} \times n_\text{train}\) Gram matrix. ## Logistic response Kernel logistic PLS (`algorithm = "klogitpls"`) builds on the RKHS infrastructure. After extracting latent scores from the centered Gram matrix the algorithm runs a logistic IRLS procedure in score space with support for class weighting and optional alternating score updates. Small datasets (hundreds of observations) remain well within the five-second budget. ```{r, eval=FALSE} y <- as.integer(X[, 1]^2 + X[, 2]^2 + rnorm(n, sd = 0.2) > 1) fit_logit <- pls_fit(X, y, ncomp = 2, algorithm = "klogitpls", kernel = "rbf", gamma = 1 / p) mean(predict(fit_logit, X)) ```