---
title: "Fast Principal Component Analysis for Big Data with bigPCAcpp"
author: "Frédéric Bertrand"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Fast Principal Component Analysis for Big Data with bigPCAcpp}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

The **bigPCAcpp** package provides principal component analysis (PCA) routines
that operate directly on [`bigmemory::big.matrix`][bigmemory] objects.
This vignette walks through a complete analysis workflow and compares the
results against the reference implementation from base R's
[`prcomp()`][prcomp] to demonstrate numerical agreement.

[bigmemory]: https://cran.r-project.org/package=bigmemory
[prcomp]: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html

We will use the classic `iris` measurement data as a small, in-memory example.
Even for larger data sets stored on disk, the workflow is identical once a
`big.matrix` descriptor has been created.

## Preparing a `big.matrix`

```{r bigmatrix-setup}
library(bigmemory)
library(bigPCAcpp)

iris_mat <- as.matrix(iris[, 1:4])
big_iris <- as.big.matrix(iris_mat, type = "double")
```

Every **bigPCAcpp** entry point accepts the `big.matrix` object directly (and, for
compatibility, still works with external pointers via the `@address` slot),
allowing analyses without copying data into regular R matrices.

## Running PCA with bigPCAcpp

```{r pca-run}
big_pca <- pca_bigmatrix(
  xpMat = big_iris,
  center = TRUE,
  scale = TRUE,
  ncomp = 4L,
  block_size = 128L
)
str(big_pca)
```

The returned list mirrors the structure of a `prcomp` object: singular values
(`sdev`), rotation matrix (`rotation`), optional centering and scaling vectors,
and additional diagnostics including the covariance matrix and explained
variance ratios.

## Comparing against `prcomp`

```{r prcomp-compare}
base_pca <- prcomp(iris_mat, center = TRUE, scale. = TRUE)

align_columns <- function(reference, target) {
  aligned <- target
  cols <- min(ncol(reference), ncol(target))
  for (j in seq_len(cols)) {
    ref <- reference[, j]
    tgt <- target[, j]
    if (sum((ref - tgt)^2) > sum((ref + tgt)^2)) {
      aligned[, j] <- -tgt
    }
  }
  aligned
}

rotation_aligned <- align_columns(base_pca$rotation, big_pca$rotation)
max_rotation_error <- max(abs(rotation_aligned - base_pca$rotation))
max_sdev_error <- max(abs(big_pca$sdev - base_pca$sdev))

big_scores <- pca_scores_bigmatrix(
  xpMat = big_iris,
  rotation = big_pca$rotation,
  center = big_pca$center,
  scale = big_pca$scale,
  block_size = 128L
)

scores_aligned <- align_columns(base_pca$x, big_scores)
max_score_error <- max(abs(scores_aligned - base_pca$x))

c(
  rotation = max_rotation_error,
  sdev = max_sdev_error,
  scores = max_score_error
)
```

The maximum absolute deviations between the base implementation and **bigPCAcpp**
are negligible (on the order of numerical precision), showing that the
out-of-core algorithm faithfully reproduces the same components and scores.

## Variable relationships

The exported helpers expose common PCA diagnostics without requiring the
original data matrix in memory.

```{r diagnostics}
loadings <- pca_variable_loadings(big_pca$rotation, big_pca$sdev)
correlations <- pca_variable_correlations(
  big_pca$rotation,
  big_pca$sdev,
  big_pca$column_sd,
  big_pca$scale
)
contributions <- pca_variable_contributions(loadings)

head(loadings)
head(correlations)
head(contributions)
range(correlations)
```

Loadings match the scaled rotation matrix, correlations honour whether the PCA
was computed on standardised variables and remain bounded between -1 and 1, and
contributions report the relative importance of each variable within a
component.

## Visualising PCA results

The companion plotting helpers make it straightforward to inspect the
components returned by **bigPCAcpp**.

```{r plot-scree, fig.cap="Scree plot of variance explained by each component."}
pca_plot_scree(big_pca)
```

The scree plot summarises how much variance each component explains and makes
it easy to identify natural cutoffs.

```{r plot-scores, fig.cap="Scores for the first two principal components."}
pca_plot_scores(
  big_iris,
  rotation = big_pca$rotation,
  center = big_pca$center,
  scale = big_pca$scale,
  max_points = nrow(big_iris),
  sample = "head"
)
```

Score plots provide a quick way to compare sample relationships using the
requested principal components without materialising the full score matrix.

```{r plot-correlation, fig.cap="Correlation circle highlighting how variables align with the first two components."}
pca_plot_correlation_circle(
  correlations,
  components = c(1L, 2L)
)
```

Correlation circles visualise how each variable aligns with the chosen
components, making it easy to spot groups of features that contribute in a
similar direction.

```{r plot-biplot, fig.cap="Biplot combining sample scores and variable loadings."}
pca_plot_biplot(
  big_scores,
  loadings,
  components = c(1L, 2L)
)
```

The biplot overlays sample scores with the variable loadings, providing a joint
view of how observations cluster along the selected components and which
features drive those separations.

## Singular value decomposition helpers

The package also exposes building blocks for singular value decomposition
(SVD) that operate directly on `big.matrix` instances, which can be useful for
custom pipelines that only need the singular vectors or values.

```{r svd-example}
svd_res <- svd_bigmatrix(big_iris, nu = 2L, nv = 2L, block_size = 128L)
svd_res$d
```

The `svd_bigmatrix()` helper mirrors base R's [`svd()`][svd] but streams data
in manageable blocks, making it practical for large file-backed matrices.

[svd]: https://stat.ethz.ch/R-manual/R-devel/library/base/html/svd.html

## Robust PCA and SVD

When data contain outliers, the robust variants supplied by **bigPCAcpp** help
stabilise the recovered components by down-weighting leverage points.

```{r robust-pca}
robust_pca <- pca_robust(iris_mat, ncomp = 4L)
robust_pca$sdev
robust_pca$robust_weights[1:10]
```

The robust PCA interface accepts regular dense matrices, computes robust
estimates of location and scale, and returns component scores along with the
per-row weights and iteration counts used by the re-weighted SVD.

The underlying solver is also exposed directly for bespoke workflows that only
need a robust singular value decomposition.

```{r robust-svd}
robust_svd <- svd_robust(iris_mat, ncomp = 3L)
robust_svd$d
robust_svd$weights[1:10]
```

The robust SVD returns singular values, left singular vectors, and the weights
assigned to each observation, which can be combined with classical SVD results
to assess the influence of individual rows.

## Next steps for larger data

For on-disk matrices created with `filebacked.big.matrix()`, pass the
descriptor pointer to `pca_bigmatrix()` and the algorithm will stream data in
blocks, keeping memory usage bounded. Component scores can likewise be
generated in batches using `pca_scores_stream_bigmatrix()`.

```{r filebacked-example}
library(bigmemory)
library(bigPCAcpp)

path <- tempfile(fileext = ".bin")
desc <- paste0(path, ".desc")

bm <- filebacked.big.matrix(
  nrow = nrow(iris_mat),
  ncol = ncol(iris_mat),
  type = "double",
  backingfile = basename(path),
  backingpath = dirname(path),
  descriptorfile = basename(desc)
)

bm[,] <- iris_mat

pca <- pca_bigmatrix(bm, center = TRUE, scale = TRUE, ncomp = 4)
scores <- filebacked.big.matrix(
  nrow = nrow(bm),
  ncol = ncol(pca$rotation),
  type = "double",
  backingfile = "scores.bin",
  backingpath = dirname(path),
  descriptorfile = "scores.desc"
)

pca_scores_stream_bigmatrix(
  bm,
  scores,
  pca$rotation,
  center = pca$center,
  scale = pca$scale
)
```

Once the scores are stored on disk, they can be sampled and plotted just like
the in-memory workflow:

```{r filebacked-plot, fig.cap="Scores streamed from a file-backed big.matrix."}
pca_plot_scores(
  bm,
  rotation = pca$rotation,
  center = pca$center,
  scale = pca$scale,
  components = c(1L, 2L),
  max_points = nrow(bm),
  sample = "head"
)
```

```{r filebacked-example-2, eval = FALSE}
library(bigmemory)
library(bigPCAcpp)

path <- tempfile(fileext = ".bin")
desc <- paste0(path, ".desc")

bm <- filebacked.big.matrix(
  nrow = 5000,
  ncol = 50,
  type = "double",
  backingfile = basename(path),
  backingpath = dirname(path),
  descriptorfile = basename(desc)
)

pca <- pca_bigmatrix(bm, center = TRUE, scale = TRUE, ncomp = 5)
scores <- filebacked.big.matrix(
  nrow = nrow(bm),
  ncol = ncol(pca$rotation),
  type = "double",
  backingfile = "scores.bin",
  backingpath = dirname(path),
  descriptorfile = "scores.desc"
)

pca_scores_stream_bigmatrix(
  bm,
  scores,
  pca$rotation,
  center = pca$center,
  scale = pca$scale
)
```

With these building blocks, **bigPCAcpp** enables analyses that match the accuracy
of in-memory PCA workflows while scaling to data sets that exceed RAM.