--- title: "Choosing an Initialization Method for Archetypal Analysis" bibliography: references.bib output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Choosing an Initialization Method for Archetypal Analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5.5, fig.align = "center", warning = FALSE ) set.seed(42) par(bty = "n") library(yaap) ``` > This vignette assumes familiarity with the basic AA workflow covered in the **Introduction to Archetypal Analysis with yaap** vignette. ## Quick Start For a first Euclidean AA fit, start with `"furthest_sum"` or `"kmeans_pp"`. `"furthest_sum"` is a strong boundary-seeking default for convex data, while `"kmeans_pp"` is a softer alternative when you want some randomness without falling back to uniform sampling. If the geometry is unknown and you can afford several starts, use `"random"` with a larger `nrep`. If you only have a few starts and runtime is less important, use `"aa_pp"` because each added point is chosen against the current AA reconstruction error. ```{r quick-recommendation, eval = FALSE} library(yaap) toy <- read.csv(system.file("extdata", "toy.csv", package = "yaap")) fit <- run_aa(toy, K = 3, nrep = 5, init = "furthest_sum") fit <- run_aa(toy, K = 3, nrep = 5, init = "kmeans_pp") fit <- run_aa(toy, K = 3, nrep = 20, init = "random") fit <- run_aa(toy, K = 3, nrep = 3, init = "aa_pp") ``` ## Background Archetypal analysis (AA) minimizes a reconstruction loss, written in the standard Euclidean formulation as $$\mathcal{L}(X, SA) = \|X - S A\|_F^2.$$ As with many matrix decomposition problems, the objective is non-convex in both $S$ and $A$, and gradient descent can terminate at a **local minimum**. As a result, the quality of the final solution depends substantially on the starting point. yaap implements several options for initializing the archetype coordinates, which have been suggested in the literature, through the `aa_init()` function and the `init` argument. ## Methods at a Glance The table below summarizes the seven methods by their strategy, whether they involve random sampling, their relative speed, how strongly they depend on the observed sampling density, how strongly they impose geometric assumptions, and any mandatory extra arguments. | Method | Strategy | Stochastic? | Speed | Density sensitivity | Geometry sensitivity | Mandatory extras | | ---------------------- | ---------------------------------------------------------------------------------- | ---------------- | ------ | ------------------- | -------------------- | :--------------: | | `"random"` | Sample $K$ rows uniformly at random | Yes | Fast | High | Low | — | | `"dirichlet"` | Construct each archetype as a random convex combination of rows | Yes | Fast | High | Low | — | | `"kmeans_pp"` | Sample proportional to squared distance to the nearest current archetype | Yes | Medium | Medium | Medium | — | | `"aa_pp"` | Sample using residual reconstruction error from the current hull | Yes | Slow | Medium | Medium | — | | `"furthest_first"` | Select the point furthest from all current archetypes | Seed only | Medium | Low-medium | High | — | | `"furthest_sum"` | Select the point that maximizes the *sum* of distances to all current archetypes | Seed only | Medium | Low | High | — | | `"hull_outmost"` | Select frequently outlying convex-hull candidates | Partitioned only | Medium | Low | High | `hull_method` | Density and geometry are counterbalancing sensitivities. Uniform methods such as `random` and diffuse `dirichlet` starts are strongly affected by where the data are numerous: a sparsely sampled corner may need many restarts before it is selected. Their advantage is that independent restarts genuinely explore different parts of the empirical distribution, which can help avoid repeatedly converging to the same local minimum when many fits are affordable. More biased boundary-seeking methods such as `furthest_sum` are less affected by interior density because, once a corner is represented by at least one observed point, the greedy distance criterion can still discover it. The cost of that bias is geometric sensitivity and lower restart diversity: repeated starts often return the same, or nearly the same, boundary configuration. These methods therefore make sense when the assumed geometry is credible or when running many full AA optimizations is too expensive. Large-data approximations are controlled orthogonally with `batch_size`, `batch_type`, and `batch_replace`. By default, `batch_type = "distal"` samples candidate rows with probability proportional to squared distance from the data center, preserving the coreset idea of focusing computation on likely boundary points [@Mair2019]. Use `batch_type = "uniform"` when you want an unbiased mini-batch approximation, for example with `method = "aa_pp"`. ## The `aa_init()` Function `aa_init()` runs any initialization method as a standalone step, returning a named list with two elements: * **`A`** — $K \times M$ matrix of initial archetype coordinates. * **`B`** — $K \times N$ row-stochastic matrix expressing each archetype as a convex combination of data points ($A = B X$). ```{r aa-init-basic} toy <- read.csv(system.file("extdata", "toy.csv", package = "yaap")) X <- as.matrix(toy) # 250 × 2 K <- 3 init <- aa_init(X, K = K, method = "furthest_sum") str(init) ``` The `A` matrix provides the initial archetype coordinates that will seed the optimization loop; `B` is the convex-hull certificate for those coordinates. If `run_aa()` is initialized with `aa_init()` and a method name or a custom function, the `B` returned is carried into the algorithm initialization. If you instead pass a pre-computed coordinate matrix such as `init$A`, `run_aa()` checks that the coordinates still admit a feasible `B`. Rows outside the allowed data hull are projected back into the hull with a warning. ```{r initialization-helpers, include = FALSE} draw_simplex <- function(A, border, fill = NULL, lty = 1, lwd = 2, pch = 17, cex = 1.1) { if (nrow(A) >= 3) { hull <- chull(A) polygon(A[hull, , drop = FALSE], border = border, lwd = lwd, lty = lty, col = fill ) } else if (nrow(A) == 2) { lines(A, col = border, lwd = lwd, lty = lty) } points(A, pch = pch, cex = cex, col = border) } selection_deviation_score <- function(counts, K, n, pseudocount = 0.5) { total_counts <- sum(counts) N_runs <- total_counts / K expected <- total_counts / n p_null <- K / n log_fc <- log2((counts + pseudocount) / (expected + pseudocount)) p_lower <- pbinom(counts, size = N_runs, prob = p_null) p_upper <- pbinom(counts - 1, size = N_runs, prob = p_null, lower.tail = FALSE ) p_value <- pmin(1, 2 * pmin(p_lower, p_upper)) signed_log_p <- sign(counts - expected) * -log10(pmax(p_value, .Machine$double.xmin)) list(log_fc = log_fc, p_value = p_value, signed_log_p = signed_log_p) } plot_selected_density <- function(X, counts, main, K, color_max = NULL, pseudocount = 0.5) { score <- selection_deviation_score(counts, K, nrow(X), pseudocount) signed_log_p <- score$signed_log_p pal <- colorRamp(c("blue", "white", "red")) cmax <- color_max if (is.null(cmax)) cmax <- max(abs(signed_log_p)) probs <- if (cmax > 0) { pmax(0, pmin(1, (signed_log_p + cmax) / (2 * cmax))) } else { rep(0.5, length(signed_log_p)) } point_cols <- pal(probs) / 255 point_cols <- rgb(point_cols[, 1], point_cols[, 2], point_cols[, 3]) ix <- order(abs(signed_log_p)) plot( X[ix, ], pch = 21, cex = 1, col = "gray50", bg = point_cols[ix], main = main, xlab = "", ylab = "", axes = FALSE, asp = 1 ) } plot_data_with_points <- function(X, A, main, col = "grey75", view = c(1, 2), pch = 4, point_col = "black") { plot( X[, view, drop = FALSE], pch = 16, cex = 0.45, col = col, main = main, xlab = "", ylab = "", axes = FALSE, asp = 1 ) points(A[, view, drop = FALSE], pch = pch, cex = 1.25, lwd = 1.6, col = point_col ) } make_concentric_circles <- function(n = 180, noise = 0.03) { n_inner <- n %/% 2 n_outer <- n - n_inner theta <- c(runif(n_inner, 0, 2 * pi), runif(n_outer, 0, 2 * pi)) radius <- c(rep(0.4, n_inner), rep(1.0, n_outer)) X <- cbind(radius * cos(theta), radius * sin(theta)) X <- X + rnorm(length(X), sd = noise) colnames(X) <- c("x", "y") list( X = X, view = c(1, 2), col = c(rep("#0072B2", n_inner), rep("#D55E00", n_outer)), main = "Concentric circles", kernel = "laplace", kernel_args = list(sigma = 0.25) ) } make_swiss_roll <- function(n = 180, noise = 0.025) { t <- runif(n, 1.5 * pi, 4.5 * pi) h <- runif(n, -1, 1) X <- cbind(t * cos(t) / 7, h, t * sin(t) / 7) X <- X + rnorm(length(X), sd = noise) colnames(X) <- c("x", "height", "z") roll_cols <- colorRampPalette(c("#0072B2", "#009E73", "#D55E00"))(100) col_id <- pmax(1, pmin(100, ceiling(99 * (t - min(t)) / diff(range(t))) + 1)) list( X = X, view = c(1, 3), col = roll_cols[col_id], main = "Swiss roll", kernel = "rbf", kernel_args = list(sigma = 0.45) ) } ``` ### Supplying Your Own Initialization The built-in method names cover the common cases, but `run_aa()` also accepts either a precomputed matrix of archetype coordinates or a custom initialization function. If the starting coordinates are already available, pass them directly as a matrix in the same units as the `X`. `run_aa()` will preprocess it the same way it preprocesses `X`, handles the required checks, and constructs `B`. ```{r precomputed-init-matrix} X <- as.matrix(toy) K <- 3 tol <- 0.01 eps <- 0 A0 <- X[1:K, , drop = FALSE] fit_matrix <- run_aa(X, K = K, init = A0, scale = FALSE, tol = tol, eps = eps) ``` A custom function is useful when initialization itself needs code. The function must take `X` and `K`, and return the same two objects as `aa_init()`: the archetype coordinates `A` and the data weights `B`. Additional arguments can be passed through `init_args`. Below is a deliberately simple initializer that chooses the first `K` rows of the data. ```{r custom-init-function} X <- as.matrix(toy) K <- 3 tol <- 0.01 first_k_init <- function(X, K, ...) { A <- X[1:K, , drop = FALSE] B <- diag(1, nrow = K, ncol = nrow(X)) list(A = A, B = B) } fit_custom <- run_aa(X, K = K, init = first_k_init, scale = FALSE, tol = tol) ``` If your custom initializer naturally produces only archetype coordinates, `fit_simplex()` exposes the same simplex mapping used internally to construct a feasible `B`. Reconstructing `A_init` as `B %*% X` ensures the coordinates passed to AA are on the data hull: ```{r custom-init-fit-simplex, eval = FALSE} X <- as.matrix(toy) K <- 3 A_user <- some_initializer(X, K = K) B <- fit_simplex(A_user, X) A_init <- B %*% X init <- list(A = A_init, B = B) ``` These two forms are useful when the starting point comes from a previous analysis, a domain-specific rule, or an external clustering workflow. For most workflows, passing a method name remains the shortest form. ```{r run-aa-init-short, eval = FALSE} X <- as.matrix(toy) K <- 3 tol <- 0.01 fit <- run_aa(X, K = K, init = "furthest_sum", tol = tol) fit <- run_aa(X, K = K, init = "aa_pp", tol = tol) fit <- run_aa(X, K = K, init = "aa_pp", init_args = list(batch_size = 200, batch_type = "uniform"), tol = tol ) ``` ## Comparing Initializer Biases ```{r init-selection-density} X <- as.matrix(toy) K <- 3 method_labels <- c( random = "Random", kmeans_pp = "k-means++", aa_pp = "AA++", furthest_first = "Furthest First", furthest_sum = "Furthest Sum", hull_outmost = "Hull-outmost" ) N_expected <- 10 # Expected selections per point across all runs N_runs <- nrow(X) * N_expected # "significance" cutoff for coloring points red/blue color_cutoff <- -log10(1 / nrow(X)) selected_counts <- list() for (method in names(method_labels)) { counts <- rep(0, nrow(X)) for (run in 1:N_runs) { if (method == "hull_outmost") { init_obj <- aa_init(X, K = K, method = method, hull_method = "partitioned") } else { init_obj <- aa_init(X, K = K, method = method) } selected <- which(init_obj$B > 0, arr.ind = TRUE)[, 2] counts[selected] <- counts[selected] + 1 } selected_counts[[method]] <- counts } ``` The first practical difference between initializers is not whether they are "good" or "bad"; it is where they tend to look. The toy data below are simple: 250 two-dimensional points spread near the vertices of a triangle. The figure below repeats each row-selecting initializer many times and colors each point by how often it was selected compared with uniform random sampling. Red points are selected more often than random, blue points are selected less often, and white points are close to the random baseline. ```{r init-selection-density-plot, fig.width = 8, fig.height = 7, out.width="100%", results='hold', echo=FALSE} op <- par(mfrow = c(2, 3), mar = c(2, 2, 3, 0.5), bty = "n") for (method in names(method_labels)) { plot_selected_density( X, selected_counts[[method]], main = method_labels[method], K = K, color_max = color_cutoff ) } par(op) ``` As expected, `random` selects archetypes uniformly from the observed rows, so the points do not diverge much from their null probability of selection. This is probably not the best strategy for a convex geometry like the toy data, but it is unbiased and thus robust to unknown data structure. With enough restarts this strategy will eventually explore every region, but any single run may miss the corners. On the other extreme, methods like `furthest_first`, `furthest_sum`, and `hull_outmost` are exploring more aggressively the boundary of the data geometry. We see that most of them converge to the same set of candidate points and never pick any point in the middle of the cloud. `furthest_first` is the only one that shows some variation in the selected points, but it still has a strong corner preference. For a simple convex shape that bias is often useful: the corners are exactly where good archetypes should start. Clustering-inspired methods such as `kmeans_pp` and `aa_pp` sit between these two extremes. These methods adapt ideas developed to initialize clustering algorithms to the AA setting. They are more probabilistic in nature and thus represent softer versions of their boundary-searching counterparts. They also rarely select any point in the center of the cloud, but they explore boundary regions more diffusely and do not explicitly maximize extremeness. More details about the methodology employed by each method can be found in the [Method Reference](#method-reference) section at the end of this vignette. The rest of the vignette narrows to three representatives: `random` as an exploratory baseline, `kmeans_pp` as the softer outward-biased initializer, and `furthest_sum` for the strong greedy boundary bias. ## Increasing the Number of Archetypes With $K = 3$, the toy data roughly match the three-corner shape. When $K$ increases, the question changes from "did we find the corners?" to "how do the extra archetypes cover the cloud?" The following chunk is the user-facing call: only the `method` and `K` values change. ```{r increasing-k} X <- as.matrix(toy) k_grid <- c(3, 7, 15) k_methods <- c("random", "kmeans_pp", "furthest_sum") k_inits <- list() for (method in k_methods) { method_inits <- list() for (K in k_grid) { k_lab <- paste("K =", K) method_inits[[k_lab]] <- aa_init(X, K = K, method = method) } k_inits[[method]] <- method_inits } ``` ```{r increasing-k-plot, fig.width = 8.5, fig.height = 6, out.width="100%", results='hold', echo=FALSE} op <- par( mfrow = c(length(k_methods), length(k_grid)), mar = c(0.5, 0.5, 2.2, 0.2), oma = c(0, 0, 0, 0), bty = "n" ) for (method in k_methods) { for (k_label in names(k_inits[[method]])) { init_obj <- k_inits[[method]][[k_label]] plot( X, pch = 16, cex = 0.42, col = "grey75", main = sprintf("%s\n%s", method_labels[method], k_label), xlab = "", ylab = "", axes = FALSE, asp = 1 ) draw_simplex( init_obj$A, border = "firebrick3", fill = adjustcolor("firebrick3", 0.05), cex = 0.95 ) } } par(op) ``` As $K$ grows, the non-aggressive methods start to waste extra archetypes in the center of the convex hull. This is visible for `random`, and eventually also for `kmeans_pp`: once the main corners have been sampled, additional points are often placed in the interior regions where there are definitely no archetypes for this triangle. Those starts can still optimize successfully, but the initialization budget is no longer being used to approximate the boundary. `furthest_sum` behaves differently. It keeps adding points on the outside of the cloud and, as $K$ increases, the selected polygon becomes a finer approximation to the true convex hull boundary. In fact, points generated by `furthest_sum` have a useful geometric property: they lie in the minimal convex set of the unselected data points [Theorem 2 in [@Morup2012]]. ## When Initialization Matters The toy triangle is intentionally simple. If we start from one initialization and then let `run_aa()` optimize, most methods reach essentially the same final fit. The red dashed triangle below is the starting point; the blue solid triangle is the fitted solution. ```{r init-before-after-fit} X <- as.matrix(toy) K <- 3 focused_methods <- c("random", "kmeans_pp", "furthest_sum") fits <- list() for (method in focused_methods) { fits[[method]] <- run_aa( x = X, K = K, init = method, scale = FALSE, max_iter = 60, tol = 0.01 ) } ``` ```{r init-before-after-plot, echo=FALSE, fig.width = 8.5, fig.height = 6, out.width="100%", results='hold'} loss_values <- c() for (method in focused_methods) { loss_values <- c(loss_values, fits[[method]]$loss$loss) } loss_ylim <- range(loss_values) op <- par(mfrow = c(2, 3), bty = "n") par(mar = c(0.8, 0.6, 3, 0.5)) for (method in focused_methods) { fit <- fits[[method]] plot(fit, what = "coordinates", show_anames = FALSE, main = sprintf("%s\n%d steps", method_labels[method], nrow(fit$loss) - 1), args.data.scatter = list(pch = 16, cex = 0.45, col = "grey75"), pch = 19, col = "steelblue4", axes = FALSE, asp = 1 ) draw_simplex( fit$init, border = adjustcolor("firebrick3", 0.75), fill = adjustcolor("firebrick3", 0.05), lty = 2 ) } par(mar = c(2.8, 4.2, 3, 0.5)) for (method in focused_methods) { plot(fits[[method]], what = "loss", lwd = 1.8, col = "steelblue4", ylim = loss_ylim, main = "Loss", ylab = "", las = 1 ) } par(op) ``` This is the reassuring case: the data geometry is simple enough that a mediocre start can still be corrected by the optimizer. Initialization matters more when the geometry is less well described by straight-line Euclidean distances. ## Nonlinear Geometries The previous examples still live in a simple Euclidean geometry: distances in the plotted coordinates describe the structure we care about. Some datasets are not like that. In concentric circles, the inner circle is important even though the outer circle contains the most distant points. In a Swiss roll, points that look close in a two-dimensional projection may be far apart along the rolled surface. ```{r nonlinear-init} K <- 8 nonlinear_data <- list( circles = make_concentric_circles(), swiss_roll = make_swiss_roll() ) nonlinear_methods <- c("random", "kmeans_pp", "furthest_sum") nonlinear_inits <- list() for (data_shape in names(nonlinear_data)) { data_obj <- nonlinear_data[[data_shape]] out <- list() for (method in nonlinear_methods) { out[[method]] <- aa_init(data_obj$X, K = K, method = method) } nonlinear_inits[[data_shape]] <- out } ``` ```{r nonlinear-init-plot, fig.width = 8.5, fig.height = 6, out.width="100%", results='hold', echo=FALSE} op <- par( mfrow = c(length(nonlinear_data), length(nonlinear_methods)), mar = c(2, 2, 3, 0.5), bty = "n" ) for (data_shape in names(nonlinear_data)) { data_obj <- nonlinear_data[[data_shape]] for (method in nonlinear_methods) { init_obj <- nonlinear_inits[[data_shape]][[method]] plot_data_with_points( data_obj$X, init_obj$A, main = sprintf("%s\n%s", data_obj$main, method_labels[method]), col = adjustcolor(data_obj$col, 0.62), view = data_obj$view ) } } par(op) ``` First inspect the Euclidean initializations. The biased methods are doing what they were designed to do: they seek far-away points. On these shapes, that can over-emphasize the outside and leave the center or inner structure under-represented. `random`, on the other hand, follows the observed sampling density rather than the outer geometry. On these uniformly sampled shapes, that keeps it from concentrating only on the outside, but it also means performance would change if important regions were sampled sparsely. For these cases, the better answer is often to change the space in which AA is fit. A kernel method lets nearby points influence each other according to a smooth similarity rule rather than only through straight-line distance in the input coordinates. ```{r kernel-user-code, eval=FALSE} fit_kernel <- archetypes_kernel_pgd( x = X, K = K, kernel = "laplace", init = "furthest_sum", tol = 0.01 ) ``` ```{r nonlinear-kernel-setup, include = FALSE, warning=FALSE} kernel_methods <- c("furthest_sum", "kmeans_pp") kernel_inits <- list() for (data_shape in names(nonlinear_data)) { data_obj <- nonlinear_data[[data_shape]] out <- list() for (method in kernel_methods) { kernel_fit <- archetypes_kernel_pgd( data_obj$X, K = K, init = method, kernel = data_obj$kernel, kernel_args = data_obj$kernel_args, max_iter = 1, tol = 0.01, tol_r2 = 1 ) out[[method]] <- list( euclidean = nonlinear_inits[[data_shape]][[method]]$A, kernel = kernel_fit$init %*% data_obj$X ) } kernel_inits[[data_shape]] <- out } ``` ```{r nonlinear-kernel-plot, fig.width = 8.5, fig.height = 6, out.width="100%", results='hold', echo=FALSE} op <- par( mfrow = c(length(nonlinear_data), length(kernel_methods)), mar = c(2, 2, 3, 0.5), bty = "n" ) for (data_shape in names(nonlinear_data)) { data_obj <- nonlinear_data[[data_shape]] for (method in kernel_methods) { init_obj <- kernel_inits[[data_shape]][[method]] plot( data_obj$X[, data_obj$view, drop = FALSE], pch = 16, cex = 0.45, col = adjustcolor(data_obj$col, 0.55), main = sprintf("%s\n%s", data_obj$main, method_labels[method]), xlab = "", ylab = "", axes = FALSE, asp = 1 ) points(init_obj$euclidean[, data_obj$view, drop = FALSE], pch = 4, cex = 1.2, lwd = 1.5, col = "black" ) points(init_obj$kernel[, data_obj$view, drop = FALSE], pch = 1, cex = 1.25, lwd = 1.6, col = "#D55E00" ) } } par(fig = c(0, 1, 0, 1), mar = c(0, 0, 0, 0), new = TRUE) plot.new() legend( "center", legend = c("Euclidean init", "Kernel init"), pch = c(4, 1), col = c("black", "#D55E00"), pt.cex = c(2.1, 2.2), bty = "n", cex = 1.2 ) par(op) ``` The same initialization names still work. The important difference is that the kernel version measures spread in the similarity space used by the model. This can rescue boundary-seeking methods on shapes where the raw coordinates make the outside look more important than the rest of the structure. ## Conclusions and Recommendations There is no universal best initializer. The right choice depends on how much you know about the geometry, whether useful regions are sampled evenly, and how many full optimization restarts you can afford. | Situation | Recommended `init` | | ---------------------------------------------- | ----------------------------------------------------------------------------| | Simple convex-ish data | `"furthest_sum"`, `"kmeans_pp"` | | Uneven sampling density, corners represented | `"furthest_sum"`, `"furthest_first"`, `"hull_outmost"` | | Unknown structure, enough restarts | `"random"` or `"kmeans_pp"` with a larger `nrep` | | Unknown structure, few restarts | `"aa_pp"` | | Very large dataset, speed critical | `"furthest_sum"` with `batch_size` or `"hull_outmost"` partitioned | | Reproducing legacy results | `"furthest_sum"` (PCHA), `"random"` [archetypes], `"dirichlet"` directional | Choose along two main axes: how much you trust the geometry, and how much restart diversity you can afford. Strongly biased initializers such as `furthest_sum` and `hull_outmost` are useful when the convex geometry is trusted or full optimization runs are expensive, but repeated starts often return the same boundary configuration. `random` is the opposite extreme: it makes the fewest geometric assumptions and explores different basins across restarts, but it follows the empirical sampling density, so sparse corners may require a larger `nrep`. `kmeans_pp` and `aa_pp` are the practical middle ground. They keep stochastic variation across restarts while biasing the draw toward points that improve coverage or reconstruction. Use them when you want a better first guess than `random` without committing as strongly to one boundary-seeking geometry. If an extreme corner is absent from the sample entirely, no row-selecting initializer can recover it without changing the model, the data collection, or the candidate set. If the geometry is visibly nonlinear, changing the model is usually more important than fine-tuning the Euclidean initializer. Use `archetypes_kernel_pgd()` with a suitable kernel, then choose one of the kernel-compatible initializers such as `"kmeans_pp"` or `"furthest_sum"`. For memory efficiency and speed, any initializer can be restricted to a candidate batch with `batch_size`. The default `batch_type = "distal"` samples candidate rows in proportion to their squared distance from the data center, which is the coreset idea of @Mair2019: if archetypes are expected near the boundary, it is wasteful to spend most candidate memory on central points. Use `batch_type = "uniform"` when the goal is an unbiased mini-batch approximation, especially for `method = "aa_pp"`. ### Initialization Caveats {#initialization-caveats} The default `scale = FALSE` in `run_aa()` preserves the data on its supplied feature scale. Setting `scale = TRUE` applies column-wise standardization (z-scoring) before fitting; this often improves the conditioning of Gaussian Euclidean fits, but it interacts differently with each initialization strategy. @Mair2023 show that FurthestSum is sensitive to preprocessing: standardization can distort the Euclidean distances it uses to choose boundary points. AA++ is less affected because it samples from residuals tied directly to the AA objective [@Mair2023]. The cost of this objective-aware sampling is runtime. Among the built-in initializers, `"aa_pp"` is usually the slowest because after the first two points it solves a small AA projection problem at each step, essentially a greedy AA analysis. In return, each added point is expected not to increase the reconstruction cost, and the expected cost decreases whenever the new point expands the current hull. Initializer names are not supported identically by every model family. The Euclidean solvers (`method = "pgd"` and `"nnls"`) call `aa_init()` on the preprocessed data, so they accept the full catalogue of initializer strings described above, as well as custom initializer functions and coordinate matrices. PAA uses family-specific parameter profiles and does not support the Euclidean `scale` argument. Kernel AA is the main exception. With `method = "kernel"`, initializer strings are limited to `"random"`, `"dirichlet"`, `"furthest_first"`, `"kmeans_pp"`, and `"furthest_sum"`. Row-selection methods can be expressed using selected row indices and pairwise distances from the kernel Gram matrix; `"dirichlet"` is also available because it constructs the coefficient matrix $B$ directly and does not require input-space archetype coordinates. Candidate batching is supported for these kernel initializers, with distal batches computed from kernel-space center distances. The remaining strings are not accepted as kernel initializers: `"aa_pp"` requires convex-hull residual projections and `"hull_outmost"` requires explicit hull candidates. Those operations would need separate kernel-space implementations. If you need one of those starts for a kernel fit, pass an explicit `K x n` non-negative coefficient matrix instead; row indices or row names are also accepted when you want to select observed rows directly. The directional solver also accepts the `aa_init()` method strings, but it defaults to `"dirichlet"` for historical reasons. ## Method Reference {#method-reference} ### `random` Selects $K$ rows of $X$ uniformly at random without replacement. ```{r uniform-ex, eval = FALSE} X <- as.matrix(toy) K <- 3 init <- aa_init(X, K = K, method = "random") ``` Uniform sampling is the simplest possible initialization and requires no distance computations. By design it explores the whole geometry with equal probability, so it makes few assumptions about where useful archetypes should lie. When combined with many random restarts via the `nrep` argument in `run_aa()`, even simple uniform initialization can find good solutions as pointed out in [@Krohne2019]. ### `dirichlet` Instead of sampling archetypes from the data, `dirichlet` constructs the coefficients matrix $B$ directly by sampling each row from a symmetric Dirichlet distribution $\text{Dirichlet}(\alpha \mathbf{1}_N)$. Thus the resulting archetypes are random convex combinations of **all** data points. The `alpha` parameter controls the shape of the distribution: * **`alpha = 1`** (default): all possible combinations are equally likely * **`alpha < 1`**: sparse combinations are more likely, so only a few points contribute to each archetype. * **`alpha > 1`**: near uniform combinations are more likely, so many points contribute to each archetype. As $\alpha \to 0$ the behavior approaches `random` (archetypes snap to data points), while for $\alpha \gg 1$ the archetypes are more likely to lie near the center of the data hull. The plot below compares initial archetype placements across three variants to demonstrate the convergence of `dirichlet` to `random`. ```{r dirichlet-vs-random-plot, fig.height = 3, fig.width=7, results='hold', echo=FALSE} X <- as.matrix(toy) K <- 3 dir_variants <- list( `random` = list(method = "random"), `dirichlet alpha = 1` = list(method = "dirichlet", alpha = 1), `dirichlet alpha = 0.01` = list(method = "dirichlet", alpha = 0.01) ) n_dir_runs <- 3 dir_inits <- list() for (variant in names(dir_variants)) { args <- dir_variants[[variant]] dir_inits[[variant]] <- list() for (run in 1:n_dir_runs) { if (args$method == "dirichlet") { dir_inits[[variant]][[run]] <- aa_init( X, K = K, method = args$method, alpha = args$alpha ) } else { dir_inits[[variant]][[run]] <- aa_init( X, K = K, method = args$method ) } } } op <- par(mfrow = c(1, 3), mar = c(2, 2, 2.5, 0.5), bty = "n") run_cols <- adjustcolor(c("firebrick3", "darkorange3", "purple4"), 0.7) for (nm in names(dir_inits)) { plot( X, pch = 16, cex = 0.45, col = "grey75", main = nm, xlab = "", ylab = "", xlim = range(X[, 1]) + c(-1, 1), ylim = range(X[, 2]) + c(-1, 1), axes = FALSE ) for (run_ix in 1:length(dir_inits[[nm]])) { draw_simplex( dir_inits[[nm]][[run_ix]]$A, border = run_cols[run_ix], fill = adjustcolor(run_cols[run_ix], 0.06), lty = run_ix, cex = 1.1 ) } if (nm == names(dir_inits)[1]) { legend( "topleft", legend = paste("Run", 1:n_dir_runs), col = run_cols, lty = 1:n_dir_runs, lwd = 2, pch = 17, bty = "n", cex = 0.7 ) } } par(op) ``` `"random"` archetypes (left) always coincide with data points and cluster near high-density regions. `"dirichlet"` with $\alpha = 1$ (center) samples weights from all data points and often produces archetypes near the center of the hull. Lowering $\alpha$ (right) makes weights sparse, which pushes the random mixtures toward specific data points and toward the behavior of `"random"`. ### Furthest First (`"furthest_first"`) Selects the first archetype at random, with bias toward points far from the mean, and then iteratively adds the single point that is *furthest* from the already-selected set (the nearest-archetype distance to the set is maximized). ```{r furthest-first-ex, eval = FALSE} X <- as.matrix(toy) K <- 3 init <- aa_init(X, K = K, method = "furthest_first") ``` Furthest First is a greedy adaptation of the $k$-center algorithm. It produces well-separated archetypes and is deterministic after the first random draw. Because it targets the globally farthest point at each step, it tends to spread archetypes evenly across the data hull, though this spread is not specifically tuned to the AA reconstruction objective. ### k-means++ (`"kmeans_pp"`) Selects the first archetype at random, with bias toward points far from the mean, and then samples each subsequent archetype with probability proportional to the *squared* distance to the nearest already-selected archetype. ```{r kmeans-pp-ex, eval = FALSE} X <- as.matrix(toy) K <- 3 init <- aa_init(X, K = K, method = "kmeans_pp") ``` The k-means++ sampling rule is a softer version of *Furthest First*. Instead of always picking the single farthest point, it gives more chances to points that are far from the already-selected archetypes. It can also be thought of as an approximation of *AA++* which uses the squared distance to the nearest archetype as a proxy for the distance to the convex hull of the already-selected archetypes (distance to corners instead of distance to faces). ### Furthest Sum (`"furthest_sum"`) Selects archetypes by maximizing the *sum* of distances to all current archetypes [@Morup2012]. An optional refinement phase, controlled by `refinement_steps` (default 10), cycles through the selected set and replaces each archetype with a better candidate to remove the effect of the initial random selection. ```{r furthest-sum-ex, eval = FALSE} X <- as.matrix(toy) K <- 3 # Default refinement init <- aa_init(X, K = K, method = "furthest_sum") # More aggressive refinement init <- aa_init(X, K = K, method = "furthest_sum", refinement_steps = 30) # No refinement init <- aa_init(X, K = K, method = "furthest_sum", refinement_steps = 0) ``` FurthestSum is the historical default initialization for AA [@Morup2012] and remains the default when `init = NULL` is passed to `run_aa()`. ### Batched Coreset-Style Initialization Supplying `batch_size` restricts initialization to a candidate batch before selecting archetypes. With the default `batch_type = "distal"`, candidates are sampled with probability proportional to each point's squared distance from the data mean. This is the coreset idea of [@Mair2019]: reduce the effective problem size while preserving the extremal structure that boundary-seeking methods such as `furthest_sum` rely on. ```{r coreset-ex, eval = FALSE} X <- as.matrix(toy) K <- 3 batch_size <- 60 # batch_size must be at least K init <- aa_init(X, K = K, method = "furthest_sum", batch_size = batch_size) ``` For large datasets where computing all pairwise distances is expensive, the batch reduces memory use and runtime. Because distal batching downweights the center, it avoids spending most candidate slots in regions where archetypes are unlikely to be useful. ### AA++ (`"aa_pp"`) AA++ [@Mair2023] selects each archetype by sampling with probability proportional to the *squared residual distance from the convex hull* of the already-selected archetypes. Concretely, after selecting $k - 1$ archetypes it solves a small NNLS projection to find, for each data point, its best approximation as a convex combination of the current archetypes; the squared reconstruction error becomes the sampling weight for the $k$-th archetype. ```{r aa-pp-ex, eval = FALSE} X <- as.matrix(toy) K <- 3 init <- aa_init(X, K = K, method = "aa_pp") ``` Because the sampling probability is tied to the AA objective itself rather than to a surrogate (such as nearest-archetype distance), AA++ has a formal guarantee: each new archetype decreases the expected objective function^[Proposition 3.2 of [@Mair2023]]. Compared to the other initialization methods, AA++ is more computationally intensive because it requires solving $K-2$ NNLS problems (the first two archetypes are selected via `kmeans_pp`) in order to compute the sampling probabilities for each subsequent archetype. To mitigate this cost, AA++ can be used with `batch_size` to give a Monte Carlo approximation of the exact method. See the next section for details. ### Batched AA++ AA++ can also use `batch_size`, giving a variant of the Monte Carlo AA++ approximation. Instead of projecting the full dataset at each step, it draws a candidate batch, projects only those rows, and samples the next archetype from their residuals. ```{r aa-pp-mc-ex, eval = FALSE} X <- as.matrix(toy) K <- 3 batch_size <- 100 batch_type <- "uniform" # Larger batch_size -> closer approximation to exact AA++ init <- aa_init(X, K = K, method = "aa_pp", batch_size = batch_size, batch_type = batch_type ) ``` The `batch_size` parameter trades approximation accuracy for speed and must satisfy `batch_size >= K`. When `batch_size == nrow(X)` the method reduces to exact AA++. When `batch_size << nrow(X)` only a small fraction of the data is evaluated at each step, making batched AA++ useful for large datasets where exact AA++ becomes memory-bound. Use `batch_type = "uniform"` for the least biased approximation, or keep the default `batch_type = "distal"` if you want the candidate batches to emphasize likely boundary points. ### Hull-outmost (`"hull_outmost"`) Builds a pool of candidate hull vertices, then ranks them by an outmost-vote criterion: each candidate votes for the vertex farthest from it, and the $K$ most-voted candidates are selected as archetypes. The `hull_method` argument (required) controls how the candidate pool is constructed: * **`"full"`** — exact convex hull of $X$ in the original feature space. Uses `grDevices::chull` for 2-D data and the `geometry` package for higher dimensions. * **`"projected"`** — union of convex hulls computed across random low-dimensional projections of $X$. Controlled by `projected_dim` (default `2`) and `n_projection_max`. * **`"partitioned"`** — union of hull vertices from random partitions of the data rows. Controlled by `n_partitions` (default `10`). The fastest strategy. ```{r hull-outmost-ex, eval = FALSE} X <- as.matrix(toy) K <- 3 # Full hull — exact but requires 'geometry' for D > 2 init <- aa_init(X, K = K, method = "hull_outmost", hull_method = "full") # Projected hull — works in any dimension without extra packages init <- aa_init(X, K = K, method = "hull_outmost", hull_method = "projected", projected_dim = 2 ) # Partitioned hull — fastest, suitable for large n init <- aa_init(X, K = K, method = "hull_outmost", hull_method = "partitioned", n_partitions = 15 ) ``` The partitioned and projected strategies are designed for datasets where computing the exact hull is intractable. The `use_unique_candidates` argument (default `FALSE`) controls whether duplicate hull candidates (a point appearing in multiple projections or partitions) cast only one vote. ## Session Information ```{r session-info, echo = FALSE} sessionInfo() ``` ## References