--- title: "Introduction to Archetypal Analysis with yaap" bibliography: references.bib output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Introduction to Archetypal Analysis with yaap} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 5, fig.align = "center" ) set.seed(42) ``` ## What is Archetypal Analysis? Archetypal analysis (AA) is an unsupervised learning method that describes every observation as a *convex mixture*^[weighted sums with non-negative weights summing to 1] of a small number of extreme, idealized profiles called **archetypes** [@Cutler1994]. Unlike PCA, whose components are abstract mathematical directions, or k-means, whose clusters are represented by centroids, archetypes are located at the *boundary* of the data representing the distinct extremes. AA combines cluster-style interpretability with matrix-factorization flexibility. In practice, archetypes often correspond to "pure" or "extreme" profiles. This representation is useful when observations are naturally viewed as mixtures of biologically, chemically, or behaviorally distinct states. Formally, given a data matrix $X \in \mathbb{R}^{N \times M}$ (N samples, M features), AA finds: - **Archetypes** $A \in \mathbb{R}^{K \times M}$ : K extreme profiles, each of which is itself a weighted average of actual data points ($A = BX$, with $B \geq 0$, $B\mathbf{1} = \mathbf{1}$), - **Compositions** $S \in \mathbb{R}^{N \times K}$ : for each observation, a non-negative weight vector summing to 1 ($S \geq 0$, $S\mathbf{1} = \mathbf{1}$), such that the reconstruction $\hat{X} = SA$ is as close as possible to $X$ in the standard squared Frobenius norm^[sum of squared residuals]: $$\min_{B,S} \|X - SBX\|_F^2 \quad \text{s.t.} \quad A = BX,\; B,S \geq 0,\; B\mathbf{1} = S\mathbf{1} = \mathbf{1}.$$ The constraint $A = BX$ keeps archetypes *inside* (or on the boundary of) the convex hull of the data, making them directly interpretable as extreme prototypes. The composition vectors $S$ correspond to points in the archetype simplex: a composition of $(1,0,0)$ means a sample is entirely characterized by archetype 1, while $(1/3,1/3,1/3)$ means it sits equally between all three. ### Relationship to other methods |Method |Main Focus |Sample Representation |Component Type |Interpretability |Flexibility | |:-------------------|:--------------------------------------|:-----------------------------------------|:----------------------------|:---------------------|:----------------| |PCA / SVD |Directions of maximal variance |Linear combination |Abstract latent directions |Often moderate to low |High | |NMF |Additive parts-based representation |Non-negative combination |Non-negative latent parts |Moderate to high |Moderate to high | |k-means |Central cluster prototypes |Hard assignment to one cluster |Cluster centroids |High |Low to moderate | |k-medoids |Representative observed prototypes |Hard assignment to one observed prototype |Observed data points |High |Low | |Archetypal Analysis |Extreme profiles / corners of the data |Convex combination of archetypes |Observed-data-based extremes |High |Moderate to high | --- ## The yaap package **yaap** (Yet Another Archetypes Package) provides fast, flexible AA for `R`. This vignette is organized into four parts: 1. **Core workflow**: fit models with `run_aa()`, inspect `archetypes` objects, visualize archetypes and compositions, and predict for new data. 2. **Practical choices**: choose `K`, scale features, use multiple restarts, and check convergence. 3. **Alternative formulations**: use the `delta` relaxation, robust fitting, and missing-data fitting. 4. **Algorithm details**: compare NNLS and PGD, and review the main tuning arguments. ```{r library, warning=FALSE} library(yaap) library(generics) ``` ## A first example: the toy dataset The package ships a small two-dimensional `toy` dataset (originally from the [`archetypes`](https://cran.r-project.org/package=archetypes)) that contains 250 points scattered around three vertices of a triangle. ```{r load-toy} toy <- read.csv(system.file("extdata", "toy.csv", package = "yaap")) head(toy, 3) dim(toy) ``` ```{r plot-toy, fig.cap = "The toy dataset: three clusters near the corners of a triangle."} plot(toy, asp = 1., frame.plot = FALSE, axes = FALSE, main = "Toy Dataset", xlab = "", ylab = "", pch = 19, cex = 0.6, col = "gray50") ``` We fit an AA model with $K = 3$ archetypes using `run_aa()`. ```{r fit-toy} fit <- run_aa(toy, K = 3, nrep = 3) fit ``` The output reports the final $R^2$ and convergence status. ### The `archetypes` result object `run_aa()` returns an S3 object of class `archetypes`. It is a list with the following main components: | Component | Dimensions | Description | |----------------|------------|---------------------------------------------------------| | `coordinates` | K x M | Archetype profiles in feature space | | `coefficients` | K x N | Weights expressing each archetype as a mix of samples | | `compositions` | N x K | Barycentric coordinates of each sample | | `loss` | (iter) x R | Per-iteration RSS, $R^2$, and method-specific QC | | `converged` | scalar | Whether the optimizer converged | ```{r inspect} coordinates(fit) ``` The fundamental identity $\hat{X} = SA$ is: ```{r reconstruction, eval=FALSE} X_hat <- compositions(fit) %*% coordinates(fit) ``` Let us verify this recovers the data well: ```{r identity-check, fig.cap="Reconstruction of toy dataset. Based on figure from `archetypes` package vignette."} X <- as.matrix(toy) X_hat <- fitted(fit) R <- resid(fit) # residuals: X - X_hat (rss <- norm(R, "F")^2) # residual sum of squares 1 - rss / norm(X, "F")^2 # R^2 plot(X, type = "n", asp = 1., frame.plot = FALSE, axes = FALSE, main = "Toy Dataset Reconstruction", xlab = "", ylab = "") ix <- rowSums(R > 0.5) > 0 # highlight large residuals segments(X[, 1], X[, 2], X_hat[, 1], X_hat[, 2], col = "firebrick", lwd = 0.5) points(X , pch = 20, cex = 0.8, col = "gray50") points(X_hat, pch = 1, cex = 0.7, col = "firebrick") legend("topleft", legend = c("Data", "Reconstruction", "Residuals"), bty = "n", col = c("gray50", "firebrick", "firebrick"), pch = c(20, 1, NA), lwd = c(NA, NA, 0.5)) ``` In the reconstruction plot, data points (gray disks) inside the archetype convex hull have exact reconstructions (red circles). Points outside the hull are projected to its edges (red segments). The `$loss` data frame records the per-iteration optimization trajectory and method-specific diagnostics. These columns help assess convergence and model quality. The exact columns depend on the method, but typical columns include: - `loss`: working objective (e.g., RSS), - `r2`: proportion of variance explained (based on `loss`) - `rloss`: running objective for methods that do not always improve the objective - `k_S`: condition number of the composition matrix (high $\to$ near-degenerate compositions) for methods that involve solving linear systems with it. - `k_A`: condition number of the coordinate (similar logic to `k_S`) Note that `loss` values may differ from the RSS computed directly above because the working objective operates on internally scaled data. See *Algorithm details* for a full discussion. ### Integration with `broom` `yaap` also supports broom-style methods for fitted `archetypes` objects. These methods return ordinary tibbles for downstream inspection, plotting, and reporting. `tidy()` returns model matrices in long form. By default it returns the archetype coordinates: one row per archetype-feature pair, with columns `archetype`, `term`, and `value`. ```{r broom-tidy-coordinates} tidy(fit) ``` The same verb can return other fitted matrices. For example, `matrix = "compositions"` returns one row per sample-archetype pair; `value` is the fitted weight of that archetype in that sample. `matrix = "coefficients"` similarly returns the weights that express each archetype as a mixture of training samples. ```{r broom-tidy-compositions} fit |> tidy(matrix = "compositions") |> head() ``` `glance()` returns one row of model-level diagnostics, including the number of archetypes, convergence status, final loss, final $R^2$, number of iterations, adapted AIC, and model family. ```{r broom-glance} glance(fit) ``` `augment()` returns the original data with one dot-prefixed composition column per archetype. The `.A1`, `.A2`, and `.A3` columns are the fitted barycentric weights for each sample, so they sum to one row by row. ```{r broom-augment} aug <- augment(fit, data = toy) head(aug) comp_cols <- paste0(".", anames(fit)) range(rowSums(as.matrix(aug[, comp_cols]))) # all should be 1 ``` ### Visualizing the fit `plot()` method for `archetypes` objects offers several `what` options. #### Archetype positions in feature space This method plots the original data points, overlays the archetypes from the `coordinates` matrix, and connects them with lines. ```{r plot-coordinates, fig.cap = "Data and estimated archetypes. The triangle connects the three archetypes."} plot(fit, what = "coordinates", main = "Archetype positions in feature space", xlab = "", ylab = "", asp = 1, frame.plot = FALSE, axes = FALSE, pch = 17, cex = 1.5, col = "firebrick", args.data.scatter = list(pch = 19, cex = 0.5, col = "gray50")) ``` As expected, the archetypes lie close to the corners of the point cloud. Plot functions invisibly return the data used to construct the figure in a `ggplot2` friendly format. Use `plot = FALSE` to prepare those data without drawing. For example, the following code reproduces the previous figure: ```{r ggplot-example, eval=FALSE} library(ggplot2) plot(fit, what = "coordinates", plot = FALSE) |> ggplot(aes(x, y)) + geom_point(aes(color = archetype)) ``` ## Working with "real" data: Fisher's iris Besides tabular data (like `data.frame` or `matrix`), `run_aa()` also accepts **formula** and **data frame** inputs. Formula input is useful when the data contains a response column to exclude: the response is silently dropped and all right-hand-side predictors are used as features. ```{r fit-iris} # 'Species' (the response) is discarded; the four measurements are used. fit_iris <- run_aa(Species ~ ., data = iris, K = 3, scale = TRUE, tol = 1e-4) fit_iris ``` This is equivalent to calling `run_aa(iris[, 1:4], ...)` Because the four measurements are on different scales, we set `scale = TRUE` to apply column-wise z-scoring before fitting. In the following subsections, we analyze the fitted model in a typical workflow: - We first locate the archetypes in feature space using a PCA projection. - We then identify them from the observations that define them and characterize what they represent through a feature-profile plot. - We explore how the 150 samples distribute across the archetype simplex. - We finally show how the fitted model predicts compositions for new observations. ### Archetype positions in feature space Since the iris dataset has four features, the coordinate scatter requires dimensionality reduction. Pass `projection = "pca"` to project both data and archetypes onto the first two principal components. ```{r iris-coordinates, fig.cap = "PCA projection of the iris data and its three archetypes (triangles)."} pal_sp <- c(setosa = "#E69F00", versicolor = "#56B4E9", virginica = "#009E73") iris_col <- pal_sp[as.character(iris$Species)] plot(fit_iris, what = "coordinates", projection = "pca", frame.plot = FALSE, main = "PCA projection of iris data and archetypes", col = "firebrick", pch = 17, cex = 1.5, args.data.scatter = list(col = iris_col, pch = 19, cex = 0.6)) ``` ### Naming archetypes from their defining observations Each archetype is a convex mixture of actual data points, described by the `coefficients` matrix (K x N). To name the archetypes, we summarize which species contribute most to each archetype. ```{r iris-species-archetypes-match} B <- coef(fit_iris) # K x N # Sum coefficients within each species (rows = species, cols = archetypes) coef_by_species <- aggregate(t(B), by = iris["Species"], FUN = sum) knitr::kable(coef_by_species, digits = 3) ``` The table indicates that one archetype combines "setosa" and "versicolor", while the other two are predominantly "setosa" and "virginica", respectively. The assigned labels reflect those species contributions. ```{r iris-rename-archetypes} # Concat all species contributing to an archetype coef_mat <- as.matrix(coef_by_species[, -1]) # drop species column rownames(coef_mat) <- coef_by_species$Species species_contrib <- which(coef_mat > 0.1, arr.ind = TRUE) species_contrib <- aggregate( row ~ col, data = species_contrib, FUN = function(ix) { species <- rownames(coef_mat)[ix] species <- substr(species, 1, 3) # abbreviate species names paste(species, collapse = ".") } ) anames(fit_iris) <- species_contrib$row anames(fit_iris) ``` ```{r iris-palette, include = FALSE} pal_aa <- c(set = "#E69F00", set.ver = "#E7E300", vir = "#009E73") pal_aa <- pal_aa[anames(fit_iris)] # reorder to match archetype order ``` ### Archetype feature profiles `what = "profiles"` shows a grouped barplot of the archetype coordinates on the original feature scale. All other arguments are passed to `barplot()`. ```{r iris-profiles, fig.cap = "Archetype feature profiles. Each group of bars is one feature; colors distinguish archetypes."} plot(fit_iris, what = "profiles", horiz = TRUE, names.arg = sub("\\.", "\n", colnames(iris[, 1:4])), las = 1, col = pal_aa, xlab = "Feature value (original scale)", ylab = "", main = "Archetype feature profiles") ``` ### Sample compositions Named archetypes make the composition space easier to interpret. ```{r iris-simplex, fig.cap = "Ternary plot of iris compositions, colored by species. Each point is a sample placed by its three archetype weights."} plot(fit_iris, what = "simplex", col = pal_sp[as.character(iris$Species)], pch = 19, pca = TRUE, col.pca = "black", robust = FALSE) legend("topright", legend = names(pal_sp), col = pal_sp, pch = 19, bty = "n") ``` The ternary structure shows a clear separation of species in composition space: setosa concentrates near one vertex, virginica near another, and versicolor occupies intermediate mixtures along the lower edge (small setosa contribution). The black curved line is the PCA trajectory projected into the simplex, summarizing the dominant continuous transition from setosa-like to virginica-like compositions through the hybrid archetype of setosa+versicolor. The ternary plot also shows why the hybrid archetype has a stronger setosa contribution even though more versicolor samples lie near it on average: AA searches for extreme observations near the data hull boundary, where one setosa outlier lies. This illustrates a general property of AA: archetypes anchor at the geometric boundary, so even a single extreme observation can pull a vertex toward it. For analyses where outliers are a concern, see *Robust fitting*. When samples are arranged by PC1, the composition barplot shows a transition from virginica-like to setosa-like samples, with the hybrid archetype in between. ```{r iris-compositions, fig.cap = "Composition barplot: each bar is one iris sample. Single-color bars are near-pure specimens; blended bars sit between archetypes."} plot(fit_iris, what = "compositions", main = "Archetype compositions across iris samples", cluster_rows = "PC1", cluster_cols = FALSE, col = pal_aa ) ``` Other potential orderings of the samples include: - the angle each sample forms in the PC1-PC2 plane (`cluster_rows = "AOP"`), or - the order that `hclust()` arranges them if `cluster_rows = TRUE` See `plot_archetypes_compositions()` for details. ### Predicting reconstructions and compositions for new observations Once a model is fitted, `predict()` maps new observations to the archetype convex hull without re-fitting. The resulting projection can be represented in two ways: - in the original feature space as the training data (default) with `type = "reconstruction"` - as barycentric coordinates in the archetype simplex with `type = "compositions"` ```{r predict} new_obs <- matrix( c(5.1, 3.5, 1.4, 0.2, # setosa-like 6.7, 3.0, 5.2, 2.3), # virginica-like nrow = 2, byrow = TRUE, dimnames = list(c("obs1", "obs2"), colnames(iris[, 1:4])) ) print(new_obs) predict(fit_iris, newdata = new_obs) predict(fit_iris, newdata = new_obs, type = "compositions") ``` ## Practical considerations ### Choosing `K` The main tuning parameter in archetypal analysis is `K`: the number of archetypes. A traditional exploratory approach is to fit a sequence of models and inspect a scree plot or elbow plot. The elbow is the smallest `K` after which reconstruction quality metrics improve only modestly. Typical metrics used for this purpose are: - **Reconstruction loss** (`loss`): the sum of squared residuals between the original data and the reconstructed data. - **Unexplained variance** (`1 - r2`): the proportion of variance in the original data that is not captured by the archetypes. Alongside elbow plots, `yaap` also provides an **adapted AIC criterion** (`AIC`) based on [@Suleman2017] that balances model fit and complexity, which can be used to select `K` more systematically. Unlike the other metrics we discussed, the adapted AIC is not monotonic in `K`, so we are not looking for an elbow but rather for the minimum value, which indicates the best trade-off between fit and parsimony. Here we fit a "path" of archetypes models for `K = 1, ..., 6` to the toy dataset, using the `archetypes_path()` function, which fits multiple models in a single call and reuses preprocessing. ```{r k-selection-fit, warning=FALSE} fit_path <- archetypes_path( toy, K = 6, init = "random", max_iter = 100, nrep = 5 ) ``` The resulting `fit_path` object contains the fitted models for all `K`, which can be accessed with `[[` using their index or `K` value, for example `fit_path[[3]]` (third model) or `fit_path[["K3"]]` (model with `K = 3`). The path object also supports plotting the scree and adapted AIC curves across `K` values with `screeplot()`. ```{r k-selection-plot, results = "hold", fig.width = 9, fig.height = 3.5, out.width = "100%", fig.cap = "RSS, unexplained variance, and adapted AIC across candidate values of K."} op <- par(mfrow = c(1, 3), mar = c(4, 4, 2.5, 0.5), bty = "n") selected_K <- 3 fit_best <- fit_path[[selected_K]] # here equivalent to fit_path[["K3"]] screeplot( fit_path, y = "loss", col = "steelblue4", frame.plot = FALSE, ylab = "Reconstruction loss", main = "RSS" ) abline(v = selected_K, lty = 2, col = "firebrick") screeplot( fit_path, y = function(fit) 1 - utils::tail(fit$loss$r2, 1L), col = "steelblue4", frame.plot = FALSE, ylab = expression(1 - R^2), main = "Unexplained Variance" ) abline(v = selected_K, lty = 2, col = "firebrick") screeplot( fit_path, col = "steelblue4", frame.plot = FALSE, ylab = "Adapted AIC", main = "Adapted AIC" ) abline(v = selected_K, lty = 2, col = "firebrick") par(op) ``` For this dataset, all three diagnostics agree that the optimal number of archetypes is `K = 3`, the true number of generating vertices. The RSS and unexplained-variance curves show a clear elbow at `K = 3`. The adapted AIC has a strong minimum at `K = 3`. In real analyses, these plots are diagnostics rather than automatic rules. AA optimization is non-convex, so single-restart curves can be noisy and may not decrease monotonically for every larger `K`. Use more restarts and domain knowledge when the elbow or adapted AIC is ambiguous. ### Scaling | Situation | `scale` value | Effect | |-----------------------------------------|----------------|-----------------| | Default/natural feature metric | `FALSE` | Data used as-is | | Mixed-unit Gaussian features | `TRUE` | z-score columns | | Custom feature importance | Numeric vector | Divides columns | The standard AA assumes Euclidean geometry and Gaussian error structure, so the optimization is carried out by minimizing the sum of squared residuals (RSS). Scaling can therefore materially affect the results. By default, `run_aa()` does not transform the input data. The decision of whether to scale is similar to that in PCA: if the features are on different scales, set `scale = TRUE` to give them equal weight in the analysis. Otherwise, those features may have little influence on the fit. Custom scaling by passing a vector of scale factors is also supported similar to how `scale()` works in base R (by dividing the columns). The difference between manually normalizing the data and setting `scale = FALSE` versus letting `run_aa()` handle the scaling internally is that the archetypes are returned on the scale of the input data. For interpretability, pass data on the scale you want the resulting archetypes to be in. Besides `scale`, `run_aa` also has an `sd_threshold = 1e-6` argument to remove features with near-zero variance, which can cause numerical instability and convergence issues. During reconstruction, the removed features are filled with the mean value of the original data. When some samples are more reliable or informative than others, pass sample weights through the `weights` argument. ### Multiple restarts Because the AA objective is non-convex, the solution is sensitive to the initial archetype coordinates. `run_aa()` supports multiple independent restarts via the `nrep` argument. Setting `nrep > 1` returns only the best solution across all runs, so the output is always a single `archetypes` object. Multiple restarts are most useful when the initializer can produce different starting points across runs. The default initialization method is `"furthest_sum"`. Although its starting seed is random, the procedure tends to have low variance because it greedily seeks a good initial guess. In contrast, `"random"` produces very different starting points, which lets the search explore the space more broadly. Below we run both methods five times on the `iris` dataset with `K = 3` and compare the initial and final loss across runs. ```{r furthest-sum-experiment} runs <- do.call(rbind, lapply(1:5, function(k) { fit_fs <- run_aa(iris[, 1:4], K = 3, nrep = 1) fit_uni <- run_aa(iris[, 1:4], K = 3, nrep = 1, init = "random") data.frame( run = k, fs_start = fit_fs$loss$loss[1], fs_end = tail(fit_fs$loss$loss, 1), uni_start = fit_uni$loss$loss[1], uni_end = tail(fit_uni$loss$loss, 1) ) })) colnames(runs) <- c("run", "furthest_sum start", "furthest_sum end", "random start", "random end") knitr::kable(runs, digits = 1) ``` `furthest_sum` starts closer to the final solution, while `random` starts are more variable. However, both methods converge in similarly good solutions, with the `random` method often even outperforming `furthest_sum` in final loss. See the **Initialization vignette** for a more detailed discussion and comparison of all available initializers. ### Convergence If a `run_aa()` call does not converge, it emits a warning. The status of the fit is stored in the `converged` slot of the output object, and is also printed out when the object or its summary are printed. The full optimization trajectory is available in the `loss` data frame and can be plotted with `plot(fit, what = "loss")`. ```{r check-convergence, fig.cap = "RSS across iterations for the iris model."} fit_iris$converged tail(fit_iris$loss, 1) plot(fit_iris, what = "loss", frame.plot = FALSE, type = "b", main = "Convergence of the iris model") ``` Typically, loss decreases quickly at first and then plateaus near a local minimum. If the model has not converged, increase `max_iter` (default 100) or add more restarts. If the loss decreases very slowly, increasing `max_iter` is unlikely to help; consider increasing `tol` or `tol_r2` instead. This will not improve the fit, but it can reduce runtime and avoid unnecessary non-convergence warnings. --- ## Variants of Canonical AA and their implementation in yaap This vignette covers only the Euclidean geometries with Gaussian errors setting. For **Non-Gaussian and Metric Variants of Archetypal Analysis** see the corresponding vignette. ### Relaxing the convex hull constraint with `delta` By default (`delta = 0`) archetypes are constrained to the *convex hull* of the data, i.e., they are strictly weighted averages of actual observations. When the data is a **truncated** version of a larger pattern, the genuine extremes are absent and the constrained model cannot recover the true vertices and instead anchors to the most extreme observed points. Setting `delta > 0` relaxes this constraint and allows archetypes to step *outside* the data hull [@Morup2012]. The following example makes this concrete with synthetic data drawn from a **truncated equilateral simplex**: any composition weight above 0.8 is removed, so no sample is close to any true vertex. ```{r delta-data-fit, warning=FALSE, fig.cap = "Truncated simplex: Larger delta lets archetypes escape the data hull and approach the true positions."} K_d <- 3 # number of archetypes (simplex vertices) M_d <- 2 # number of features (2D for easy visualization) N_init <- 1000 # initial number of samples before truncation thresh <- 0.8 # truncation threshold on composition weights # True archetypes: equilateral triangle A_true <- matrix( c(cos(seq(0, 2*pi*(K_d-1)/K_d, length.out = K_d)), sin(seq(0, 2*pi*(K_d-1)/K_d, length.out = K_d))), nrow = K_d, ncol = M_d ) # Dirichlet-like mixing weights; remove observations near any vertex S_raw <- -log(matrix(runif(K_d * N_init), nrow = N_init, ncol = K_d)) S_raw <- S_raw / rowSums(S_raw) S_trunc <- S_raw[rowSums(S_raw > thresh) == 0, ] X_delta <- S_trunc %*% A_true print(nrow(X_delta)) # observations after truncation # Fit models with different delta values delta_vals <- c(0, 0.2, 0.35) res_delta <- lapply( delta_vals, function(d) archetypes_pgd(X_delta, K = K_d, delta = d) ) # Plot all models together with the true archetypes ix <- c(1:3, 1) # close the triangle cols <- c("steelblue", "darkorange", "firebrick") plot(A_true[ix, 1], A_true[ix, 2], type = "l", col = "black", lwd = 2, lty = 3, frame.plot = FALSE, xlab = "", ylab = "", xaxt = "n", yaxt = "n", asp = 1, main = "Effect of delta on recovered archetypes") points(X_delta, pch = 19, cex = 0.4, col = "gray50") for (i in seq_along(res_delta)) { A <- coordinates(res_delta[[i]]) lines(A[ix, 1], A[ix, 2], col = cols[i], lwd = 2) } legend("topright", legend = c("True simplex", paste0("\u03b4 = ", delta_vals)), col = c("black", cols), lty = c(3, 1, 1, 1), lwd = 2, bty = "n") ``` With `delta = 0` the estimated triangle is visibly smaller than the true one and its vertices all lie within the data cloud. Increasing `delta` allows archetypes to reach outside the cloud toward the true vertices. Use `delta` with care: very large values allow archetypes to drift far from the data, losing their interpretability as extreme prototypes. ### Robust fitting Standard AA minimizes squared residuals, making it sensitive to outliers. Setting `robust = TRUE` applies MASS-style `"psi.bisquare"` row reweighting: observations with large residuals receive reduced influence at each iteration, so outliers cannot monopolize an archetype. Below we augment the `iris` dataset with five outliers and compare the results of standard and robust AA. Since the default initialization method (`furthest_sum`) is also sensitive to outliers, we set `init = "random"` to minimize their influence. ```{r robust-data} # Append five outlier observations to iris N_out <- 5 displacement <- c(2, 2, 0, 0) # displace in the first and second features add_outliers <- function(X, N, displacement) { x_mean <- colMeans(X) + displacement # place outliers at a distance from the mean x_sd <- apply(X, 2, sd) * .75 X_out <- rnorm(N * ncol(X), x_mean, x_sd) # recycle x_mean, x_sd rbind(X, matrix(pmax(X_out, .1), nrow = N, byrow = TRUE)) } X <- iris[, 1:4] |> as.matrix() |> add_outliers(N = N_out, displacement = displacement) set.seed(123) # fix starting positions fit_std <- run_aa(X, K = 3, robust = FALSE, tol = 1e-4, init = "random", nrep = 5) set.seed(123) # same starting positions fit_rob <- run_aa(X, K = 3, robust = TRUE, tol = 1e-4, init = "random", nrep = 5) ``` In standard AA, one archetype is pulled toward the outliers. In robust AA, the outliers receive less influence and the archetypes remain near the original data structure. ```{r robust-compare, fig.cap = "PCA projection of iris with five added outliers (crosses). Standard AA (red) is distorted toward the outlier cluster; robust AA (blue) closely recovers the reference triangle fitted on the clean data (dashed black)."} # Project all archetypes to PC space of data pca <- prcomp(X, scale. = TRUE) A_std <- predict(pca, coordinates(fit_std)) A_rob <- predict(pca, coordinates(fit_rob)) A_ori <- predict(pca, coordinates(fit_iris)) # reference: fitted on clean iris # Plot data points plot(pca$x, main = "PCA projection of iris data with outliers", col = "gray60", pch = 19, cex = 0.5, frame.plot = FALSE, asp = 1) # Add outliers points(pca$x[nrow(iris) + 1:N_out, ], col = "black", pch = 8) # Add Archetypes ix <- c(1:3, 1) lines(A_ori[ix, ], col = "black", lwd = 1.5, lty = 2) lines(A_std[ix, ], col = "firebrick", lwd = 2) points(A_std, col = "firebrick", pch = 17, cex = 1.5) lines(A_rob[ix, ], col = "steelblue", lwd = 2) points(A_rob, col = "steelblue", pch = 17, cex = 1.5) # Add legend legend("topright", legend = c("Data", "Outliers", "Reference AA", "Standard AA", "Robust AA"), col = c("gray60", "black", "black", "firebrick", "steelblue"), pch = c(19, 8, NA, 17, 17), lty = c(NA, NA, 2, 1, 1), lwd = 2, bty = "n") ``` The `robust` argument also accepts MASS psi names such as `"psi.huber"` and `"psi.hampel"`; pass tuning constants with `robust_args`. See `MASS::rlm()` for psi details. The `MASS::rlm()` `method = "MM"` estimator is not supported because it is regression-specific and not directly applicable to the AA objective. ### Missing data When `X` contains `NA` values, `run_aa()` automatically switches to the missing-data PGD objective (`missing = TRUE`): only the *observed* entries contribute to the gradient, and the loss is computed over those entries only. ```{r missing-data, warning=FALSE} X <- as.matrix(iris[, 1:4]) # Introduce ~10 % missing at random missing_rate <- 0.1 n_elem <- length(X) # number of elements X_miss <- X idx_miss <- sample(n_elem, size = round(missing_rate * n_elem)) X_miss[idx_miss] <- NA set.seed(123) fit_miss <- run_aa(X_miss, K = 3, nrep = 10, init = "random") # missing = TRUE auto-detected fit_miss ``` The fitted model reconstructs missing entries, which can be compared with the original complete data. ```{r missing-fitted} X_hat_miss <- fitted(fit_miss) any(is.na(X_hat_miss)) # FALSE = all positions reconstructed # compare reconstructed vs original values at missing positions cor(X_hat_miss[idx_miss], X[idx_miss]) ``` --- ## Algorithm details ### Algorithmic approaches: NNLS vs PGD The CRAN package [`archetypes`](https://cran.r-project.org/package=archetypes) implements AA via **alternating NNLS** (non-negative least squares): at each step a constrained NNLS subproblem is solved via `nnls::nnls()`. `yaap`'s `method = "nnls"` implements this alternating NNLS approach. The default `method = "pgd"` uses **projected gradient descent** instead. PGD skips the inner NNLS solves working directly on the gradient of the reconstruction error. Unlike NNLS, which solves a constrained subproblem at each step, PGD works entirely with matrix multiplications, avoiding the overhead of an inner solver. The following benchmark on the toy dataset with 10 restarts illustrates this difference: ```{r benchmark, warning=FALSE} toy_mat <- as.matrix(toy) nrep_bench <- 10 seeds <- 2046 + seq_len(nrep_bench) run_single <- function(method, seed, ...) { set.seed(seed) fit <- suppressWarnings(run_aa( toy_mat, K = 3, nrep = 1, method = method, init = "random", ... )) c(n_iter = nrow(fit$loss), final_loss = norm(resid(fit), "F")) } # PGD benchmark t_pgd <- system.time( res_pgd <- vapply(seeds, run_single, numeric(2), method = "pgd") )[["elapsed"]] n_pgd <- round(median(res_pgd["n_iter", ])) # NNLS benchmark t_nnls <- system.time( res_nnls <- vapply(seeds, run_single, numeric(2), method = "nnls") )[["elapsed"]] n_nnls <- round(median(res_nnls["n_iter", ])) # Run by run comparison win_pgd <- sum(res_pgd["final_loss", ] < res_nnls["final_loss", ]) loss_diff <- res_pgd["final_loss", ] - res_nnls["final_loss", ] diff_pgd <- median(loss_diff / res_nnls["final_loss", ]) diff_nnls <- median(loss_diff / res_pgd["final_loss", ]) bench <- data.frame( Method = c("PGD", "NNLS"), `Median iters` = c(n_pgd, n_nnls), `ms / iter` = 1000 * c(t_pgd / n_pgd, t_nnls / n_nnls), `loss change (%)`= 100 * c(diff_pgd, -diff_nnls), `Wins` = c(win_pgd, nrep_bench - win_pgd), check.names = FALSE ) knitr::kable(bench, digits = 2) ``` Because PGD works directly on the gradient, it also supports extensions such as the slack terms used in the `delta` relaxation and missing residuals in the `missing = TRUE` objective. Both methods can handle robust psi reweighting, as well as weighting schemes passed via the `scale` and `weights` arguments. ### PGD Algorithm The algorithm is described in detail [@Morup2012] and is implemented in the `archetypes_pgd()` function. The solver computes partial derivatives of the loss with respect to each variable, then iteratively updates them in a descent direction. The main steps are: 1. **Initialization**: selects K well-separated data points, giving good initial coverage of the data hull. 2. **S-step**: for fixed archetypes $A$, update compositions $S$ 3. **B-step**: for fixed $S$, update the archetype generators $B$ 4. **a-step** (optional): in case of `delta > 0`, update the slacks `a`, keeping $B$ and $S$ fixed. 5. **Repeat** until the relative change in RSS drops below `tol`, or $R^2$ exceeds `tol_r2`. The problems are solved via projected gradient descent with [line search and backtracking](https://en.wikipedia.org/wiki/Backtracking_line_search). The step size starts at `step_size` and is multiplied by `step_shrinkage` after each rejected proposal; if no improvement is found after `max_iter_optimizer` reductions, the step is skipped entirely. When `pseudo_pgd = TRUE` (default), the gradient is modified before each step to remove the radial direction (that changes the magnitude of the variables), keeping the proposed update close to the simplex surface. If no update is accepted after `max_no_update` backtracking steps, the algorithm halts with a warning of non-convergence. #### Key arguments | Argument | Default | Description | |-----------------------|---------|-------------------------------------------------------| | `max_iter` | 100 | Maximum outer iterations | | `tol` | 1e-4 | Relative RSS change stopping criterion | | `tol_r2` | 0.9999 | R² early-stop threshold | | `nrep` | 1 | Independent random restarts (best returned) | | `scale` | `FALSE` | No column scaling by default | | `delta` | 0 | Convex-hull relaxation (see § *delta*) | | `pseudo_pgd` | `TRUE` | Remove radial component of gradient before step | | `step_size` | 1.0 | Initial step size for the line search | | `max_iter_optimizer` | 10 | Maximum back-tracking steps per outer iteration | | `step_shrinkage` | 0.5 | Shrinkage factor per back-track | | `max_no_update` | 5 | Consecutive no-improvement steps before early halt | #### Loss tracking As introduced in the first example, the `$loss` data frame tracks one row per logged iteration; the column meanings are summarized there. The full sequence for the iris model is: ```{r loss-df} head(fit_iris$loss, 6) ``` The loss metrics may differ from the direct computation (for example `norm(X - fitted(fit_iris), "F")^2`) because the loss is computed on the scaled data or other internal transformations. Thus, the `loss` metrics should only be used for monitoring convergence and comparing different runs of the same model type, not for comparing across different models or datasets. That is why we use the generic term "loss" instead of "RSS" in the column name. The condition numbers `k_S` and `k_A` are diagnostics of numerical stability and are especially relevant for the case of `method = "nnls"` because it involves solving linear systems with these matrices. High `k_S` or `k_A` values signal numerical issues and are worth investigating with additional restarts or a larger `delta`. These diagnostics usually require attention only when convergence is unstable. --- ## Conclusion ### Limitations and challenges Some important practical considerations and limitations of AA: - **Presence of (Poor) Local Minima**: The AA objective is non-convex, so the algorithm is not guaranteed to find the global minimum. - **Initialization sensitivity**: Because of the presence of local minima, the choice of initial archetypes can influence the final solution. - **Number of archetypes**: The number of archetypes is the most important tuning parameter, and its choice is subjective and can affect the interpretability and quality of the fit. - **Sensitivity to Convex Hull**: Standard AA can be sensitive to outliers or misspecification when the true extremes lie outside the observed convex hull. - **Euclidean assumption**: The standard AA formulation assumes a Euclidean geometry (Frobenius norm), which may not be appropriate for all data types. Several methods have been proposed to mitigate these issues and this package implements some of them. The most common strategies are: - **Multiple restarts and initialization**: To combat non-convexity and initialization sensitivity, it is common to run AA with multiple independent restarts (`nrep`) and select the best solution. Sophisticated initialization strategies (see the **"Choosing an Initialization Method for Archetypal Analysis"** vignette) can also avoid poor local minima and improve both solution quality and convergence speed. - **Choosing the number of archetypes**: The number of archetypes `K` is the most important tuning parameter. The choice can be guided by domain knowledge, cross-validation, or information criteria, but it remains a subjective decision that affects the interpretability and quality of the fit. This package provides tools to help with this choice through `archetypes_path()` and `screeplot()`. - **Robust fitting**: Because archetypes are constrained to be convex combinations of the data, an outlier can attract an entire archetype to itself, leading to a potentially misleading result. The option `robust = TRUE` applies robust psi row reweighting to reduce the influence of such observations (see *Robust fitting*). - **Delta relaxation**: If the genuine extremes of the underlying data-generating process are not observed, standard AA will be unable to recover them. To accommodate this situation, the `delta` relaxation allows archetypes to step outside the data hull and better capture true extremes when the data is truncated. - **Non-Gaussian and metric variants**: The standard AA formulation assumes a Euclidean geometry (Frobenius norm), which may not be appropriate for all data types. Extensions of AA to other geometries and data types are presented in the **"Non-Gaussian and Metric Variants of Archetypal Analysis"** vignette. ### Summary | Task | Function / accessor | |-------------------------------|--------------------------------| | Fit a model (matrix) | `run_aa(X, K, ...)` | | Fit a model (formula) | `run_aa(y ~ ., data, K, ...)` | | Extract compositions ($S$) | `compositions(fit)` | | Extract coordinates ($A$) | `coordinates(fit)` | | Extract coefficients ($B$) | `coef(fit)` | | Reconstruct data | `fitted(fit)` | | Residuals | `residuals(fit)` | | Predict new reconstructions | `predict(fit, newdata)` | | Predict new compositions | `predict(fit, newdata, type = "compositions")` | | Rename archetypes | `anames(fit) <- c(...)` | | Loss / convergence | `fit$loss`, `fit$converged` | | Feature profile plot | `plot(fit, "profiles")` | | Composition barplot | `plot(fit, "compositions")` | | Simplex plot | `plot(fit, "simplex")` | | Coordinate scatter | `plot(fit, "coordinates")` | | Convergence plot | `plot(fit, "loss")` | | Robust fitting | `run_aa(..., robust = TRUE)` | | Missing-data fitting | `run_aa(...)` (auto-detects) | ## Session info ```{r session-info} sessionInfo() ``` ## References