--- title: "Tidymodels Workflows with yaap" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Tidymodels Workflows with yaap} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, fig.align = "center", warning = FALSE ) set.seed(42) par(bty = "n") ``` ## Overview `yaap` fits archetypal analysis (AA) models. In a tidymodels workflow, it is best understood as an unsupervised decomposition or feature-extraction tool, closer in role to `recipes::step_pca()` than to a supervised `parsnip` model specification. `yaap` learns archetype coordinates and sample composition weights, then exposes those results through interfaces that work naturally with the tidymodels ecosystem: - `step_archetypes()` is a `recipes` step that learns archetypes during `prep()` and returns archetype composition weights during `bake()`. This vignette uses only lightweight pieces of the ecosystem. Because AA is an unsupervised representation method, the examples focus on preprocessing, feature extraction, and model inspection rather than on `parsnip` fitting. ```{r libraries, message = FALSE} library(yaap) library(generics) library(ggplot2) library(ggtern) library(recipes) library(tune) ``` The examples use the built-in `USArrests` data, the same data used in the `recipes::step_pca()` examples. The four numeric variables are on different scales, so the recipes below normalize them before fitting either PCA or AA. State regions are used only for visualization. ```{r load-usarrests} arrests <- USArrests state <- rownames(arrests) region <- state.region[match(state, state.name)] analysis_K <- 3 region_cols <- c( "Northeast" = "#0072B2", "South" = "#D55E00", "North Central" = "#009E73", "West" = "#CC79A7" ) head(data.frame(state = state, region = region, arrests, row.names = NULL)) ``` The examples below use `K = 3` archetypes so that the resulting composition weights can be inspected on a two-dimensional simplex. ## Using Archetypes in Recipes `step_archetypes()` turns AA into a preprocessing step. During `prep()`, the step fits archetypes on the selected numeric predictors. During `bake()`, it projects data onto the learned archetype simplex and returns one composition column per archetype. The recipes argument is called `num_comp`, following tidymodels naming conventions, but it is the same quantity as `K`. ```{r recipe-fit} rec_base <- recipe(~ ., data = arrests) |> step_normalize(all_numeric()) rec <- rec_base |> step_archetypes( all_numeric(), num_comp = analysis_K, seed = 42, options = list( init = "furthest_sum", max_iter = 100, tol_r2 = 0 ) ) rec_prep <- prep(rec, training = arrests) rec_prep ``` ```{r recipe-bake} baked <- bake(rec_prep, new_data = arrests) head(data.frame(state = state, region = region, baked, row.names = NULL)) aa_cols <- paste0("A", seq_len(analysis_K)) baked[, aa_cols] |> as.matrix() |> rowSums() |> range() ``` By default, the original numeric predictors are removed and replaced by composition columns named `A1`, `A2`, and so on. Set `keep_original_cols = TRUE` when both the normalized predictors and archetype weights should be retained. ```{r recipe-keep-original} rec_keep <- rec_base |> step_archetypes( all_numeric(), num_comp = analysis_K, keep_original_cols = TRUE, seed = 42, options = list( max_iter = 100, tol_r2 = 0 ) ) |> prep(training = arrests) arrests_with_originals <- data.frame( state = state, region = region, bake(rec_keep, new_data = arrests), row.names = NULL ) head(arrests_with_originals) ``` ## Comparing `step_archetypes()` with `step_pca()` The closest tidymodels analogy for `yaap` is `recipes::step_pca()`: both steps learn an unsupervised representation of the numeric predictors during `prep()`. The outputs have different meanings. PCA returns orthogonal score columns such as `PC1`, `PC2`, and `PC3`; AA returns composition weights such as `A1`, `A2`, and `A3` that are non-negative and sum to one for each state. The two recipes below are trained in parallel on the same normalized arrest variables. ```{r compare-pca-aa} rec_aa <- rec_base |> step_archetypes( all_numeric(), num_comp = analysis_K, seed = 42, options = list( init = "furthest_sum", max_iter = 100, tol_r2 = 0 ) ) |> prep(training = arrests) rec_pca <- rec_base |> step_pca(all_numeric(), num_comp = analysis_K) |> prep(training = arrests) aa_scores <- bake(rec_aa, new_data = arrests) pca_scores <- bake(rec_pca, new_data = arrests) score_data <- data.frame( state = state, region = region, pca_scores, aa_scores, row.names = NULL ) head(score_data) ``` ```{r comparison-labels, include=FALSE} aa_cols <- grep("^A[0-9]+$", names(aa_scores), value = TRUE) aa_mat <- as.matrix(score_data[, aa_cols]) simplex_label_ix <- vapply(seq_along(aa_cols), function(j) { which.max(aa_mat[, j]) }, integer(1L)) pca_label_ix <- unique(c( which.min(score_data$PC1), which.max(score_data$PC1), which.min(score_data$PC2), which.max(score_data$PC2) )) label_ix <- unique(c(pca_label_ix, simplex_label_ix)) score_plot_data <- transform( score_data, label = ifelse(seq_len(nrow(score_data)) %in% label_ix, state, NA) ) ``` The PCA step returns orthogonal score columns. The AA step returns convex composition weights over three estimated archetypal states. Inspecting both trained steps highlights the difference: PCA loadings describe directions of variation, while AA coordinates describe extreme profiles in the normalized feature space. ```{r compare-pca-aa-tidy} tidy(rec_pca, number = 2, type = "variance") tidy(rec_aa, number = 2) ``` Next, the scores from both steps are plotted. The PCA scores are plotted in the usual Cartesian coordinate system. The AA composition weights are plotted on a ternary simplex, where each vertex corresponds to one archetype. The same states are labeled in both plots: states selected either because they are extreme on one of the first two principal-component axes or because they are closest to an archetype vertex. ```{r pca-score-plot, fig.width = 6, fig.height = 5, fig.cap = "PCA scores for USArrests. Labels mark states selected as PCA extremes or AA simplex-vertex representatives."} pca_rng <- extendrange(c(score_plot_data$PC1, score_plot_data$PC2)) ggplot(score_plot_data, aes(PC1, PC2, colour = region)) + geom_hline(yintercept = 0, colour = "grey85") + geom_vline(xintercept = 0, colour = "grey85") + geom_point(size = 1.8) + geom_text( aes(label = label), na.rm = TRUE, nudge_y = diff(pca_rng) * 0.035, size = 3, show.legend = FALSE ) + coord_equal(xlim = pca_rng, ylim = pca_rng) + scale_colour_manual(values = region_cols, drop = FALSE) + labs(title = "step_pca()", x = "PC1", y = "PC2", colour = "Region") + theme_minimal(base_size = 10) + theme(panel.grid.minor = element_blank()) ``` ```{r aa-simplex-plot, fig.width = 6, fig.height = 5, fig.cap = "AA composition weights for USArrests on the archetype simplex. Labels match the PCA plot for comparison."} ggtern(score_plot_data, aes(A1, A2, A3, colour = region)) + geom_point(size = 1.8) + geom_text( aes(label = label), na.rm = TRUE, position = "identity", size = 3, show.legend = FALSE ) + scale_colour_manual(values = region_cols, drop = FALSE) + labs( title = "step_archetypes()", x = "A1", y = "A2", z = "A3", colour = "Region" ) + theme_minimal(base_size = 10) + theme_showarrows() + theme( legend.position = "bottom", plot.margin = margin(8, 16, 8, 16), tern.panel.mask.show = FALSE ) ``` ## Machine Learning Note AA preprocessing returns scores that are barycentric coordinates: they are non-negative and sum to one for each observation. Tree models and other flexible learners can often use these scores directly. Linear, distance-based, or Gaussian models may be sensitive to the simplex constraint and the induced collinearity of the raw weights. In those settings, a typical transformation is the isometric log-ratio (`ilr`), which maps compositions from the simplex to ordinary Euclidean space. Log-ratio transforms require strictly positive compositions, so apply an appropriate zero-handling strategy first if any composition weights are zero. See package `compositions` for details on working with compositional data, including `ilr` and other transformations that can be used after `bake()` when a downstream model expects ordinary Euclidean predictors. ## Declaring Tunable Parameters `step_archetypes()` declares `num_comp` and `delta` as tunable recipe parameters. This lets the step participate in a larger tidymodels tuning workflow when the rest of the modeling pipeline supplies resampling, metrics, and an outcome. ```{r tunable} rec_tune <- rec_base |> step_archetypes( all_numeric(), num_comp = tune(), delta = tune() ) tunable(rec_tune$steps[[2L]]) ``` For supervised modeling pipelines, use `step_archetypes()` inside the broader tidymodels workflow and tune `num_comp` alongside the downstream model parameters.