---
title: "Tidymodels Workflows with yaap"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Tidymodels Workflows with yaap}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.width = 7,
    fig.height = 5,
    fig.align = "center",
    warning = FALSE
)
set.seed(42)
par(bty = "n")
```

## Overview

`yaap` fits archetypal analysis (AA) models. In a tidymodels workflow,
it is best understood as an unsupervised decomposition or feature-extraction
tool, closer in role to `recipes::step_pca()` than to a supervised `parsnip`
model specification. `yaap` learns archetype coordinates and sample
composition weights, then exposes those results through interfaces that work
naturally with the tidymodels ecosystem:

- `step_archetypes()` is a `recipes` step that learns archetypes during
  `prep()` and returns archetype composition weights during `bake()`.

This vignette uses only lightweight pieces of the ecosystem. Because AA is an
unsupervised representation method, the examples focus on preprocessing,
feature extraction, and model inspection rather than on `parsnip` fitting.

```{r libraries, message = FALSE}
library(yaap)
library(generics)
library(ggplot2)
library(ggtern)
library(recipes)
library(tune)
```

The examples use the built-in `USArrests` data, the same data used in the
`recipes::step_pca()` examples. The four numeric variables are on different
scales, so the recipes below normalize them before fitting either PCA or AA.
State regions are used only for visualization.

```{r load-usarrests}
arrests <- USArrests
state <- rownames(arrests)
region <- state.region[match(state, state.name)]
analysis_K <- 3
region_cols <- c(
    "Northeast" = "#0072B2",
    "South" = "#D55E00",
    "North Central" = "#009E73",
    "West" = "#CC79A7"
)

head(data.frame(state = state, region = region, arrests, row.names = NULL))
```

The examples below use `K = 3` archetypes so that the resulting composition
weights can be inspected on a two-dimensional simplex.

## Using Archetypes in Recipes

`step_archetypes()` turns AA into a preprocessing step. During `prep()`, the
step fits archetypes on the selected numeric predictors. During `bake()`, it
projects data onto the learned archetype simplex and returns one composition
column per archetype.

The recipes argument is called `num_comp`, following tidymodels naming
conventions, but it is the same quantity as `K`.

```{r recipe-fit}
rec_base <- recipe(~ ., data = arrests) |>
    step_normalize(all_numeric())

rec <- rec_base |>
    step_archetypes(
        all_numeric(),
        num_comp = analysis_K,
        seed = 42,
        options = list(
            init = "furthest_sum",
            max_iter = 100,
            tol_r2 = 0
        )
    )

rec_prep <- prep(rec, training = arrests)
rec_prep
```

```{r recipe-bake}
baked <- bake(rec_prep, new_data = arrests)
head(data.frame(state = state, region = region, baked, row.names = NULL))

aa_cols <- paste0("A", seq_len(analysis_K))
baked[, aa_cols] |> as.matrix() |> rowSums() |> range()
```

By default, the original numeric predictors are removed and replaced by
composition columns named `A1`, `A2`, and so on. Set
`keep_original_cols = TRUE` when both the normalized predictors and archetype
weights should be retained.

```{r recipe-keep-original}
rec_keep <- rec_base |>
    step_archetypes(
        all_numeric(),
        num_comp = analysis_K,
        keep_original_cols = TRUE,
        seed = 42,
        options = list(
            max_iter = 100,
            tol_r2 = 0
        )
    ) |>
    prep(training = arrests)

arrests_with_originals <- data.frame(
    state = state,
    region = region,
    bake(rec_keep, new_data = arrests),
    row.names = NULL
)

head(arrests_with_originals)
```

## Comparing `step_archetypes()` with `step_pca()`

The closest tidymodels analogy for `yaap` is `recipes::step_pca()`: both steps
learn an unsupervised representation of the numeric predictors during `prep()`.
The outputs have different meanings. PCA returns orthogonal score columns such
as `PC1`, `PC2`, and `PC3`; AA returns composition weights such as `A1`, `A2`,
and `A3` that are non-negative and sum to one for each state.

The two recipes below are trained in parallel on the same normalized arrest
variables.

```{r compare-pca-aa}
rec_aa <- rec_base |>
    step_archetypes(
        all_numeric(),
        num_comp = analysis_K,
        seed = 42,
        options = list(
            init = "furthest_sum",
            max_iter = 100,
            tol_r2 = 0
        )
    ) |>
    prep(training = arrests)

rec_pca <- rec_base |>
    step_pca(all_numeric(), num_comp = analysis_K) |>
    prep(training = arrests)

aa_scores <- bake(rec_aa, new_data = arrests)
pca_scores <- bake(rec_pca, new_data = arrests)

score_data <- data.frame(
    state = state,
    region = region,
    pca_scores,
    aa_scores,
    row.names = NULL
)
head(score_data)
```

```{r comparison-labels, include=FALSE}
aa_cols <- grep("^A[0-9]+$", names(aa_scores), value = TRUE)
aa_mat <- as.matrix(score_data[, aa_cols])

simplex_label_ix <- vapply(seq_along(aa_cols), function(j) {
    which.max(aa_mat[, j])
}, integer(1L))
pca_label_ix <- unique(c(
    which.min(score_data$PC1),
    which.max(score_data$PC1),
    which.min(score_data$PC2),
    which.max(score_data$PC2)
))
label_ix <- unique(c(pca_label_ix, simplex_label_ix))

score_plot_data <- transform(
    score_data,
    label = ifelse(seq_len(nrow(score_data)) %in% label_ix, state, NA)
)
```

The PCA step returns orthogonal score columns. The AA step returns convex
composition weights over three estimated archetypal states. Inspecting both
trained steps highlights the difference: PCA loadings describe directions of
variation, while AA coordinates describe extreme profiles in the normalized
feature space.

```{r compare-pca-aa-tidy}
tidy(rec_pca, number = 2, type = "variance")
tidy(rec_aa, number = 2)
```

Next, the scores from both steps are plotted. The PCA scores are plotted in the
usual Cartesian coordinate system. The AA composition weights are plotted on a
ternary simplex, where each vertex corresponds to one archetype.  The same
states are labeled in both plots: states selected either because they are
extreme on one of the first two principal-component axes or because they are
closest to an archetype vertex.

```{r pca-score-plot, fig.width = 6, fig.height = 5, fig.cap = "PCA scores for USArrests. Labels mark states selected as PCA extremes or AA simplex-vertex representatives."}
pca_rng <- extendrange(c(score_plot_data$PC1, score_plot_data$PC2))
ggplot(score_plot_data, aes(PC1, PC2, colour = region)) +
    geom_hline(yintercept = 0, colour = "grey85") +
    geom_vline(xintercept = 0, colour = "grey85") +
    geom_point(size = 1.8) +
    geom_text(
        aes(label = label),
        na.rm = TRUE,
        nudge_y = diff(pca_rng) * 0.035,
        size = 3,
        show.legend = FALSE
    ) +
    coord_equal(xlim = pca_rng, ylim = pca_rng) +
    scale_colour_manual(values = region_cols, drop = FALSE) +
    labs(title = "step_pca()", x = "PC1", y = "PC2", colour = "Region") +
    theme_minimal(base_size = 10) +
    theme(panel.grid.minor = element_blank())
```

```{r aa-simplex-plot, fig.width = 6, fig.height = 5, fig.cap = "AA composition weights for USArrests on the archetype simplex. Labels match the PCA plot for comparison."}
ggtern(score_plot_data, aes(A1, A2, A3, colour = region)) +
    geom_point(size = 1.8) +
    geom_text(
        aes(label = label),
        na.rm = TRUE,
        position = "identity",
        size = 3,
        show.legend = FALSE
    ) +
    scale_colour_manual(values = region_cols, drop = FALSE) +
    labs(
        title = "step_archetypes()",
        x = "A1",
        y = "A2",
        z = "A3",
        colour = "Region"
    ) +
    theme_minimal(base_size = 10) +
    theme_showarrows() +
    theme(
        legend.position = "bottom",
        plot.margin = margin(8, 16, 8, 16),
        tern.panel.mask.show = FALSE
    )
```

## Machine Learning Note

AA preprocessing returns scores that are barycentric coordinates: they are
non-negative and sum to one for each observation. Tree models and other flexible
learners can often use these scores directly. Linear, distance-based, or
Gaussian models may be sensitive to the simplex constraint and the induced
collinearity of the raw weights. In those settings, a typical transformation is
the isometric log-ratio (`ilr`), which maps compositions from the simplex to
ordinary Euclidean space. Log-ratio transforms require strictly positive
compositions, so apply an appropriate zero-handling strategy first if any
composition weights are zero. See package `compositions` for details on working
with compositional data, including `ilr` and other transformations that can be
used after `bake()` when a downstream model expects ordinary Euclidean
predictors.

## Declaring Tunable Parameters

`step_archetypes()` declares `num_comp` and `delta` as tunable recipe
parameters. This lets the step participate in a larger tidymodels tuning
workflow when the rest of the modeling pipeline supplies resampling, metrics,
and an outcome.

```{r tunable}
rec_tune <- rec_base |>
    step_archetypes(
        all_numeric(),
        num_comp = tune(),
        delta = tune()
    )

tunable(rec_tune$steps[[2L]])
```

For supervised modeling pipelines, use `step_archetypes()` inside the broader
tidymodels workflow and tune `num_comp` alongside the downstream model
parameters.