---
title: "Variable selection without an outcome: unsupervised varPro"
author: "John Ehrlinger"
date: today
format:
  html:
    fig-format: png
    fig-dpi: 96
    toc: true
    toc-depth: 2
    html-math-method: mathjax
    code-fold: false
bibliography: ggRandomForests.bib
vignette: >
  %\VignetteIndexEntry{Variable selection without an outcome: unsupervised varPro}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  message    = FALSE,
  warning    = FALSE,
  fig.width  = 7,
  fig.height = 4.5,
  cache      = TRUE,
  cache.path = "uvarpro_cache/"
)
options(mc.cores = 1, rf.cores = 1)
```

```{r libs}
library(ggplot2)

# Match the pattern used by the other vignettes: try the installed
# package first, fall back to pkgload::load_all() for the R CMD check
# vignette rebuild where the package isn't yet on .libPaths(). All varPro
# calls below are ::-qualified, so no library(varPro) is needed.
if (requireNamespace("ggRandomForests", quietly = TRUE)) {
  library(ggRandomForests)
} else if (requireNamespace("pkgload", quietly = TRUE)) {
  pkgload::load_all(export_all = FALSE, helpers = FALSE,
                    attach_testthat = FALSE)
} else {
  stop("Install ggRandomForests (or pkgload for dev builds) to render this vignette.")
}
```

# Importance without a response

Most measures of variable importance start from a question: which predictors
help explain *this* outcome? Permutation VIMP, the varPro release rules, the
per-rule lasso weights behind `gg_beta_varpro()` — all of them score a variable
by how much it moves a response `y`. The companion
[varPro vignette](varpro.html) walks through that supervised toolkit end to end.

But sometimes there is no `y`. You have a matrix of predictors and you want to
know which columns carry the structure of the data: which variables define its
shape, which travel together, and which are close to noise. That is the job of
unsupervised varPro. `varPro::uvarpro()` grows a forest on the predictor matrix
alone, with no response, and scores each variable by how much entropy it
contributes to the regions the forest carves out [@Lu2024varpro]. "Important"
here does not mean useful for prediction. It means a variable helps reconstruct
the feature space in a way the others cannot.

This is a short vignette. It walks through the three `gg_*` views of a single
`uvarpro()` fit, and each one answers a different question about that fit: what
depends on what, what ranks highest, and where to draw the line between signal
and noise.

# One fit, three views

We'll use `mtcars`. Its columns are all numeric, and several of them measure
closely related things — displacement, horsepower, weight, and cylinder count
all track engine size — so the unsupervised structure is easy to read.
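
A quick base-R check makes that overlap concrete before any forest is grown.
This is only a sanity check on pairwise Pearson correlations, not part of
varPro; the names `engine_vars` and `engine_cor` are ours.

```{r engine_cor}
# Pairwise Pearson correlations among the four engine-size columns.
engine_vars <- c("disp", "hp", "wt", "cyl")
engine_cor  <- cor(mtcars[, engine_vars])
round(engine_cor, 2)
```

Every pair is positively correlated at above 0.6, which is the kind of
redundancy the unsupervised views should pick up.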

```{r fit}
set.seed(1)
u <- varPro::uvarpro(mtcars, ntree = 50)
```

Notice what `uvarpro()` was handed: the data frame, and nothing else. No
formula, no outcome column singled out. All three views below read off this one
fit. Two of them, `gg_beta_uvarpro()` and `gg_sdependent()`, rest on the same
`get.beta.entropy()` matrix. It is the only intermediate result worth reusing,
so we compute it once and pass it to both rather than paying for it twice:

```{r beta_fit}
beta_fit <- varPro::get.beta.entropy(u)
```

## What depends on what: `gg_udependent()`

`gg_udependent()` reads cross-variable structure off the fit and draws it as a
network: nodes are variables, edges are dependencies above a configurable
threshold. The picture is built with `ggraph`, which lives in `Suggests` rather
than `Imports`, so install it if you want this view.

```{r udep, eval = requireNamespace("ggraph", quietly = TRUE)}
plot(gg_udependent(u))
```

Clusters of mutually connected variables are worth a second look. Because the
fit has no response, a cluster tells you these columns are correlated in feature
space regardless of whether any of them would matter for a prediction. That is
useful even when you do have a downstream model: a variable can be important for
prediction and still sit in a tight cluster with a near-duplicate, in which case
parsimony may favour dropping one member of the cluster at little cost.
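
The "edges above a threshold" idea can be sketched in base R. This is a toy
under stated assumptions: absolute Pearson correlation stands in for the
dependency measure, and the cutoff of 0.8 is arbitrary. Neither is what
`gg_udependent()` computes from the forest, and the names `toy_dep`,
`toy_cutoff`, and `toy_edges` are hypothetical.

```{r toy_network}
# Stand-in dependency matrix: absolute correlation.
# (An assumption for illustration, not varPro's dependency measure.)
toy_dep    <- abs(cor(mtcars))
toy_cutoff <- 0.8  # arbitrary illustrative threshold

# Keep each unordered pair once (upper triangle) when it clears the cutoff.
keep <- upper.tri(toy_dep) & toy_dep > toy_cutoff
idx  <- which(keep, arr.ind = TRUE)

# One row per edge: the two endpoints and the strength of the link.
toy_edges <- data.frame(
  from   = rownames(toy_dep)[idx[, "row"]],
  to     = colnames(toy_dep)[idx[, "col"]],
  weight = toy_dep[idx]
)
toy_edges[order(-toy_edges$weight), ]
```

Raising the cutoff prunes edges and breaks clusters apart; lowering it does the
opposite. The same trade-off applies to whatever threshold the real network
view uses.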

## What ranks highest: `gg_beta_uvarpro()`

The network shows *structure*; `gg_beta_uvarpro()` turns the same fit into a
*ranking*. It is the unsupervised analogue of `gg_beta_varpro()`: from the
entropy matrix it aggregates the per-region lasso coefficients into a mean
absolute weight per variable (`beta_mean`), orders them most-important first,
and flags the ones above a selection cutoff. Here we hand it the `beta_fit` we
already computed.

```{r beta}
plot(gg_beta_uvarpro(u, beta_fit = beta_fit))
```

Read it as the unsupervised counterpart of a VIMP bar chart. The tall bars are
the variables that most define the structure of the predictor space. Paired with
the network above, you get both halves of the story: *which* variables carry the
most unsupervised signal, and *how* they group.
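
The aggregation step described in this section is simple to sketch on made-up
numbers. The matrix below is a toy with one row per region and one column per
variable. It is not the structure `get.beta.entropy()` returns, the variable
names `x1` to `x3` are placeholders, and the cutoff (the mean of the scores) is
a stand-in rather than the package's selection rule.

```{r toy_beta}
# Toy coefficient matrix: rows are regions, columns are variables.
# All values are made up for illustration.
toy_beta <- rbind(
  c(x1 =  0.90, x2 = -0.40, x3 =  0.05),
  c(x1 = -0.70, x2 =  0.60, x3 =  0.00),
  c(x1 =  0.80, x2 = -0.20, x3 = -0.10)
)

# Mean absolute weight per variable, most important first.
toy_beta_mean <- sort(colMeans(abs(toy_beta)), decreasing = TRUE)

# Placeholder cutoff: flag variables scoring above the average score.
toy_selected <- toy_beta_mean > mean(toy_beta_mean)

data.frame(
  variable  = names(toy_beta_mean),
  beta_mean = unname(toy_beta_mean),
  selected  = unname(toy_selected)
)
```

Taking absolute values before averaging matters: a coefficient that flips sign
between regions, like `x1` here, would otherwise partly cancel and look less
important than it is.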

## Where to draw the line: `gg_sdependent()`

A ranking invites the obvious next question: where is the cut? Which variables
are signal, and which are noise? `gg_sdependent()` answers that narrower
question off the same fit. It wraps `varPro::sdependent()` and returns one row
per candidate variable — an importance score, its degree in the dependency
graph, and a `signal` flag — drawn as a ranked lollipop.

```{r sdep}
plot(gg_sdependent(u, beta_fit = beta_fit))
```

Where `gg_beta_uvarpro()` ranks *all* the variables, `gg_sdependent()` makes the
cut explicit, separating the columns the unsupervised analysis treats as signal
from the ones it treats as noise.

# Reading the three together

The three views are one workflow, not three unrelated plots. Start with
`gg_udependent()` to see the structure, rank it with `gg_beta_uvarpro()`, then
let `gg_sdependent()` draw the signal-versus-noise line. They all derive from the
one `uvarpro()` fit, and two of them from the one `get.beta.entropy()` matrix, so
computing that matrix once (as we did above) and passing it in keeps the whole
sequence cheap.

For the supervised side of varPro — VIMP, partial dependence, per-rule lasso
refinement, local importance, and anomaly scoring — see the companion
[varPro vignette](varpro.html).

# References