--- title: "Variable selection without an outcome: unsupervised varPro" author: "John Ehrlinger" date: today format: html: fig-format: png fig-dpi: 96 toc: true toc-depth: 2 html-math-method: mathjax code-fold: false bibliography: ggRandomForests.bib vignette: > %\VignetteIndexEntry{Variable selection without an outcome: unsupervised varPro} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( message = FALSE, warning = FALSE, fig.width = 7, fig.height = 4.5, cache = TRUE, cache.path = "uvarpro_cache/" ) options(mc.cores = 1, rf.cores = 1) ``` ```{r libs} library(ggplot2) # Match the pattern used by the other vignettes: try the installed # package first, fall back to pkgload::load_all() for the R CMD check # vignette rebuild where the package isn't yet on .libPaths(). All varPro # calls below are ::-qualified, so no library(varPro) is needed. if (requireNamespace("ggRandomForests", quietly = TRUE)) { library(ggRandomForests) } else if (requireNamespace("pkgload", quietly = TRUE)) { pkgload::load_all(export_all = FALSE, helpers = FALSE, attach_testthat = FALSE) } else { stop("Install ggRandomForests (or pkgload for dev builds) to render this vignette.") } ``` # Importance without a response Most measures of variable importance start from a question: which predictors help explain *this* outcome? Permutation VIMP, the varPro release rules, the per-rule lasso weights behind `gg_beta_varpro()` — all of them score a variable by how much it moves a response `y`. The companion [varPro vignette](varpro.html) walks that supervised toolkit end to end. But sometimes there is no `y`. You have a matrix of predictors and you want to know which columns carry the structure of the data: which variables define its shape, which travel together, and which are close to noise. That is the job of unsupervised varPro. `varPro::uvarpro()` grows a forest on the predictor matrix alone, with no response, and scores each variable by how much entropy it contributes to the regions the forest carves out [@Lu2024varpro]. "Important" here does not mean useful for prediction. It means a variable helps reconstruct the feature space that the others cannot. This is a short vignette. It walks the three `gg_*` views of a single `uvarpro()` fit, and each one answers a different question about that fit: what depends on what, what ranks highest, and where to draw the line between signal and noise. # One fit, three views We'll use `mtcars`. Its columns are all numeric, and several of them measure closely related things — displacement, horsepower, weight, and cylinder count all track engine size — so the unsupervised structure is easy to read. ```{r fit} set.seed(1) u <- varPro::uvarpro(mtcars, ntree = 50) ``` Notice what `uvarpro()` was handed: the data frame, and nothing else. No formula, no outcome column singled out. All three views below read off this one fit. Two of them rest on the same `get.beta.entropy()` matrix, which is the only part worth caching, so we compute it once and pass it to both rather than paying for it twice: ```{r beta_fit} beta_fit <- varPro::get.beta.entropy(u) ``` ## What depends on what: `gg_udependent()` `gg_udependent()` reads cross-variable structure off the fit and draws it as a network: nodes are variables, edges are dependencies above a configurable threshold. The picture is built with `ggraph`, which lives in `Suggests` rather than `Imports`, so install it if you want this view. ```{r udep, eval = requireNamespace("ggraph", quietly = TRUE)} plot(gg_udependent(u)) ``` Clusters of mutually connected variables are worth a second look. Because the fit has no response, a cluster tells you these columns are correlated in feature space regardless of whether any of them would matter for a prediction. When you do have a downstream model, that is useful on its own: a variable can be important for prediction and still sit in a tight cluster with a near-duplicate, and in that case parsimony may favour dropping one member of the cluster without losing much. ## What ranks highest: `gg_beta_uvarpro()` The network shows *structure*; `gg_beta_uvarpro()` turns the same fit into a *ranking*. It is the unsupervised analogue of `gg_beta_varpro()`: from the entropy matrix it aggregates the per-region lasso coefficients into a mean absolute weight per variable (`beta_mean`), orders them most-important first, and flags the ones above a selection cutoff. Here we hand it the `beta_fit` we already computed. ```{r beta} plot(gg_beta_uvarpro(u, beta_fit = beta_fit)) ``` Read it as the unsupervised counterpart of a VIMP bar chart. The tall bars are the variables that most define the structure of the predictor space. Paired with the network above, you get both halves of the story: *which* variables carry the most unsupervised signal, and *how* they group. ## Where to draw the line: `gg_sdependent()` A ranking invites the obvious next question: where is the cut? Which variables are signal, and which are noise? `gg_sdependent()` answers that narrower question off the same fit. It wraps `varPro::sdependent()` and returns one row per candidate variable — an importance score, its degree in the dependency graph, and a `signal` flag — drawn as a ranked lollipop. ```{r sdep} plot(gg_sdependent(u, beta_fit = beta_fit)) ``` Where `gg_beta_uvarpro()` ranks *all* the variables, `gg_sdependent()` makes the cut explicit, separating the columns the unsupervised analysis treats as signal from the ones it treats as noise. # Reading the three together The three views are one workflow, not three unrelated plots. Start with `gg_udependent()` to see the structure, rank it with `gg_beta_uvarpro()`, then let `gg_sdependent()` draw the signal-versus-noise line. They all derive from the one `uvarpro()` fit, and two of them from the one `get.beta.entropy()` matrix, so computing that matrix once (as we did above) and passing it in keeps the whole sequence cheap. For the supervised side of varPro — VIMP, partial dependence, per-rule lasso refinement, local importance, and anomaly scoring — see the companion [varPro vignette](varpro.html). # References