---
title: "Optimization Path Example: Portfolio Selection via Variance Minimization"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Optimization Path Example: Portfolio Selection via Variance Minimization}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

ROOT is a **general functional optimization framework**: by supplying a custom objective function, you can apply it to any problem where the goal is to learn an interpretable binary weight function $w(X) \in \{0, 1\}$ over groups of units defined by covariates of interest.

In general optimization mode (`generalizability_path = FALSE`), the user provides a `data.frame` with an optional column `vsq` (a per-unit variance proxy, or the outcome to minimize) and any covariates to split on. ROOT searches over tree-structured weight functions to minimize the supplied objective, then returns a Rashomon set of near-optimal trees and a single summary tree characterizing the final weight assignments. By default, the summary tree is determined by majority vote; users can also specify their own voting function.

This vignette demonstrates ROOT in general optimization mode using a **portfolio selection** problem: given a universe of 100 simulated assets characterized by their market beta and annualized volatility, ROOT learns an interpretable rule for which assets to include ($w = 1$) in order to minimize portfolio return variance.

---

## Problem Setup

### Why portfolio selection?

Constructing a minimum-variance portfolio can be framed as an optimization problem. The standard approach produces continuous weights, which can be hard to interpret or communicate. ROOT offers a complementary perspective: it returns a **binary inclusion rule**, a sparse decision tree that describes, in plain language, which types of assets belong in the low-variance portfolio.

### Mapping to ROOT's framework

The default global objective function that ROOT minimizes is:

$$L(w, D) = \sqrt{\frac{\sum_i w_i \cdot v_i^2}{\left(\sum_i w_i\right)^2}}$$

where $v_i^2$ (`vsq`) is a pseudo-outcome: the per-unit variance proxy in this optimization example. In the portfolio context, we set `vsq` to the historical return variance of each asset. ROOT then finds the binary weight function $w(X) \in \{0, 1\}$, expressed as a decision tree over asset features such as beta and volatility, that minimizes this quantity. Users can also supply their own global objective function to minimize.

---

## Simulating the Data

We simulate 100 assets with two continuous features, annualized volatility and market beta, plus a categorical sector label. Returns are generated from a single-factor market model, and each asset's return variance is computed from 1,000 simulated return observations.

```{r simulate}
library(ROOT)

set.seed(123)
n_assets <- 100

# Asset features
volatility <- runif(n_assets, 0.05, 0.40)  # annualized volatility
beta       <- runif(n_assets, 0.5, 1.8)    # market beta
sector     <- sample(c("Tech", "Finance", "Energy", "Health"),
                     n_assets, replace = TRUE)

# Simulate returns: r_i = beta_i * r_market + epsilon_i
# (daily scale: annualized volatility is divided by sqrt(252))
market <- rnorm(1000, 0.0005, 0.01)
returns_mat <- sapply(seq_len(n_assets), function(i) {
  beta[i] * market + rnorm(1000, 0, volatility[i] / sqrt(252))
})

# Per-asset return variance (the objective proxy ROOT will minimize)
vsq <- apply(returns_mat, 2, var)

dat_portfolio <- data.frame(
  vsq    = vsq,
  vol    = volatility,
  beta   = beta,
  sector = as.integer(factor(sector))  # sector encoded as an integer code
)

head(dat_portfolio)
```

The `vsq` column is recognized by ROOT's default objective function as the per-unit variance proxy. The columns `vol`, `beta`, and `sector` are the splitting features available to the tree.

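To make the default objective concrete, the chunk below evaluates the formula from "Mapping to ROOT's framework" by hand on a toy vector. It is a minimal sketch that assumes the formula as written in this vignette; `default_objective_sketch()` is a hypothetical helper introduced here for illustration and is not part of the ROOT API.

```{r objective-sketch}
# Hypothetical helper (not a ROOT function): evaluates
# sqrt(sum(w * vsq) / sum(w)^2) for a given 0/1 weight vector.
default_objective_sketch <- function(w, vsq) {
  stopifnot(length(w) == length(vsq), all(w %in% c(0, 1)))
  if (sum(w) == 0) return(Inf)  # nothing selected: treat as worst case
  sqrt(sum(w * vsq) / sum(w)^2)
}

# Toy data: four units, the last one much noisier than the rest
toy_vsq <- c(1, 1, 1, 9)

obj_all  <- default_objective_sketch(c(1, 1, 1, 1), toy_vsq)  # keep every unit
obj_drop <- default_objective_sketch(c(1, 1, 1, 0), toy_vsq)  # drop the noisy unit

c(all = obj_all, drop_noisy = obj_drop)
```

Here the objective falls from $\sqrt{12/16} \approx 0.866$ to $\sqrt{3/9} \approx 0.577$ when the noisy unit is dropped. Because the denominator shrinks with every exclusion, removing a unit only pays off when its `vsq` is large relative to the others: with four equal values of 1, dropping one unit would raise the objective from 0.5 to about 0.577.
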
### Distribution of asset risk

```{r risk-plot, fig.width = 6, fig.height = 4, fig.alt = "Scatter plot of asset beta vs volatility"}
plot(
  dat_portfolio$beta, dat_portfolio$vol,
  xlab = "Market beta",
  ylab = "Annualized volatility",
  pch  = 16,
  col  = "#4E79A7AA",
  main = "Asset universe: volatility vs beta"
)
```

We expect ROOT to identify the high-beta, high-volatility corner of this space as the region to exclude.

---

## Fitting ROOT

```{r fit, message = FALSE, warning = FALSE}
portfolio_fit <- ROOT(
  data        = dat_portfolio,
  num_trees   = 20,
  top_k_trees = TRUE,
  k           = 10,
  seed        = 42
)
```

ROOT grows 20 trees and selects the 10 with the lowest objective value as the Rashomon set. Final asset weights $w_{\text{opt}}$ are determined by majority vote across those 10 trees.

---

## Inspecting the Results

### Print summary

```{r print}
print(portfolio_fit)
```

### Detailed summary

```{r summary}
summary(portfolio_fit)
```

The **Diagnostics** section confirms that 20 trees were grown, 10 were selected into the Rashomon set, and 96% of assets received $w_{\text{opt}} = 1$ (included in the portfolio). Only the 4 most risk-concentrated assets are screened out.

---

## Visualizing the Characterized Tree

```{r plot, fig.width = 7, fig.height = 5, fig.alt = "Characterized tree for portfolio selection"}
plot(portfolio_fit)
```

The tree encodes a simple, actionable rule:

- **`beta < 1.7`**: include the asset unconditionally (88 assets).
- **`beta >= 1.7` and `vol < 0.33`**: include the asset despite its high market sensitivity, because its volatility stays below the threshold (8 assets).
- **`beta >= 1.7` and `vol >= 0.33`**: exclude the asset, since high market sensitivity combined with high volatility drives up portfolio variance (4 assets, $w = 0$).

---

## Examining the Weights

The final weight assignments are stored in `portfolio_fit$D_rash$w_opt`. We can compare the included and excluded assets directly.


```{r weights}
dat_portfolio$w_opt <- portfolio_fit$D_rash$w_opt

# Summary statistics by inclusion decision
included <- dat_portfolio[dat_portfolio$w_opt == 1, ]
excluded <- dat_portfolio[dat_portfolio$w_opt == 0, ]

cat("Included assets (w = 1):", nrow(included), "\n")
cat("  Mean beta:       ", round(mean(included$beta), 3), "\n")
cat("  Mean volatility: ", round(mean(included$vol), 3), "\n\n")

cat("Excluded assets (w = 0):", nrow(excluded), "\n")
cat("  Mean beta:       ", round(mean(excluded$beta), 3), "\n")
cat("  Mean volatility: ", round(mean(excluded$vol), 3), "\n")
```

The excluded assets have substantially higher beta and volatility than the included ones, confirming that ROOT targets the risk-concentrated corner of the asset universe.

### Visualizing the inclusion decision

```{r weights-plot, fig.width = 6, fig.height = 4, fig.alt = "Asset universe with inclusion decisions highlighted"}
plot(
  dat_portfolio$beta, dat_portfolio$vol,
  xlab = "Market beta",
  ylab = "Annualized volatility",
  pch  = ifelse(dat_portfolio$w_opt == 1, 16, 4),
  col  = ifelse(dat_portfolio$w_opt == 1, "#4E79A7", "#E15759"),
  main = "Portfolio inclusion decisions"
)
legend(
  "topleft",
  legend = c("w = 1 (included)", "w = 0 (excluded)"),
  pch    = c(16, 4),
  col    = c("#4E79A7", "#E15759"),
  bty    = "n"
)
```

---

## Using a Custom Objective Function

ROOT's general optimization mode is not limited to the default variance objective. You can supply any function of the form `function(D) -> numeric`, where `D` is the working data frame with a column `w` containing the current weight assignments.

For example, suppose we want to minimize the **interquartile range** of the per-asset return variances among the included assets, rather than the default variance objective.
We can define a custom objective:

```{r custom-obj, message = FALSE, warning = FALSE}
iqr_objective <- function(D) {
  w <- D$w
  if (sum(w) == 0) return(Inf)  # an empty portfolio is never acceptable
  # IQR of the per-asset variance proxy, computed over included assets only
  included_vsq <- D$vsq[w == 1]
  diff(quantile(included_vsq, probs = c(0.25, 0.75), names = FALSE))
}

# Drop `w_opt` (added in the previous section) so the earlier fit's weights
# are not offered to the tree as a splitting feature
portfolio_fit_iqr <- ROOT(
  data                = dat_portfolio[, c("vsq", "vol", "beta", "sector")],
  global_objective_fn = iqr_objective,
  num_trees           = 20,
  top_k_trees         = TRUE,
  k                   = 10,
  seed                = 112
)
```

The custom objective illustrates ROOT's flexibility: any scalar-valued function of the weighted dataset can serve as the optimization target, and the resulting characterized tree remains interpretable.

---

## Key Parameters

| Parameter             | Role                                                                                        | Default      |
|:----------------------|:--------------------------------------------------------------------------------------------|:-------------|
| `num_trees`           | Number of trees to grow in the forest                                                       | `10`         |
| `top_k_trees`         | If `TRUE`, select the top `k` trees by objective value                                      | `FALSE`      |
| `k`                   | Rashomon set size when `top_k_trees = TRUE`                                                 | `10`         |
| `cutoff`              | Rashomon threshold when `top_k_trees = FALSE`; `"baseline"` uses objective at $w \equiv 1$ | `"baseline"` |
| `vote_threshold`      | Fraction of Rashomon-set trees that must vote $w = 1$ for inclusion                         | `2/3`        |
| `global_objective_fn` | Custom objective `function(D) -> numeric`; if `NULL`, uses default variance objective       | `NULL`       |
| `seed`                | Random seed for reproducibility                                                             | `NULL`       |
| `feature_est`         | Feature importance method for split selection (`"Ridge"`, `"GBM"`, or custom)               | `"Ridge"`    |
| `leaf_proba`          | Controls tree depth: larger values increase the probability of stopping at a leaf          | `0.25`       |

---

## Reference

Parikh, H., Ross, R. K., Stuart, E., & Rudolph, K. E. (2025). Who Are We Missing?: A Principled Approach to Characterizing the Underrepresented Population. *Journal of the American Statistical Association*, 120(551), 1414–1423.