--- title: "Random Forest Survival Analysis with ggRandomForests" author: "John Ehrlinger" date: today format: html: toc: true toc-depth: 3 html-math-method: mathjax code-fold: false bibliography: ggRandomForests.bib vignette: > %\VignetteIndexEntry{Random Forest Survival Analysis with ggRandomForests} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} --- ::: {.callout-warning} ## Work in progress This vignette is under active development. Code examples and narrative may change before the next release. ::: ```{r setup, include=FALSE} knitr::opts_chunk$set( fig.align = "center", fig.width = 7, fig.height = 5, message = FALSE, warning = FALSE, comment = "#>" ) options(mc.cores = 1, rf.cores = 1) ``` # Introduction Random forests [@Breiman:2001] are a non-parametric ensemble method that requires no distributional assumptions on the relation between covariates and the response. Random survival forests (RSF) [@Ishwaran:2007a; @Ishwaran:2008] extend the method to right-censored, time-to-event data by growing trees with a log-rank splitting rule and aggregating Kaplan--Meier estimates within terminal nodes. The **randomForestSRC** package [@Ishwaran:RFSRC:2014] provides a unified implementation for survival, regression, and classification forests. **ggRandomForests** extracts tidy data objects from `rfsrc` fits and renders them with **ggplot2** [@Wickham:2009], making it straightforward to explore how a forest is constructed, which variables matter, and how the response depends on individual predictors. This vignette demonstrates a complete random survival forest workflow on the primary biliary cirrhosis (PBC) data set [@fleming:1991]: 1. **Data preparation and exploration** --- cleaning, EDA, Kaplan--Meier curves 2. **Growing the forest** --- fitting an RSF, checking convergence and OOB error 3. **Variable selection** --- VIMP and minimal depth via `max.subtree()` 4. **Dependence plots** --- variable dependence and partial dependence via `gg_variable()` and `gg_partial_rfsrc()` 5. **Variable interactions** --- conditioning plots and interactive 3-D partial dependence surfaces with **plotly** ```{r packages} library(ggplot2) library(dplyr) library(tidyr) library(randomForestSRC) library(survival) if (requireNamespace("ggRandomForests", quietly = TRUE)) { library(ggRandomForests) } else if (requireNamespace("pkgload", quietly = TRUE)) { pkgload::load_all(export_all = FALSE, helpers = FALSE, attach_testthat = FALSE) } else { stop("Install ggRandomForests (or pkgload for dev builds) to render this vignette.") } theme_set(theme_bw()) # Plotting constants event_marks <- c(1, 4) event_labels <- c("Censored", "Death") event_colors <- c("steelblue", "firebrick") ``` # Data: Primary Biliary Cirrhosis (PBC) The PBC study consists of 424 patients referred to Mayo Clinic between 1974 and 1984, of whom 312 were randomized into a trial of D-penicillamine (DPCA) versus placebo. The data is described in @fleming:1991 Chapter 0.2, with a proportional hazards model developed in Chapter 4.4. We use the copy bundled with **randomForestSRC**. ```{r data-load} data("pbc", package = "randomForestSRC") ``` ## Data cleaning We convert `days` to `years`, recode `treatment` as a factor, and coerce low-cardinality numeric columns (five or fewer unique values, including binary 0/1 indicators) to factors. We avoid converting binary columns to `logical` because `randomForestSRC::partial.rfsrc()` does not handle logical predictors correctly in survival forests. ```{r data-clean} pbc <- pbc |> mutate( years = days / 365.25, age = age / 365.25, treatment = factor( ifelse(treatment == 1, "DPCA", ifelse(treatment == 2, "Placebo", NA)), levels = c("DPCA", "Placebo") ) ) |> select(-days) # Low-cardinality numerics (including binary 0/1) to factor. # NOTE: do NOT convert to logical — partial.rfsrc() fails with logical # predictors in survival forests (randomForestSRC <= 3.5.1). # Exclude the response columns (status, years) from conversion. resp_cols <- c("status", "years") for (nm in setdiff(names(pbc), resp_cols)) { v <- pbc[[nm]] if (is.numeric(v) && !is.factor(v) && length(unique(v[!is.na(v)])) <= 5) { pbc[[nm]] <- factor(v) } } ``` ```{r variable-labels} # Human-readable labels for plot axes st_labs <- c( status = "Death Event", treatment = "Treatment", age = "Age (years)", sex = "Female", ascites = "Ascites", hepato = "Hepatomegaly", spiders = "Spiders", edema = "Edema (0, 0.5, 1)", bili = "Serum Bilirubin (mg/dl)", chol = "Serum Cholesterol (mg/dl)", albumin = "Albumin (gm/dl)", copper = "Urine Copper (ug/day)", alk.phos = "Alkaline Phosphatase (U/liter)", ast = "SGOT (U/ml)", trig = "Triglycerides (mg/dl)", platelet = "Platelets (per cubic ml/1000)", prothrombin = "Prothrombin Time (sec)", stage = "Histologic Stage", years = "Follow-up Time (years)" ) ``` ## Exploratory data analysis Good practice before modeling: scan categorical variables as stacked histograms over follow-up time, and continuous variables as scatter plots colored by event status. ```{r eda-categorical, fig.height=4, fig.cap="EDA for categorical variables. Bar color indicates factor level; white = missing."} cls <- sapply(pbc, class) cnt_idx <- which(cls %in% c("numeric", "integer")) fct_idx <- setdiff(seq_along(pbc), cnt_idx) fct_idx <- union(fct_idx, which(names(pbc) == "years")) dta_cat <- suppressWarnings( pbc[, fct_idx] |> pivot_longer(-years, names_to = "variable", values_to = "value", values_transform = list(value = as.character)) ) ggplot(dta_cat, aes(x = years, fill = value)) + geom_histogram(color = "black", binwidth = 1) + labs(y = "", x = st_labs["years"]) + scale_fill_brewer(palette = "RdBu", na.value = "white") + facet_wrap(~variable, scales = "free_y", nrow = 2) + theme(legend.position = "none") ``` ```{r eda-continuous, fig.height=4, fig.cap="EDA for continuous variables. Points colored by death event; rug marks indicate missing values."} cnt_idx <- union(cnt_idx, which(names(pbc) == "status")) dta_cont <- pbc[, cnt_idx] |> pivot_longer(c(-years, -status), names_to = "variable", values_to = "value") ggplot(dta_cont |> filter(!is.na(value)), aes(x = years, y = value, color = factor(status), shape = factor(status))) + geom_point(alpha = 0.4) + geom_rug(data = dta_cont |> filter(is.na(value)), color = "grey50") + labs(y = "", x = st_labs["years"], color = "Death", shape = "Death") + scale_color_manual(values = event_colors) + scale_shape_manual(values = event_marks) + facet_wrap(~variable, scales = "free_y", ncol = 4) + theme(legend.position = c(0.8, 0.2)) ``` Look for patterns of missingness (white bars, rug marks) and extreme values that fall outside the biological range. ## Kaplan--Meier survival by treatment We restrict to the 312 trial patients and construct KM curves with `gg_survival`. ```{r km-survival, fig.cap="KM survival estimates by treatment group with 95% confidence bands."} pbc_trial <- pbc |> filter(!is.na(treatment)) pbc_test <- pbc |> filter(is.na(treatment)) gg_km <- gg_survival(interval = "years", censor = "status", by = "treatment", data = pbc_trial, conf.int = 0.95) plot(gg_km) + labs(y = "Survival Probability", x = "Time (years)", color = "Treatment", fill = "Treatment") + theme(legend.position = c(0.2, 0.2)) + coord_cartesian(ylim = c(0, 1.01)) ``` The curves largely overlap, consistent with the original finding that DPCA offered no clear survival benefit over placebo [@fleming:1991]. ```{r km-cumhaz, fig.cap="KM cumulative hazard estimates by treatment group."} plot(gg_km, type = "cum_haz") + labs(y = "Cumulative Hazard", x = "Time (years)", color = "Treatment", fill = "Treatment") + theme(legend.position = c(0.2, 0.8)) + coord_cartesian(ylim = c(-0.02, 1.22)) ``` We can also stratify on continuous variables. Here we reproduce the bilirubin groupings from @fleming:1991 Figure 4.4.2: ```{r km-bili, fig.width=6, fig.cap="KM survival stratified by bilirubin groups."} pbc_bili <- pbc_trial |> mutate(bili_grp = cut(bili, breaks = c(0, 0.8, 1.3, 3.4, 29))) plot(gg_survival(interval = "years", censor = "status", by = "bili_grp", data = pbc_bili), error = "none") + labs(y = "Survival Probability", x = "Time (years)", color = "Bilirubin") ``` Higher bilirubin strongly predicts worse survival --- an effect the random forest will rediscover without any prior specification. # Growing a Random Survival Forest Several predictors in the PBC trial data contain missing values (cholesterol, copper, triglycerides, and others). We handle these with a two-step approach: first impute using `impute.rfsrc()`, which uses the random forest proximity structure to fill in missing values, then fit the survival forest on the complete imputed data. This keeps the fitted forest object free of imputation state, which is required for `partial.rfsrc()` to work correctly. ```{r rfsrc-fit} # Step 1: impute missing values via random forest proximity pbc_imputed <- impute.rfsrc(Surv(years, status) ~ ., data = pbc_trial, ntree = 500, nimpute = 2) # Step 2: grow the survival forest on the complete imputed data rfsrc_pbc <- rfsrc(Surv(years, status) ~ ., data = pbc_imputed, nsplit = 10, tree.err = TRUE, importance = TRUE) ``` The forest grew `r rfsrc_pbc$ntree` trees, splitting on `r rfsrc_pbc$mtry` randomly selected candidate variables at each node, and stopping at a minimum terminal node size of `r rfsrc_pbc$nodesize`. ## OOB error convergence ```{r error-plot, fig.height=4, fig.cap="OOB prediction error vs. number of trees."} plot(gg_error(rfsrc_pbc)) ``` The error stabilizes well before 1000 trees, indicating the forest is large enough for reliable predictions. ## OOB predicted survival ```{r rfsrc-predicted, fig.cap="OOB predicted survival curves. Blue = censored, red = death events."} gg_rsf <- plot(gg_rfsrc(rfsrc_pbc), alpha = 0.2) + scale_color_manual(values = event_colors) + theme(legend.position = "none") + labs(y = "Survival Probability", x = "Time (years)") + coord_cartesian(ylim = c(-0.01, 1.01)) gg_rsf ``` Each curve represents one patient's OOB prediction, extended to the last follow-up time. Red (death event) curves generally fall faster, confirming the forest discriminates between risk groups. ```{r rfsrc-by-treatment, fig.cap="Median predicted survival by treatment group with 95% confidence bands."} plot(gg_rfsrc(rfsrc_pbc, by = "treatment")) + theme(legend.position = c(0.2, 0.2)) + labs(y = "Survival Probability", x = "Time (years)") + coord_cartesian(ylim = c(-0.01, 1.01)) ``` ## Test set predictions The non-trial patients (`pbc_test`) have substantial missing data. `predict.rfsrc()` handles these transparently at prediction time via `na.action = "na.impute"` — this is distinct from training-time imputation and does not affect the fitted forest object. ```{r predict-test, fig.cap="Predicted survival for non-trial patients (test set)."} rfsrc_pbc_test <- predict(rfsrc_pbc, newdata = pbc_test, na.action = "na.impute", importance = TRUE) plot(gg_rfsrc(rfsrc_pbc_test), alpha = 0.2) + scale_color_manual(values = event_colors) + theme(legend.position = "none") + labs(y = "Survival Probability", x = "Time (years)") + coord_cartesian(ylim = c(-0.01, 1.01)) ``` Because the training curves use OOB estimates, both plots represent out-of-sample predictions and are directly comparable. # Variable Selection Random forest uses all available predictors. To understand which matter most, we examine variable importance (VIMP) and minimal depth. ## Variable importance (VIMP) VIMP measures the increase in prediction error when a variable is randomly permuted. Large positive values mean the variable is essential for accuracy; negative values suggest noise is more informative than the variable. ```{r vimp-plot, fig.width=6, fig.cap="Variable importance ranking. Blue = positive VIMP, red = negative."} plot(gg_vimp(rfsrc_pbc), lbls = st_labs) + theme(legend.position = c(0.8, 0.2)) + labs(fill = "VIMP > 0") ``` Bilirubin ranks highest, followed by copper, prothrombin time, albumin, and age --- closely matching the variables selected in the @fleming:1991 proportional hazards model. ## Minimal depth Minimal depth [@Ishwaran:2010] ranks variables by how close to the root node they first split, on average. Variables that partition large portions of the population early are considered most important. ```{r varsel} md_pbc <- max.subtree(rfsrc_pbc) ``` The `max.subtree()` function computes minimal depth for each variable. The threshold is `r round(md_pbc$threshold, 2)`, selecting `r length(md_pbc$topvars)` variables: `r paste(md_pbc$topvars, collapse = ", ")`. Both selection methods agree on the key predictors: `bili`, `albumin`, `copper`, `prothrombin`, and `age`. We add `edema` (selected by the @fleming:1991 model) for the remainder of the analysis. ```{r select-vars} xvar <- c("bili", "albumin", "copper", "prothrombin", "age") xvar_cat <- "edema" xvar_all <- c(xvar, xvar_cat) ``` # Variable Dependence ## Variable dependence plots Variable dependence shows each patient's predicted survival at a given time horizon plotted against a predictor of interest. Points are colored by event status; a loess smooth indicates the trend. ```{r vardep-bili, fig.height=4, fig.cap="Variable dependence of survival at 1 and 3 years on bilirubin."} gg_v <- gg_variable(rfsrc_pbc, time = c(1, 3), time.labels = c("1 Year", "3 Years")) plot(gg_v, xvar = "bili", alpha = 0.4) + labs(y = "Survival", x = st_labs["bili"]) + theme(legend.position = "none") + scale_color_manual(values = event_colors, labels = event_labels) + scale_shape_manual(values = event_marks, labels = event_labels) + coord_cartesian(ylim = c(-0.01, 1.01)) ``` Survival drops sharply with increasing bilirubin, and the 3-year curve drops further than the 1-year curve, suggesting a non-proportional hazards effect. ```{r vardep-panel, fig.height=5, fig.cap="Variable dependence at 1 and 3 years for continuous predictors."} plot(gg_v, xvar = xvar[-1], panel = TRUE, alpha = 0.4) + labs(y = "Survival") + theme(legend.position = "none") + scale_color_manual(values = event_colors, labels = event_labels) + scale_shape_manual(values = event_marks, labels = event_labels) + coord_cartesian(ylim = c(-0.05, 1.05)) ``` The plots confirm survival increases with albumin and decreases with copper, prothrombin time, and age. The divergence between time curves for `copper` further supports a non-proportional hazards mechanism. ```{r vardep-categorical, fig.height=4, fig.cap="Variable dependence on edema (categorical)."} plot(gg_v, xvar = xvar_cat, alpha = 0.4) + labs(y = "Survival") + theme(legend.position = "none") + scale_color_manual(values = event_colors, labels = event_labels) + scale_shape_manual(values = event_marks, labels = event_labels) + coord_cartesian(ylim = c(-0.01, 1.02)) ``` Patients with edema = 1 (edema despite diuretics) have markedly lower predicted survival. ## Partial dependence ::: {.callout-warning} **Known issue (draft):** `randomForestSRC::partial.rfsrc()` currently fails for survival forests in randomForestSRC ≥ 3.3. The partial dependence and interactive surface sections below will show an error until this upstream bug is resolved. All other sections of this vignette are fully functional. ::: Partial dependence integrates out the effects of other covariates, giving a risk-adjusted view of how each predictor influences the response [@Friedman:2000]. We use `gg_partial_rfsrc()` which calls `randomForestSRC::partial.rfsrc()` directly and returns a `gg_partial_rfsrc` object. For survival forests, the result includes a `time` column so `plot(pd)` produces one curve per predictor value over time, faceted by variable name. ```{r partial-dep, error=TRUE, fig.height=5, fig.cap="Partial dependence of survival at approximately 1 and 3 years on continuous predictors."} # partial.rfsrc() requires times that match the model's time.interest grid; # gg_partial_rfsrc() snaps the requested values to the nearest observed times. ti <- rfsrc_pbc$time.interest t1yr <- ti[which.min(abs(ti - 1))] t3yr <- ti[which.min(abs(ti - 3))] pd <- gg_partial_rfsrc(rfsrc_pbc, xvar.names = xvar, partial.time = c(t1yr, t3yr)) # Quick S3 plot — survival forests produce time-series curves per predictor value plot(pd) ``` For a publication-ready layout with custom colour scale, access `pd$continuous` directly: ```{r partial-dep-custom, error=TRUE, fig.height=5, fig.cap="Partial dependence (custom styling)."} ggplot(pd$continuous, aes(x = x, y = yhat, color = factor(round(time, 2)), group = factor(time))) + geom_line(linewidth = 1) + facet_wrap(~name, scales = "free_x") + labs(y = "Predicted Survival", x = "", color = "Year") + scale_color_manual(values = setNames(c("steelblue", "firebrick"), as.character(round(c(t1yr, t3yr), 2)))) + theme_bw() ``` The partial dependence curves at approximately 1 and 3 years confirm the variable dependence findings and support the log-transforms used in the @fleming:1991 model for `bili`, `albumin`, and `prothrombin`. The divergence between time curves for `bili` and `copper` supports non-proportional hazards. ## Conditional dependence To investigate interactions, we can condition variable dependence on group membership in another variable. Here we examine the dependence of survival on bilirubin, stratified by edema group. ```{r coplot-edema, fig.width=7, fig.height=4, fig.cap="Variable dependence on bilirubin, conditional on edema group."} gg_v1 <- gg_variable(rfsrc_pbc, time = 1) gg_v1$edema <- paste("edema =", gg_v1$edema) plot(gg_v1, xvar = "bili", alpha = 0.5) + labs(y = "Survival at 1 Year", x = st_labs["bili"]) + theme(legend.position = "none") + scale_color_manual(values = event_colors, labels = event_labels) + scale_shape_manual(values = event_marks, labels = event_labels) + facet_grid(~edema) + coord_cartesian(ylim = c(-0.01, 1.01)) ``` The decreasing trend with bilirubin holds across all edema groups, but survival is uniformly lower in the edema = 1 panel, confirming an additive effect. We can also condition on a continuous variable by binning into quantile groups: ```{r coplot-albumin, fig.width=7, fig.height=5, fig.cap="Variable dependence on bilirubin, conditional on albumin groups."} albumin_cts <- quantile_pts(gg_v1$albumin, groups = 6, intervals = TRUE) gg_v1$albumin_grp <- cut(gg_v1$albumin, breaks = albumin_cts) levels(gg_v1$albumin_grp) <- paste("albumin =", levels(gg_v1$albumin_grp)) plot(gg_v1, xvar = "bili", alpha = 0.5) + labs(y = "Survival at 1 Year", x = st_labs["bili"]) + theme(legend.position = "none") + scale_color_manual(values = event_colors, labels = event_labels) + scale_shape_manual(values = event_marks, labels = event_labels) + facet_wrap(~albumin_grp) + coord_cartesian(ylim = c(-0.01, 1.01)) ``` The effect of bilirubin attenuates at higher albumin levels, suggesting an interaction between these two liver function markers. # Interactive Partial Dependence Surfaces For a richer view of the interaction between bilirubin and albumin, we construct a partial dependence surface. We compute partial dependence on a grid of 25 albumin values, each evaluated at 25 bilirubin points. ```{r surface-data, error=TRUE, cache=FALSE} # Create grid of albumin values alb_grid <- quantile_pts(pbc_trial$albumin, groups = 25) # For each albumin value, compute partial dependence on bili at ~1 year surface_list <- lapply(alb_grid, function(alb_val) { newx <- pbc_trial[, rfsrc_pbc$xvar.names, drop = FALSE] newx$albumin <- alb_val pd_alb <- gg_partial_rfsrc(rfsrc_pbc, xvar.names = "bili", newx = newx, partial.time = t1yr) df <- pd_alb$continuous df$albumin <- alb_val df }) surface_df <- bind_rows(surface_list) ``` ```{r plotly-surface, error=TRUE, fig.cap="Interactive partial dependence surface: survival as a function of bilirubin and albumin."} if (!exists("surface_df")) { message("surface_df not available — skipping plotly surface (see surface-data chunk error above).") } else if (requireNamespace("plotly", quietly = TRUE)) { # Reshape for surface library(plotly) surface_wide <- surface_df |> select(bili = x, albumin, survival = yhat) |> arrange(albumin, bili) # Create matrix form bili_vals <- sort(unique(surface_wide$bili)) alb_vals <- sort(unique(surface_wide$albumin)) z_matrix <- matrix(surface_wide$survival, nrow = length(alb_vals), ncol = length(bili_vals), byrow = TRUE) plot_ly(x = bili_vals, y = alb_vals, z = z_matrix) |> add_surface(colorscale = "Viridis", showscale = TRUE) |> layout( scene = list( xaxis = list(title = "Bilirubin"), yaxis = list(title = "Albumin"), zaxis = list(title = "Survival") ) ) } else { message("Install the plotly package for interactive 3D surfaces.") # Fallback: contour plot with ggplot2 ggplot(surface_df, aes(x = x, y = albumin, fill = yhat)) + geom_tile() + scale_fill_viridis_c(name = "Survival") + labs(x = "Bilirubin", y = "Albumin") + theme_bw() } ``` The surface shows that survival is highest when bilirubin is low and albumin is high (upper-left corner), and drops steeply as bilirubin increases or albumin decreases. The non-planar shape of the surface --- particularly the steep gradient at low albumin and high bilirubin --- confirms the interaction detected in the conditional plots. # Conclusion This vignette demonstrated a complete random survival forest analysis using **randomForestSRC** and **ggRandomForests**: - **Kaplan--Meier curves** via `gg_survival()` confirmed no treatment effect in the PBC trial, consistent with @fleming:1991. - **OOB error** via `gg_error()` showed the forest stabilized quickly. - **VIMP** via `gg_vimp()` and **minimal depth** via `max.subtree()` both identified bilirubin, albumin, copper, prothrombin, and age as key predictors --- matching the proportional hazards model. - **Variable dependence** via `gg_variable()` revealed non-proportional hazards effects for bilirubin and copper. - **Partial dependence** via `gg_partial_rfsrc()` provided risk-adjusted confirmation and supported the log-transforms used in parametric models. - **Conditional plots** and **interactive surfaces** exposed the bilirubin--albumin interaction. The **ggRandomForests** design separates data extraction from plotting: every `gg_*()` function returns a tidy data frame that can be plotted with the package's `plot()` methods or fed directly into custom `ggplot2` workflows. # References