--- title: "Introduction to RFmstate" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to RFmstate} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ## Overview **RFmstate** fits cause-specific random survival forests for multistate survival analysis with covariate-adjusted transition probabilities computed via product-integral. For each transient state, competing transitions are modeled by separate random forests, and patient-specific transition probability matrices are assembled from the predicted cumulative hazards using the product-integral formula. The package also provides a standalone Aalen-Johansen nonparametric estimator as a covariate-free baseline. This approach is particularly suited for clinical trial data where patients transition through discrete health states (e.g., response, progression, death) with right censoring and competing transitions. The package provides: - State space and transition structure definition - Wide-to-long data conversion for counting-process format - Cause-specific random forest fitting per origin state - Transition probability matrices via product-integral of predicted cumulative hazards - Aalen-Johansen nonparametric estimation (covariate-free baseline) - Per-transition feature importance - Bias-variance diagnostics with Brier score and concordance index - Comprehensive visualizations ## Quick Start ### 1. Define the Multistate Structure ```{r define-states} library(RFmstate) # Use the built-in clinical trial structure ms <- clinical_states() print(ms) ``` Or define a custom structure with any number of states (3 or more): ```{r custom-states, eval=FALSE} # A simple 3-state illness-death model ms_simple <- define_multistate( state_names = c("Healthy", "Sick", "Dead"), absorbing = "Dead", transitions = list( Healthy = c("Sick", "Dead"), Sick = c("Dead") ) ) # A 4-state model with recovery ms_recovery <- define_multistate( state_names = c("Healthy", "Sick", "Recovered", "Dead"), absorbing = "Dead", transitions = list( Healthy = c("Sick", "Dead"), Sick = c("Recovered", "Dead"), Recovered = c("Dead") ) ) ``` This pipeline works with any defined multistate structure. ### 2. Simulate Data ```{r simulate} set.seed(42) dat <- sim_clinical_data(n = 300, structure = ms) head(dat) ``` ### 3. Prepare Multistate Data Convert wide-format data to long format: ```{r prepare} msdata <- prepare_data( data = dat, id = "ID", structure = ms, time_map = list( Responded = "time_Responded", Unresponded = "time_Unresponded", Stabilized = "time_Stabilized", Progressed = "time_Progressed", Death = "time_Death" ), censor_col = "time_censored", covariates = c("age", "sex", "BMI", "treatment") ) print(msdata) ``` ### 4. Aalen-Johansen Nonparametric Baseline Compute nonparametric (covariate-free) transition probability estimates: ```{r aj} aj <- aalen_johansen(msdata) print(aj) ``` ```{r aj-plot, fig.cap="State occupation probabilities from Aalen-Johansen estimator"} plot(aj, type = "state_occupation") ``` ```{r aj-hazard, fig.cap="Nelson-Aalen cumulative hazards"} plot(aj, type = "cumulative_hazard") ``` ### 5. Fit Random Forest Model ```{r fit} fit <- rfmstate( msdata, covariates = c("age", "sex", "BMI", "treatment"), num.trees = 200, seed = 42 ) print(fit) ``` ### 6. Model Summary ```{r summary} s <- summary(fit) ``` ### 7. Feature Importance ```{r importance, fig.cap="Feature importance per transition"} imp <- importance(fit) print(imp) plot(imp, type = "barplot") ``` ```{r importance-heat, fig.cap="Feature importance heatmap"} plot(imp, type = "heatmap") ``` ### 8. Predict for New Patients ```{r predict, fig.cap="Predicted state occupation for two patient profiles"} newdata <- data.frame( age = c(50, 70), sex = c(0, 1), BMI = c(24, 32), treatment = c(1, 0) ) pred <- predict(fit, newdata = newdata, times = seq(10, 365, by = 10)) # Plot for patient 1 (young, treated) plot(pred, type = "state_occupation", subject = 1) # Plot for patient 2 (older, untreated) plot(pred, type = "state_occupation", subject = 2) ``` ### 9. Diagnostics ```{r diagnostics} diag <- diagnose(fit) print(diag) ``` ```{r diag-brier, fig.cap="Time-dependent Brier score"} plot(diag, type = "brier") ``` ```{r diag-concordance, fig.cap="Concordance index per transition"} plot(diag, type = "concordance") ``` ```{r diag-bv, fig.cap="Bias-variance decomposition"} plot(diag, type = "bias_variance") ``` ### 10. Transition Diagram ```{r diagram, fig.cap="Transition diagram with event counts"} plot_transition_diagram(ms, msdata) ``` ## Methodology ### Product-Integral Framework Both the nonparametric (Aalen-Johansen) and the random forest methods in this package compute transition probability matrices $P(s,t)$ using the product-integral: $$P(s,t) = \prod_{s < u \leq t} (I + dA(u))$$ where $dA(u)$ is a matrix of hazard increments at time $u$, with off-diagonal entries $dA_{hj}(u)$ representing the $h \to j$ transition hazard increment and diagonal entries $dA_{hh}(u) = -\sum_{j \neq h} dA_{hj}(u)$. The two methods differ in how the hazard increments $dA(u)$ are estimated. ### Aalen-Johansen Estimator (Nonparametric Baseline) The Aalen-Johansen (AJ) estimator is the nonparametric generalization of the Kaplan-Meier estimator to multistate models under the Markov assumption. It estimates hazard increments from the data directly via the Nelson-Aalen formula: $$d\hat{A}_{hj}(u) = \frac{dN_{hj}(u)}{Y_h(u)}$$ where $dN_{hj}(u)$ counts the observed $h \to j$ transitions at time $u$ and $Y_h(u)$ is the number at risk in state $h$ just before time $u$. This provides population-level transition probabilities without covariate adjustment and serves as a covariate-free baseline in the package. ### Random Forest Multistate Approach For covariate-adjusted predictions, we decompose the multistate model into per-origin-state competing risks problems: 1. **For each transient state** $h$, identify all outgoing transitions 2. **Fit a cause-specific RSF**: For each destination state $j$, fit a random survival forest treating transition $h \to j$ as the event of interest and all other transitions as censored 3. **Extract cumulative hazards**: Convert each forest's predicted survival function $\hat{S}_{hj}(t|\mathbf{x})$ to a cumulative hazard via $\hat{H}_{hj}(t|\mathbf{x}) = -\log \hat{S}_{hj}(t|\mathbf{x})$ 4. **Assemble via product-integral**: Compute hazard increments from the predicted cumulative hazards and apply the product-integral formula to obtain the full transition probability matrix $P(s,t|\mathbf{x})$ This approach leverages the flexibility of random forests to capture nonlinear covariate effects and interactions while maintaining the interpretability of the multistate framework. The product-integral step ensures that the resulting transition probability matrices are coherent (rows sum to one, non-negative entries). ### Diagnostics - **OOB Error**: Out-of-bag prediction error from the random forest ensemble - **C-index**: Concordance between predicted risk and observed event ordering - **Brier Score**: Calibrated prediction error combining bias and variance - **Bias-Variance Decomposition**: Systematic vs. random components of prediction error ## References - Aalen, O.O. & Johansen, S. (1978). An empirical transition matrix for non-homogeneous Markov chains based on censored observations. *Scandinavian Journal of Statistics*, 5(3), 141-150. - Ishwaran, H. et al. (2008). Random survival forests. *Annals of Applied Statistics*, 2(3), 841-860. - Putter, H., Fiocco, M. & Geskus, R.B. (2007). Tutorial in biostatistics: Competing risks and multi-state models. *Statistics in Medicine*, 26, 2389-2430.