---
title: "Introduction to RFmstate"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to RFmstate}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Overview

**RFmstate** fits cause-specific random survival forests for multistate
survival analysis with covariate-adjusted transition probabilities computed
via product-integral. For each transient state, competing transitions are
modeled by separate random forests, and patient-specific transition
probability matrices are assembled from the predicted cumulative hazards
using the product-integral formula. The package also provides a standalone
Aalen-Johansen nonparametric estimator as a covariate-free baseline. This
approach is particularly suited for clinical trial data where patients
transition through discrete health states (e.g., response, progression,
death) with right censoring and competing transitions.

The package provides:

- State space and transition structure definition
- Wide-to-long data conversion for counting-process format
- Cause-specific random forest fitting per origin state
- Transition probability matrices via product-integral of predicted
  cumulative hazards
- Aalen-Johansen nonparametric estimation (covariate-free baseline)
- Per-transition feature importance
- Bias-variance diagnostics with Brier score and concordance index
- Comprehensive visualizations

## Quick Start

### 1. Define the Multistate Structure

```{r define-states}
library(RFmstate)

# Use the built-in clinical trial structure
ms <- clinical_states()
print(ms)
```

Or define a custom structure with any number of states (3 or more):

```{r custom-states, eval=FALSE}
# A simple 3-state illness-death model
ms_simple <- define_multistate(
  state_names = c("Healthy", "Sick", "Dead"),
  absorbing = "Dead",
  transitions = list(
    Healthy = c("Sick", "Dead"),
    Sick = c("Dead")
  )
)

# A 4-state model with recovery
ms_recovery <- define_multistate(
  state_names = c("Healthy", "Sick", "Recovered", "Dead"),
  absorbing = "Dead",
  transitions = list(
    Healthy = c("Sick", "Dead"),
    Sick = c("Recovered", "Dead"),
    Recovered = c("Dead")
  )
)
```

This pipeline works with any defined multistate structure.

### 2. Simulate Data

```{r simulate}
set.seed(42)
dat <- sim_clinical_data(n = 300, structure = ms)
head(dat)
```

### 3. Prepare Multistate Data

Convert wide-format data to long format:

```{r prepare}
msdata <- prepare_data(
  data = dat, id = "ID", structure = ms,
  time_map = list(
    Responded = "time_Responded",
    Unresponded = "time_Unresponded",
    Stabilized = "time_Stabilized",
    Progressed = "time_Progressed",
    Death = "time_Death"
  ),
  censor_col = "time_censored",
  covariates = c("age", "sex", "BMI", "treatment")
)
print(msdata)
```

### 4. Aalen-Johansen Nonparametric Baseline

Compute nonparametric (covariate-free) transition probability estimates:

```{r aj}
aj <- aalen_johansen(msdata)
print(aj)
```

```{r aj-plot, fig.cap="State occupation probabilities from Aalen-Johansen estimator"}
plot(aj, type = "state_occupation")
```

```{r aj-hazard, fig.cap="Nelson-Aalen cumulative hazards"}
plot(aj, type = "cumulative_hazard")
```

### 5. Fit Random Forest Model

```{r fit}
fit <- rfmstate(
  msdata,
  covariates = c("age", "sex", "BMI", "treatment"),
  num.trees = 200,
  seed = 42
)
print(fit)
```

### 6. Model Summary

```{r summary}
s <- summary(fit)
```

### 7. Feature Importance

```{r importance, fig.cap="Feature importance per transition"}
imp <- importance(fit)
print(imp)
plot(imp, type = "barplot")
```

```{r importance-heat, fig.cap="Feature importance heatmap"}
plot(imp, type = "heatmap")
```

### 8. Predict for New Patients

```{r predict, fig.cap="Predicted state occupation for two patient profiles"}
newdata <- data.frame(
  age = c(50, 70),
  sex = c(0, 1),
  BMI = c(24, 32),
  treatment = c(1, 0)
)

pred <- predict(fit, newdata = newdata, times = seq(10, 365, by = 10))

# Plot for patient 1 (young, treated)
plot(pred, type = "state_occupation", subject = 1)

# Plot for patient 2 (older, untreated)
plot(pred, type = "state_occupation", subject = 2)
```

### 9. Diagnostics

```{r diagnostics}
diag <- diagnose(fit)
print(diag)
```

```{r diag-brier, fig.cap="Time-dependent Brier score"}
plot(diag, type = "brier")
```

```{r diag-concordance, fig.cap="Concordance index per transition"}
plot(diag, type = "concordance")
```

```{r diag-bv, fig.cap="Bias-variance decomposition"}
plot(diag, type = "bias_variance")
```

### 10. Transition Diagram

```{r diagram, fig.cap="Transition diagram with event counts"}
plot_transition_diagram(ms, msdata)
```

## Methodology

### Product-Integral Framework

Both the nonparametric (Aalen-Johansen) and the random forest methods in
this package compute transition probability matrices $P(s,t)$ using the
product-integral:

$$P(s,t) = \prod_{s < u \leq t} (I + dA(u))$$

where $dA(u)$ is a matrix of hazard increments at time $u$, with off-diagonal
entries $dA_{hj}(u)$ representing the $h \to j$ transition hazard increment
and diagonal entries $dA_{hh}(u) = -\sum_{j \neq h} dA_{hj}(u)$. The two
methods differ in how the hazard increments $dA(u)$ are estimated.

### Aalen-Johansen Estimator (Nonparametric Baseline)

The Aalen-Johansen (AJ) estimator is the nonparametric generalization of the
Kaplan-Meier estimator to multistate models under the Markov assumption. It
estimates hazard increments from the data directly via the Nelson-Aalen
formula:

$$d\hat{A}_{hj}(u) = \frac{dN_{hj}(u)}{Y_h(u)}$$

where $dN_{hj}(u)$ counts the observed $h \to j$ transitions at time $u$ and
$Y_h(u)$ is the number at risk in state $h$ just before time $u$. This
provides population-level transition probabilities without covariate
adjustment and serves as a covariate-free baseline in the package.

### Random Forest Multistate Approach

For covariate-adjusted predictions, we decompose the multistate model into
per-origin-state competing risks problems:

1. **For each transient state** $h$, identify all outgoing transitions
2. **Fit a cause-specific RSF**: For each destination state $j$, fit a
   random survival forest treating transition $h \to j$ as the event of
   interest and all other transitions as censored
3. **Extract cumulative hazards**: Convert each forest's predicted survival
   function $\hat{S}_{hj}(t|\mathbf{x})$ to a cumulative hazard via
   $\hat{H}_{hj}(t|\mathbf{x}) = -\log \hat{S}_{hj}(t|\mathbf{x})$
4. **Assemble via product-integral**: Compute hazard increments from the
   predicted cumulative hazards and apply the product-integral formula to
   obtain the full transition probability matrix $P(s,t|\mathbf{x})$

This approach leverages the flexibility of random forests to capture
nonlinear covariate effects and interactions while maintaining the
interpretability of the multistate framework. The product-integral step
ensures that the resulting transition probability matrices are coherent
(rows sum to one, non-negative entries).

### Diagnostics

- **OOB Error**: Out-of-bag prediction error from the random forest ensemble
- **C-index**: Concordance between predicted risk and observed event ordering
- **Brier Score**: Calibrated prediction error combining bias and variance
- **Bias-Variance Decomposition**: Systematic vs. random components of
  prediction error

## References

- Aalen, O.O. & Johansen, S. (1978). An empirical transition matrix for
  non-homogeneous Markov chains based on censored observations.
  *Scandinavian Journal of Statistics*, 5(3), 141-150.
- Ishwaran, H. et al. (2008). Random survival forests.
  *Annals of Applied Statistics*, 2(3), 841-860.
- Putter, H., Fiocco, M. & Geskus, R.B. (2007). Tutorial in biostatistics:
  Competing risks and multi-state models. *Statistics in Medicine*, 26,
  2389-2430.