The purpose of the spect package is to provide a simple, flexible interface for modeling time-to-event data using a discrete-time approach that makes use of existing binary classification tools. Below is an instructive example using the Mayo clinic pbc data set from the “survival” package.
Note that the spect_train function requires the name of the column in the modeling data containing the event indicator variable (and the values in that column must be either zero or one) and the name of the column containing the time to the event for each individual. There are no positional constraints for these columns. All other columns will be treated as inputs to the model.
Below, we take a subset of the pbc input columns and calculate the event indicator and survival time in years.
set.seed(42)
data(pbc, package = "survival")
event_column <- "event_indicator"
time_column <- "survival_time"
pbc_complete <- pbc[complete.cases(pbc),]
source_data <- pbc_complete[,c(7:20)]
source_data[[event_column]] <- ifelse(pbc_complete$status == 2, 1, 0)
source_data[[time_column]] <- pbc_complete$time / 365.25
head(source_data)
#> ascites hepato spiders edema bili chol albumin copper alk.phos ast trig
#> 1 1 1 1 1.0 14.5 261 2.60 156 1718.0 137.95 172
#> 2 0 1 1 0.0 1.1 302 4.14 54 7394.8 113.52 88
#> 3 0 0 0 0.5 1.4 176 3.48 210 516.0 96.10 55
#> 4 0 1 1 0.5 1.8 244 2.54 64 6121.8 60.63 92
#> 5 0 1 1 0.0 3.4 279 3.53 143 671.0 113.15 72
#> 7 0 1 0 0.0 1.0 322 4.09 52 824.0 60.45 213
#> platelet protime stage event_indicator survival_time
#> 1 190 12.2 4 1 1.095140
#> 2 221 10.6 3 0 12.320329
#> 3 151 12.0 4 1 2.770705
#> 4 183 10.3 4 1 5.270363
#> 5 136 10.9 3 0 4.117728
#> 7 204 9.7 3 0 5.015743
Next, save a subset of the data to demonstrate predictions later.
Then, call the spect_train() function. Note that the cores parameter is not set and use_parallel is set to FALSE, specifically to avoid excessive stress on testing systems, however, it’s generally best to choose an appropriate number of cores or left at zero to allow spect to dynamically choose based on system availability.
result <- spect_train(model_algorithm = "rf"
, base_learner_list = c("glm", "svmLinear")
, use_parallel = FALSE
, modeling_data = train_data
, event_indicator_var = event_column
, survival_time_var = time_column
, obs_window = 12)
#> INFO [2025-04-06 20:25:33] Splitting test/train data at 0.200000/0.800000...
#> INFO [2025-04-06 20:25:33] Creating person-period data set...
#> INFO [2025-04-06 20:25:33] Creating caret train control...
#> INFO [2025-04-06 20:25:33] Training base models glm, svmLinear using repeatedcv method with 10 resamples and 3 kfold repeats
#> Warning in train.default(x, y, weights = w, ...): The metric "ROC" was not in
#> the result set. Accuracy will be used instead.
#> Loading required package: ggplot2
#> Loading required package: lattice
#> Warning in train.default(x, y, weights = w, ...): The metric "ROC" was not in
#> the result set. Accuracy will be used instead.
#> maximum number of iterations reached 0.002668032 0.002649569maximum number of iterations reached 0.001200352 0.001196387maximum number of iterations reached 0.004128868 0.004085047maximum number of iterations reached 0.005894409 0.005808958maximum number of iterations reached -0.0001578453 -0.0001578894maximum number of iterations reached 0.002781525 0.002766878maximum number of iterations reached 0.00382699 0.003795592maximum number of iterations reached 0.00423685 0.004202033maximum number of iterations reached 0.005475957 0.005409323maximum number of iterations reached 0.002354984 0.002342749maximum number of iterations reached 0.00305497 0.003008607maximum number of iterations reached 0.003272896 0.003243133maximum number of iterations reached 0.003677112 0.003654026maximum number of iterations reached 0.004014497 0.003970751maximum number of iterations reached 0.004131751 0.004049032maximum number of iterations reached 0.003069128 0.003055374maximum number of iterations reached 0.001908117 0.001902267maximum number of iterations reached 0.004899541 0.004792288maximum number of iterations reached 0.002402241 0.002386853maximum number of iterations reached 0.00264648 0.002631756maximum number of iterations reached 0.001807357 0.001800219maximum number of iterations reached 0.002053572 0.002043815maximum number of iterations reached 0.002518704 0.002503169maximum number of iterations reached 0.002345695 0.002334784maximum number of iterations reached 0.0008497007 0.0008480762maximum number of iterations reached 0.001148698 0.001143859maximum number of iterations reached 0.005440482 0.005331076maximum number of iterations reached 0.004314559 0.004262836maximum number of iterations reached 0.0007320378 0.000731329maximum number of iterations reached 0.0006210769 0.0006201139maximum number of iterations reached 0.001264579 0.001260133INFO [2025-04-06 20:25:39] Training the metalearner rf using Kappa metric...
#> note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
#>
#> INFO [2025-04-06 20:25:53] Calculating probabilities on 53 test individuals...
#> INFO [2025-04-06 20:25:53] Transforming 53 rows of data
The result object can be passed to other functions in spect to help visualize and evaluate the model. First, take a look at the survival curve for a given individual - let’s say number fifteen. The absolute survival probability is plotted in blue, and the conditional probability of surviving the given time slice, given that the individual survived to the beginning of that time slice is shown in orange.
To see only the conditional probability, set the curve_type to “conditional.” For only the absolute probability plot, set curve_type to “absolute.”
Next, build a population-based survival curve (the so-called Kaplan-Meier curve). Note that in order to create a curve, we need to turn the individual survival curves above into actual predicted survival times. To do that, we must choose a probability threshold, then project that onto the time axis at the point of intersection with the curve. The search granularity allows us to do this for a number of choices of the probability threshold and visualize the resulting population curves.
km_data <- plot_km(result, prediction_threshold_search_granularity = 0.1)
#> INFO [2025-04-06 20:25:53] Calculating plot increments...
#> INFO [2025-04-06 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.100000
#> INFO [2025-04-06 20:25:53] Found 10 individuals...
#> INFO [2025-04-06 20:25:53] Generate P-value...
#> INFO [2025-04-06 20:25:53] Pval is 0.00 for threhold 0.10
#> INFO [2025-04-06 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.200000
#> INFO [2025-04-06 20:25:53] Found 16 individuals...
#> INFO [2025-04-06 20:25:53] Generate P-value...
#> INFO [2025-04-06 20:25:53] Pval is 0.00 for threhold 0.20
#> INFO [2025-04-06 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.300000
#> INFO [2025-04-06 20:25:53] Found 19 individuals...
#> INFO [2025-04-06 20:25:53] Generate P-value...
#> INFO [2025-04-06 20:25:53] Pval is 0.03 for threhold 0.30
#> INFO [2025-04-06 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.400000
#> INFO [2025-04-06 20:25:53] Found 23 individuals...
#> INFO [2025-04-06 20:25:53] Generate P-value...
#> INFO [2025-04-06 20:25:53] Pval is 0.10 for threhold 0.40
#> INFO [2025-04-06 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.500000
#> INFO [2025-04-06 20:25:53] Found 26 individuals...
#> INFO [2025-04-06 20:25:53] Generate P-value...
#> INFO [2025-04-06 20:25:53] Pval is 0.09 for threhold 0.50
#> INFO [2025-04-06 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.600000
#> INFO [2025-04-06 20:25:53] Found 30 individuals...
#> INFO [2025-04-06 20:25:53] Generate P-value...
#> INFO [2025-04-06 20:25:53] Pval is 0.10 for threhold 0.60
#> INFO [2025-04-06 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.700000
#> INFO [2025-04-06 20:25:53] Found 36 individuals...
#> INFO [2025-04-06 20:25:53] Generate P-value...
#> INFO [2025-04-06 20:25:53] Pval is 0.25 for threhold 0.70
#> INFO [2025-04-06 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.800000
#> INFO [2025-04-06 20:25:53] Found 40 individuals...
#> INFO [2025-04-06 20:25:53] Generate P-value...
#> INFO [2025-04-06 20:25:53] Pval is 0.39 for threhold 0.80
#> INFO [2025-04-06 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.900000
#> INFO [2025-04-06 20:25:53] Found 49 individuals...
#> INFO [2025-04-06 20:25:53] Generate P-value...
#> INFO [2025-04-06 20:25:53] Pval is 0.58 for threhold 0.90
#> INFO [2025-04-06 20:25:53] Initializing Kaplan-Meier curve plot...
#> INFO [2025-04-06 20:25:53] Collecting data for each threshold value...
It is also often useful to see the ROC curve, however, because the modeling data contains one row per individual per time bin, we must choose a particular time at which to generate an ROC curve. Note that multiple times may be chose - a curve is generated for times 2 and 6 below.
prediction_times = c(2, 6)
eval <- evaluate_model(result, prediction_times)
#> INFO [2025-04-06 20:25:59] Collecting evaluation metric data...
Finally - it’s relatively easy to calculate quantities of interest such as the unconditional probability of an individual surviving to a particular time. Cumulative probability is already captured by the spect_predict() function in the abs_event_prob column. Note that the conditional probability of surviving an interval, given that the individual survived to the beginning of that interval is also available.
predictions <- spect_predict(result, new_data=predict_data)
#> INFO [2025-04-06 20:26:01] Transforming 10 rows of data
# Collect the absolute probability of individual 1 surviving to time 6.
individual = 1
survival_time_check = 6
tail(predictions[predictions$individual_id == individual & predictions$upper_bound
< survival_time_check,], n = 1)$abs_event_prob
#> [1] 0.9801006