cutpointr: Bootstrapping

Christian Thiele, Lorenz A. Kapsner

2022-04-13

Bootstrapping is implemented in cutpointr with two goals:

  1. Determine optimal cutpoints with bootstrapping (as an alternative to determining them without bootstrapping)
  2. Validate (any) cutpoint optimization with bootstrapping

This vignette will briefly go through some examples for both approaches.

Determine optimal cutpoints

Without bootstrapping: maximize_metric

As a first basic example, the cutpoint is optimized without any bootstrapping by maximizing the Youden index. With the method maximize_metric, the optimization is performed on the full data set:

library(cutpointr)
data(suicide)
opt_cut <- cutpointr(
    data = suicide,
    x = dsi,
    class = suicide,
    method = maximize_metric,
    metric = youden,
    pos_class = "yes",
    direction = ">="
)
summary(opt_cut)
## Method: maximize_metric 
## Predictor: dsi 
## Outcome: suicide 
## Direction: >= 
## 
##     AUC   n n_pos n_neg
##  0.9238 532    36   496
## 
##  optimal_cutpoint youden    acc sensitivity specificity tp fn fp  tn
##                 2 0.7518 0.8647      0.8889      0.8629 32  4 68 428
## 
## Predictor summary: 
##     Data Min.   5% 1st Qu. Median      Mean 3rd Qu.  95% Max.       SD NAs
##  Overall    0 0.00       0      0 0.9210526       1 5.00   11 1.852714   0
##       no    0 0.00       0      0 0.6330645       0 4.00   10 1.412225   0
##      yes    0 0.75       4      5 4.8888889       6 9.25   11 2.549821   0
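
As a quick sanity check, the metrics in the summary can be recomputed by hand from the printed confusion-matrix counts (tp, fn, fp, tn). The sketch below uses only base R; the variable names are illustrative and not part of cutpointr.

```r
# Counts copied from the summary output above
tp <- 32; fn <- 4; fp <- 68; tn <- 428

sensitivity <- tp / (tp + fn)                 # 32 / 36
specificity <- tn / (tn + fp)                 # 428 / 496
youden      <- sensitivity + specificity - 1  # Youden index
acc         <- (tp + tn) / (tp + fn + fp + tn)

round(c(youden = youden, acc = acc,
        sensitivity = sensitivity, specificity = specificity), 4)
```

Rounded to four digits, these values agree with the youden, acc, sensitivity and specificity columns of the summary.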

The fields in the resulting R object opt_cut are to be interpreted as follows:

Bootstrap cutpoints: maximize_boot_metric

The optimal cutpoint can also be determined using bootstrapping. To do so, choose one of the methods maximize_boot_metric or minimize_boot_metric. These functions take further arguments that configure the bootstrapping; they are documented in help("maximize_boot_metric", "cutpointr"). The most important arguments are:

The cutpoint is optimized in n=boot_cut bootstrap samples by maximizing or minimizing the respective metric (the Youden index in this example) in each of these samples. Finally, the summary function (summary_func) is applied to aggregate the optimal cutpoints from the n=boot_cut bootstrap samples into one final ‘optimal’ cutpoint.
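
The idea can be illustrated with a minimal base R sketch. It is not the code cutpointr runs internally: the simulated data and the helper functions youden_at and best_cut are assumptions made purely for illustration, and candidate cutpoints are simply taken to be the observed predictor values.

```r
# Hypothetical toy data (not the suicide data set)
set.seed(1)
x <- c(rnorm(100, mean = 0), rnorm(50, mean = 2))
y <- rep(c("no", "yes"), times = c(100, 50))

# Youden index of the rule "x >= cut predicts 'yes'"
youden_at <- function(cut, x, y) {
  pred_pos <- x >= cut
  sens <- mean(pred_pos[y == "yes"])
  spec <- mean(!pred_pos[y == "no"])
  sens + spec - 1
}

# Observed value that maximizes the Youden index
best_cut <- function(x, y) {
  cands <- sort(unique(x))
  j <- vapply(cands, youden_at, numeric(1), x = x, y = y)
  cands[which.max(j)]
}

# Optimize the cutpoint in each bootstrap sample ...
boot_cut <- 50
boot_cuts <- replicate(boot_cut, {
  idx <- sample(length(x), replace = TRUE)
  best_cut(x[idx], y[idx])
})

# ... and aggregate with the summary function (here: mean)
final_cut <- mean(boot_cuts)
final_cut
```

Because the mean of several cutpoints is returned, the final cutpoint does not have to be a value that occurs in the data, which is why the bootstrapped result can be a non-integer even for an integer-valued predictor.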

set.seed(123)
opt_cut <- cutpointr(
    data = suicide,
    x = dsi,
    class = suicide,
    method = maximize_boot_metric,
    boot_cut = 200,
    summary_func = mean,
    metric = youden,
    pos_class = "yes",
    direction = ">="
)
summary(opt_cut)
## Method: maximize_boot_metric 
## Predictor: dsi 
## Outcome: suicide 
## Direction: >= 
## 
##     AUC   n n_pos n_neg
##  0.9238 532    36   496
## 
##  optimal_cutpoint youden    acc sensitivity specificity tp fn fp  tn
##             2.055 0.6927 0.8816      0.8056      0.8871 29  7 56 440
## 
## Predictor summary: 
##     Data Min.   5% 1st Qu. Median      Mean 3rd Qu.  95% Max.       SD NAs
##  Overall    0 0.00       0      0 0.9210526       1 5.00   11 1.852714   0
##       no    0 0.00       0      0 0.6330645       0 4.00   10 1.412225   0
##      yes    0 0.75       4      5 4.8888889       6 9.25   11 2.549821   0

The fields in the resulting R object opt_cut are to be interpreted as follows:

Validate cutpoint optimization with bootstrapping

Any method chosen to find the optimal cutpoint can subsequently be validated with bootstrapping. This is activated by setting the argument boot_runs > 0. Please be aware that the optimal cutpoint is first calculated with the specified method in exactly the same manner as described above and therefore yields the same outputs (when bootstrapping cutpoints, this holds only if the same seed is used).

However, the method to calculate the optimal cutpoints will then additionally be performed on n=boot_runs bootstrap samples. For each of these bootstrap samples, several metrics and performance measures are available from the resulting $boot object, both for the in-bag (suffix: ‘_b’) and the out-of-bag (suffix: ‘_oob’) bootstrap samples. Please note that the optimal cutpoint is determined on the in-bag samples only and then just applied to the out-of-bag samples for validation purposes, so its value is available only once in the $boot object without a suffix.
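
A single validation run of this kind might look as follows in base R. This is a sketch under stated assumptions, not cutpointr's implementation: the data are simulated, youden_at is a hypothetical helper, and the variable names merely mimic the ‘_b’ and ‘_oob’ suffixes.

```r
# Hypothetical toy data (not the suicide data set)
set.seed(42)
n <- 200
x <- c(rnorm(150, mean = 0), rnorm(50, mean = 2))
y <- rep(c("no", "yes"), times = c(150, 50))

# Youden index of the rule "x >= cut predicts 'yes'"
youden_at <- function(cut, x, y) {
  pred_pos <- x >= cut
  mean(pred_pos[y == "yes"]) + mean(!pred_pos[y == "no"]) - 1
}

# In-bag: drawn with replacement; out-of-bag: never drawn
in_bag <- sample(n, replace = TRUE)
oob    <- setdiff(seq_len(n), in_bag)

# The cutpoint is determined on the in-bag sample only ...
cands    <- sort(unique(x[in_bag]))
j_b      <- vapply(cands, youden_at, numeric(1),
                   x = x[in_bag], y = y[in_bag])
cut_b    <- cands[which.max(j_b)]
youden_b <- max(j_b)

# ... and then merely applied to the out-of-bag sample
youden_oob <- youden_at(cut_b, x[oob], y[oob])

c(optimal_cutpoint = cut_b, youden_b = youden_b, youden_oob = youden_oob)
```

Repeating these steps boot_runs times yields one row per repetition, which is the structure of the $boot object.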

maximize_metric

opt_cut <- cutpointr(
    data = suicide,
    x = dsi,
    class = suicide,
    method = maximize_metric,
    metric = youden,
    pos_class = "yes",
    direction = ">=",
    boot_runs = 100
)
## Running bootstrap...

The interpretation of fields in the resulting R object opt_cut is the same as above. The results from the bootstrapping are available from $boot.

summary(opt_cut)
## Method: maximize_metric 
## Predictor: dsi 
## Outcome: suicide 
## Direction: >= 
## Nr. of bootstraps: 100 
## 
##     AUC   n n_pos n_neg
##  0.9238 532    36   496
## 
##  optimal_cutpoint youden    acc sensitivity specificity tp fn fp  tn
##                 2 0.7518 0.8647      0.8889      0.8629 32  4 68 428
## 
## Predictor summary: 
##     Data Min.   5% 1st Qu. Median      Mean 3rd Qu.  95% Max.       SD NAs
##  Overall    0 0.00       0      0 0.9210526       1 5.00   11 1.852714   0
##       no    0 0.00       0      0 0.6330645       0 4.00   10 1.412225   0
##      yes    0 0.75       4      5 4.8888889       6 9.25   11 2.549821   0
## 
## Bootstrap summary: 
##          Variable Min.   5% 1st Qu. Median Mean 3rd Qu.  95% Max.   SD NAs
##  optimal_cutpoint 1.00 1.00    2.00   2.00 2.08    2.00 4.00 4.00 0.69   0
##             AUC_b 0.85 0.89    0.90   0.92 0.92    0.94 0.96 0.97 0.02   0
##           AUC_oob 0.82 0.86    0.91   0.93 0.92    0.95 0.97 0.98 0.04   0
##          youden_b 0.60 0.67    0.72   0.75 0.75    0.79 0.85 0.89 0.06   0
##        youden_oob 0.49 0.58    0.67   0.73 0.72    0.78 0.84 0.87 0.08   0
##             acc_b 0.74 0.77    0.86   0.87 0.86    0.88 0.91 0.92 0.04   0
##           acc_oob 0.74 0.77    0.84   0.86 0.86    0.88 0.90 0.92 0.04   0
##     sensitivity_b 0.76 0.82    0.86   0.89 0.90    0.93 0.97 1.00 0.05   0
##   sensitivity_oob 0.60 0.69    0.81   0.87 0.86    0.92 1.00 1.00 0.09   0
##     specificity_b 0.72 0.76    0.85   0.87 0.86    0.88 0.91 0.92 0.04   0
##   specificity_oob 0.73 0.76    0.84   0.86 0.86    0.88 0.91 0.93 0.04   0
##    cohens_kappa_b 0.19 0.25    0.38   0.43 0.41    0.46 0.52 0.56 0.07   0
##  cohens_kappa_oob 0.15 0.25    0.34   0.39 0.39    0.44 0.49 0.56 0.08   0
opt_cut$boot[[1]] |> 
  head()
## # A tibble: 6 x 23
##   optimal_cutpoint AUC_b AUC_oob youden_b youden_oob acc_b acc_oob sensitivity_b
##              <dbl> <dbl>   <dbl>    <dbl>      <dbl> <dbl>   <dbl>         <dbl>
## 1                2 0.891   0.95     0.732      0.698 0.852   0.863         0.882
## 2                1 0.912   0.969    0.705      0.753 0.774   0.769         0.943
## 3                2 0.902   0.934    0.718      0.780 0.842   0.918         0.879
## 4                2 0.892   0.961    0.662      0.808 0.842   0.880         0.818
## 5                2 0.893   0.966    0.701      0.818 0.850   0.909         0.851
## 6                2 0.941   0.909    0.788      0.755 0.891   0.843         0.897
## # ... with 15 more variables: sensitivity_oob <dbl>, specificity_b <dbl>,
## #   specificity_oob <dbl>, cohens_kappa_b <dbl>, cohens_kappa_oob <dbl>,
## #   TP_b <dbl>, FP_b <dbl>, TN_b <int>, FN_b <int>, TP_oob <dbl>, FP_oob <dbl>,
## #   TN_oob <int>, FN_oob <int>, roc_curve_b <list>, roc_curve_oob <list>
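
Since $boot holds one row per bootstrap repetition, its columns can be summarized with ordinary R functions. The following sketch assumes the column names shown in the output above; the object names oc, b, cut_range and optimism are illustrative, the number of bootstrap runs is reduced to keep the example fast, and the code is skipped if cutpointr is not installed.

```r
# Runs only if cutpointr is installed
if (requireNamespace("cutpointr", quietly = TRUE)) {
  library(cutpointr)
  data(suicide)
  set.seed(100)
  oc <- cutpointr(
      data = suicide,
      x = dsi,
      class = suicide,
      method = maximize_metric,
      metric = youden,
      pos_class = "yes",
      direction = ">=",
      boot_runs = 50
  )
  b <- oc$boot[[1]]  # one row per bootstrap repetition

  # Spread of the cutpoints selected across the bootstrap samples
  cut_range <- quantile(b$optimal_cutpoint,
                        probs = c(0.025, 0.975), na.rm = TRUE)

  # Average gap between in-bag and out-of-bag Youden index
  optimism <- mean(b$youden_b - b$youden_oob, na.rm = TRUE)
}
```

A positive gap would suggest that the in-bag performance is somewhat optimistic compared with the performance on data that were not used to select the cutpoint.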

maximize_boot_metric

When cutpoints are bootstrapped and additionally validated with bootstrapping, the optimal cutpoint is again first determined as above: the respective metric is maximized or minimized in each of the n=boot_cut bootstrap samples, and the summary function then aggregates the resulting cutpoints into one final ‘optimal’ cutpoint. Hence, using the same seed here yields the same outputs as above, where no outer bootstrapping was applied.

In the validation routine, the chosen cutpoint optimization is then repeated in each of the n=boot_runs (outer) bootstrap samples: the optimal cutpoint is determined in each bootstrap sample by optimizing the metric on n=boot_cut (inner) bootstrap samples and applying the summary_func to aggregate them into one value.
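
To get a feeling for the cost of this nesting, the number of cutpoint optimizations can be counted for the settings used in this vignette. The calculation below is a rough back-of-the-envelope sketch that follows the description above; the variable names are illustrative.

```r
boot_cut  <- 200  # inner bootstrap samples per cutpoint estimate
boot_runs <- 100  # outer bootstrap samples for validation

# Each outer sample triggers a full inner bootstrap
fits_validation <- boot_runs * boot_cut

# Plus the inner bootstrap for the cutpoint on the full data
fits_total <- fits_validation + boot_cut

c(validation = fits_validation, total = fits_total)
```

With these settings the metric is optimized roughly 20,000 times, compared with 200 times when no validation is requested.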

Since the (inner) bootstrapping of optimal cutpoints is performed in each of the (outer) validation bootstrap samples, this can be computationally very expensive and take some time to finish. Therefore, cutpointr supports parallelization, which is enabled by setting the argument allowParallel = TRUE and initializing a parallel environment.

library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(doRNG)
## Loading required package: rngtools
cl <- makeCluster(2) # 2 cores
registerDoParallel(cl)
registerDoRNG(12)
set.seed(123)
opt_cut <- cutpointr(
    data = suicide,
    x = dsi,
    class = suicide,
    method = maximize_boot_metric,
    boot_cut = 200,
    summary_func = mean,
    metric = youden,
    pos_class = "yes",
    direction = ">=",
    boot_runs = 100,
    allowParallel = TRUE
)
## Running bootstrap...
stopCluster(cl)

Again, the interpretation of fields in the resulting R object opt_cut is the same as above. The results from the bootstrapping are available from $boot.

summary(opt_cut)
## Method: maximize_boot_metric 
## Predictor: dsi 
## Outcome: suicide 
## Direction: >= 
## Nr. of bootstraps: 100 
## 
##     AUC   n n_pos n_neg
##  0.9238 532    36   496
## 
##  optimal_cutpoint youden    acc sensitivity specificity tp fn fp  tn
##             2.055 0.6927 0.8816      0.8056      0.8871 29  7 56 440
## 
## Predictor summary: 
##     Data Min.   5% 1st Qu. Median      Mean 3rd Qu.  95% Max.       SD NAs
##  Overall    0 0.00       0      0 0.9210526       1 5.00   11 1.852714   0
##       no    0 0.00       0      0 0.6330645       0 4.00   10 1.412225   0
##      yes    0 0.75       4      5 4.8888889       6 9.25   11 2.549821   0
## 
## Bootstrap summary: 
##          Variable Min.   5% 1st Qu. Median Mean 3rd Qu.  95% Max.   SD NAs
##  optimal_cutpoint 1.07 1.60    1.93   2.08 2.16    2.28 2.97 3.60 0.45   0
##             AUC_b 0.86 0.89    0.91   0.93 0.93    0.94 0.96 0.96 0.02   0
##           AUC_oob 0.84 0.88    0.90   0.92 0.92    0.95 0.97 0.98 0.03   0
##          youden_b 0.60 0.63    0.68   0.72 0.72    0.76 0.79 0.84 0.05   0
##        youden_oob 0.48 0.57    0.64   0.69 0.71    0.78 0.84 0.88 0.09   0
##             acc_b 0.83 0.85    0.87   0.88 0.88    0.89 0.91 0.93 0.02   0
##           acc_oob 0.83 0.84    0.86   0.88 0.88    0.89 0.91 0.92 0.02   0
##     sensitivity_b 0.71 0.75    0.80   0.83 0.84    0.87 0.91 0.94 0.05   0
##   sensitivity_oob 0.58 0.69    0.75   0.81 0.83    0.91 1.00 1.00 0.10   0
##     specificity_b 0.83 0.85    0.87   0.88 0.88    0.89 0.91 0.94 0.02   0
##   specificity_oob 0.82 0.83    0.87   0.88 0.88    0.90 0.92 0.93 0.03   0
##    cohens_kappa_b 0.32 0.33    0.38   0.42 0.43    0.47 0.54 0.59 0.06   0
##  cohens_kappa_oob 0.22 0.31    0.37   0.41 0.41    0.47 0.53 0.56 0.07   0
opt_cut$boot[[1]] |> 
  head()
## # A tibble: 6 x 23
##   optimal_cutpoint AUC_b AUC_oob youden_b youden_oob acc_b acc_oob sensitivity_b
##              <dbl> <dbl>   <dbl>    <dbl>      <dbl> <dbl>   <dbl>         <dbl>
## 1             2.34 0.931   0.950    0.729      0.613 0.870   0.892         0.857
## 2             2.58 0.948   0.896    0.725      0.628 0.883   0.867         0.839
## 3             1.70 0.939   0.889    0.784      0.711 0.872   0.846         0.914
## 4             1.95 0.894   0.962    0.680      0.844 0.861   0.851         0.816
## 5             2.03 0.906   0.963    0.692      0.786 0.885   0.872         0.8  
## 6             2.58 0.928   0.881    0.771      0.540 0.900   0.860         0.868
## # ... with 15 more variables: sensitivity_oob <dbl>, specificity_b <dbl>,
## #   specificity_oob <dbl>, cohens_kappa_b <dbl>, cohens_kappa_oob <dbl>,
## #   TP_b <dbl>, FP_b <dbl>, TN_b <int>, FN_b <int>, TP_oob <dbl>, FP_oob <dbl>,
## #   TN_oob <int>, FN_oob <int>, roc_curve_b <list>, roc_curve_oob <list>

Some visualizations of the bootstrapping results are available with the plot function:

plot(opt_cut)

The two plots in the lower half can be generated separately with plot_cut_boot(opt_cut) and plot_metric_boot(opt_cut).