Intrinsic variable selection

library("flevr")

Introduction

In the main vignette, I discussed how to perform flexible variable selection in a simple, simulated example. In this document, I will discuss intrinsic selection in more detail.

Intrinsic variable selection is variable selection performed using intrinsic variable importance (Williamson and Huang 2023). Intrinsic variable importance is a summary of the true population distribution; it is the difference between the best possible prediction performance obtained using a set of variables and the best possible prediction performance obtained without using that set of variables (Williamson, Gilbert, Simon, et al. 2021; Williamson, Gilbert, Carone, et al. 2020). Prediction performance can be measured using R-squared, classification accuracy, area under the receiver operating characteristic curve (AUC), and binomial deviance. Intrinsic importance can also be defined for a single variable, by averaging the gain in best possible prediction performance obtained by adding that variable to each possible subset of the remaining variables (Williamson and Feng 2020); this Shapley-based summary is the notion we use for variable selection.
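
For a single variable j among p candidates, this averaged importance can be written as a Shapley value. As a sketch, with V(s) denoting the best possible prediction performance attainable using the variables in subset s (see Williamson and Feng (2020) for the precise definition):

$$\psi_j = \frac{1}{p} \sum_{s \subseteq \{1, \ldots, p\} \setminus \{j\}} \binom{p - 1}{|s|}^{-1} \left\{ V(s \cup \{j\}) - V(s) \right\}.$$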

Throughout this vignette, I will use a dataset inspired by data collected by the Early Detection Research Network (EDRN). Biomarkers developed at six “labs” are validated at one or more of four “validation sites” on 306 cysts. The data also include two binary outcome variables: whether or not the cyst was classified as mucinous, and whether or not the cyst was determined to have high malignant potential. The data contain some missing values, which complicates variable selection; only 212 cysts have complete information. In the first analysis below, we drop the cysts with missing data; in the second, we handle the missing data appropriately using multiple imputation. I will use AUC to measure intrinsic importance.

# load the dataset
data("biomarkers")
library("dplyr")
# set up the outcome vector "y" and feature matrix "x", restricted to complete cases
cc <- complete.cases(biomarkers)
y <- biomarkers$mucinous
y_cc <- y[cc]
x_cc <- biomarkers %>%
  na.omit() %>%
  select(starts_with("lab"), starts_with("cea"))
x_names <- names(x_cc)
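
As a quick sanity check, we can confirm the complete-case sample size mentioned above (a simple check, assuming the biomarkers data shipped with flevr):

# number of cysts with complete information (should be 212)
sum(cc)
# dimensions of the complete-case feature matrix
dim(x_cc)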

Intrinsic variable selection

The first step in performing intrinsic variable selection is to estimate intrinsic variable importance. To estimate variable importance, I use the function sp_vim from the R package vimp (Williamson, Feng, Wolock, et al. 2023). This requires specifying a library of candidate learners for the Super Learner (van der Laan, Polley, and Hubbard 2007); throughout, I use a very simple library of learners (this is for illustration only; in practice, I suggest using a larger library, as sketched below).

set.seed(1234)

# set up a library for SuperLearner; this is too simple a library for use in most applications
learners <- "SL.glm"
univariate_learners <- "SL.glm"
V <- 2
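
As noted above, this library is too simple for real applications. A richer library might look like the following sketch (not run here; it assumes the glmnet, ranger, and xgboost packages are installed, since the corresponding SuperLearner wrappers depend on them):

# a larger candidate library for real analyses (illustration only)
big_learners <- c("SL.glm", "SL.glmnet", "SL.ranger", "SL.xgboost")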

# estimate the SPVIMs
library("SuperLearner")
library("vimp")
#> vimp version 2.3.1: Perform Inference on Algorithm-Agnostic Variable Importance
est <- suppressWarnings(
  sp_vim(Y = y_cc, X = x_cc, V = V, type = "auc",
         SL.library = learners, gamma = 0.1, alpha = 0.05, delta = 0,
         cvControl = list(V = V), env = environment())
)
est
#> Variable importance estimates:
#>        Estimate     SE         95% CI          VIMP > 0 p-value  
#> s = 1  0.000000e+00 0.04733026 [0, 0.09196796] FALSE    0.5000000
#> s = 2  4.215857e-02 0.16534232 [0, 0.36622356] FALSE    0.4131126
#> s = 3  0.000000e+00 0.04733026 [0, 0.09196796] FALSE    0.5000000
#> s = 4  0.000000e+00 0.04733026 [0, 0.09196796] FALSE    0.5000000
#> s = 5  6.187506e-05 0.06263558 [0, 0.12282535] FALSE    0.4998444
#> s = 6  0.000000e+00 0.11676553 [0, 0.21535200] FALSE    0.5000000
#> s = 7  0.000000e+00 0.04733026 [0, 0.09196796] FALSE    0.5000000
#> s = 8  7.752945e-03 0.11357388 [0, 0.23035366] FALSE    0.4820035
#> s = 9  0.000000e+00 0.05425335 [0, 0.10041316] FALSE    0.5000000
#> s = 10 2.070649e-02 0.07993317 [0, 0.17737263] FALSE    0.4492960
#> s = 11 0.000000e+00 0.04733026 [0, 0.09196796] FALSE    0.5000000
#> s = 12 0.000000e+00 0.05157831 [0, 0.10109163] FALSE    0.5000000
#> s = 13 6.410256e-03 0.02006046 [0, 0.04572804] FALSE    0.4832894
#> s = 14 0.000000e+00 0.06751652 [0, 0.12712006] FALSE    0.5000000
#> s = 15 0.000000e+00 0.04733026 [0, 0.09196796] FALSE    0.5000000
#> s = 16 0.000000e+00 0.04733026 [0, 0.09196796] FALSE    0.5000000
#> s = 17 0.000000e+00 0.05109337 [0, 0.09725160] FALSE    0.5000000
#> s = 18 4.462739e-04 0.06625191 [0, 0.13029763] FALSE    0.4988827
#> s = 19 0.000000e+00 0.04733026 [0, 0.09196796] FALSE    0.5000000
#> s = 20 0.000000e+00 0.04733026 [0, 0.09196796] FALSE    0.5000000
#> s = 21 0.000000e+00 0.04733026 [0, 0.09196796] FALSE    0.5000000

The next step is to choose an error rate to control and a method for controlling it. Here, I control the generalized family-wise error rate (gFWER) overall, using Holm-adjusted p-values to control the individual family-wise error rate:

intrinsic_set <- intrinsic_selection(
  spvim_ests = est, sample_size = nrow(x_cc), alpha = 0.2, feature_names = x_names,
  control = list( quantity = "gFWER", base_method = "Holm", k = 1)
)
intrinsic_set
#> # A tibble: 21 × 6
#>    feature                       est p_value adjusted_p_value  rank selected
#>    <chr>                       <dbl>   <dbl>            <dbl> <dbl> <lgl>   
#>  1 lab1_actb               0           0.5                  1    14 FALSE   
#>  2 lab1_molecules_score    0.0422      0.413                1     1 TRUE    
#>  3 lab1_telomerase_score   0           0.5                  1    14 FALSE   
#>  4 lab2_fluorescence_score 0           0.5                  1    14 FALSE   
#>  5 lab3_muc3ac_score       0.0000619   0.500                1     6 FALSE   
#>  6 lab3_muc5ac_score       0           0.5                  1    14 FALSE   
#>  7 lab4_areg_score         0           0.5                  1    14 FALSE   
#>  8 lab4_glucose_score      0.00775     0.482                1     3 FALSE   
#>  9 lab5_mucinous_call      0           0.5                  1    14 FALSE   
#> 10 lab5_neoplasia_v1_call  0.0207      0.449                1     2 FALSE   
#> # ℹ 11 more rows

I could also choose to control the false discovery rate (FDR), again using Holm-adjusted p-values to control the individual family-wise error rate:

intrinsic_set_fdr <- intrinsic_selection(
  spvim_ests = est, sample_size = nrow(x_cc), alpha = 0.2, feature_names = x_names,
  control = list( quantity = "FDR", base_method = "Holm", k = 1)
)
intrinsic_set_fdr
#> # A tibble: 21 × 6
#>    feature                       est p_value adjusted_p_value  rank selected
#>    <chr>                       <dbl>   <dbl>            <dbl> <dbl> <lgl>   
#>  1 lab1_actb               0           0.5                  1    14 FALSE   
#>  2 lab1_molecules_score    0.0422      0.413                1     1 FALSE   
#>  3 lab1_telomerase_score   0           0.5                  1    14 FALSE   
#>  4 lab2_fluorescence_score 0           0.5                  1    14 FALSE   
#>  5 lab3_muc3ac_score       0.0000619   0.500                1     6 FALSE   
#>  6 lab3_muc5ac_score       0           0.5                  1    14 FALSE   
#>  7 lab4_areg_score         0           0.5                  1    14 FALSE   
#>  8 lab4_glucose_score      0.00775     0.482                1     3 FALSE   
#>  9 lab5_mucinous_call      0           0.5                  1    14 FALSE   
#> 10 lab5_neoplasia_v1_call  0.0207      0.449                1     2 FALSE   
#> # ℹ 11 more rows
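
The point estimates, p-values, and ranks are identical across the two calls; only the selection decisions differ. Under gFWER control above, the top-ranked variable (lab1_molecules_score) is selected, while under FDR control at the same level it is not. We can extract the selected sets directly:

# selected variables under gFWER control
intrinsic_set %>% filter(selected)
# selected variables under FDR control
intrinsic_set_fdr %>% filter(selected)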

Intrinsic selection with missing data

n_imp <- 2  # number of imputations

To properly handle the missing data, we first perform multiple imputation. We use the R package mice (Buuren, Groothuis-Oudshoorn, and others 2023; van Buuren and Groothuis-Oudshoorn 2011), here with only 2 imputations (in practice, more imputations may be better).

library("mice")
set.seed(20231121)
mi_biomarkers <- mice::mice(data = biomarkers, m = n_imp, printFlag = FALSE)
imputed_biomarkers <- mice::complete(mi_biomarkers, action = "long") %>%
  rename(imp = .imp, id = .id)
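
A quick check that the stacked long-format data contain one complete copy of the dataset per imputation:

# each imputation should contribute nrow(biomarkers) = 306 rows
table(imputed_biomarkers$imp)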

We can perform intrinsic variable selection using the imputed data. First, we estimate variable importance for each imputed dataset.

set.seed(20231121)
est_lst <- lapply(as.list(1:n_imp), function(l) {
  this_x <- imputed_biomarkers %>%
    filter(imp == l) %>%
    select(starts_with("lab"), starts_with("cea"))
  this_y <- biomarkers$mucinous
  suppressWarnings(
    sp_vim(Y = this_y, X = this_x, V = V, type = "auc",
           SL.library = learners, gamma = 0.1, alpha = 0.05, delta = 0,
           cvControl = list(V = V), env = environment())
  )
})

Next, we use Rubin’s rules (Rubin 2018) to combine the variable importance estimates across imputations, and use the combined estimates to perform variable selection. Here, we control the generalized family-wise error rate at k = 5, using Holm-adjusted p-values to control the individual family-wise error rate.

intrinsic_set <- intrinsic_selection(
  spvim_ests = est_lst, sample_size = nrow(biomarkers),
  feature_names = x_names, alpha = 0.05, 
  control = list(quantity = "gFWER", base_method = "Holm", k = 5)
)
intrinsic_set

We select five variables, here those with the five largest estimated variable importance values. The point estimates and p-values have been computed using Rubin’s rules.
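
For intuition, this pooling amounts to applying Rubin’s rules to each feature’s SPVIM estimate. A minimal sketch (the helper pool_rubin and its inputs are hypothetical, not part of flevr):

# combine per-imputation point estimates (ests) and variances (vars)
# for a single feature using Rubin’s rules
pool_rubin <- function(ests, vars) {
  m <- length(ests)
  qbar <- mean(ests)                  # pooled point estimate
  ubar <- mean(vars)                  # within-imputation variance
  b <- var(ests)                      # between-imputation variance
  total_var <- ubar + (1 + 1 / m) * b # total variance
  c(est = qbar, se = sqrt(total_var))
}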

Buuren, S van, K Groothuis-Oudshoorn, and others. 2023. mice: Multivariate Imputation by Chained Equations. https://CRAN.R-project.org/package=mice.
Laan, MJ van der, EC Polley, and AE Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1): Online Article 25.
Rubin, Donald B. 2018. “Multiple Imputation.” In Flexible Imputation of Missing Data, Second Edition, 29–62. Chapman & Hall/CRC.
van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
Williamson, BD, and J Feng. 2020. “Efficient Nonparametric Statistical Inference on Population Feature Importance Using Shapley Values.” In Proceedings of the 37th International Conference on Machine Learning, 119:10282–91. Proceedings of Machine Learning Research. http://proceedings.mlr.press/v119/williamson20a.html.
Williamson, BD, J Feng, C Wolock, et al. 2023. vimp: Perform Inference on Algorithm-Agnostic Variable Importance. https://CRAN.R-project.org/package=vimp.
Williamson, BD, PB Gilbert, M Carone, et al. 2020. “Nonparametric Variable Importance Assessment Using Machine Learning Techniques.” Biometrics.
Williamson, BD, PB Gilbert, NR Simon, et al. 2021. “A General Framework for Inference on Algorithm-Agnostic Variable Importance.” Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.2003200.
Williamson, BD, and Y Huang. 2023. “Flexible Variable Selection in the Presence of Missing Data.” International Journal of Biostatistics. https://arxiv.org/abs/2202.12989.