Evaluation and explanation

Alex Zwanenburg

2026-04-22

library(familiar)
library(data.table)

set.seed(19)

Familiar will automatically evaluate created models, and export corresponding tables and plots. The table below lists all available analyses. Several analyses allow for determining distributions and confidence intervals to provide an indication of the value spread for a given dataset and models (if applicable). This behaviour is defined by the estimation_type argument, which can be set to point, bias_correction or bootstrap_confidence_interval.

Another argument, detail_level, determines how the required point estimates are formed, for example from the complete ensemble of models (detail_level="ensemble") or from the individual models it contains.

The more point estimates are required, the more computationally intensive the analysis is. Some analyses also become computationally more expensive with more samples. The sample_limit argument can be used to specify the maximum number of samples that should be used.

Some analyses, such as model predictions and individual conditional expectation plots, do not benefit from bootstraps as they are solely based on values predicted by the (ensemble) models. If you want to compute bias-corrected estimates or bootstrap confidence intervals for these analyses you should ensure that sufficient models are created. Experimental designs such as experimental_design="bs(fs+mb,400)+ev" allow for this, but are computationally expensive.

Several methods do not require any more information than that provided in a prediction table or dataset.

Evaluation and explanation steps.
a Available for binomial and multinomial outcomes.
b Available for survival outcomes.
c Available for all outcomes.
d Estimation types other than point require sufficient models.
e May not be available for all models.
Name Plot function Export function Est. type Detail level Sample limit Model Pred. table Dataset
AUC-PRa plot_auc_precision_recall_curve export_auc_data × × × ×
AUC-ROCa plot_auc_roc_curve export_auc_data × × × ×
Calibration infoc export_calibration_info × ×
Confusion matrixa plot_confusion_matrix export_confusion_matrix_data × × ×
Decision curve analysisab plot_decision_curve export_decision_curve_analysis_data × × × ×
Feature expressionc plot_sample_clustering export_feature_expressions × ×
Feature selection variable importancec plot_feature_selection_occurrence; plot_feature_selection_variable_importance export_fs_vimp ×
Feature similarityc plot_feature_similarity export_feature_similarity × × ×
Hyperparametersc export_hyperparameters ×
Individual conditional expectationc plot_ice export_ice_data ×d × × ×
Model calibrationce plot_calibration_data export_calibration_data × × ×
Model performancec plot_model_performance export_model_performance × × × ×
Model predictionsc export_prediction_data ×d × × ×
Model variable importancece plot_model_signature_occurrence; plot_model_signature_variable_importance export_model_vimp ×
Partial dependencec plot_pd export_partial_dependence_data ×d × × ×
Permutation variable importancec plot_permutation_variable_importance export_permutation_vimp × × ×
Risk stratification infob export_risk_stratification_info × ×
Risk stratification datab plot_kaplan_meier export_risk_stratification_data × ×
Sample similarityc export_sample_similarity × × × ×
SHAP values plot_shap_summary, plot_shap_waterfall, plot_shap_force, plot_shap_dependence export_shap × × ×
Univariate analysisc plot_univariate_importance export_univariate_analysis_data ×

When familiar is run through summon_familiar, plots and tables are exported automatically. The corresponding plot and export functions can all be used externally as well. This vignette shows their use.

We will first create two models. The first model aims to predict the risk of ischemic stroke based on clinical and image features. This is a classification problem. For illustration purposes, we will use the first 99 samples as development data, and the remaining samples as a hold-out test set.

ischemic_stroke_data <- data.table::as.data.table(modeldata::ischemic_stroke)

# Set categorical variables. All of these are binary no/yes variables, so we
# convert them in a single loop.
binary_columns <- c(
  "male", "smoking_history", "atrial_fibrillation", "coronary_artery_disease",
  "diabetes_history", "hypercholesterolemia_history", "hypertension_history"
)
for (column in binary_columns) {
  data.table::set(
    ischemic_stroke_data,
    j = column,
    value = factor(
      ischemic_stroke_data[[column]], levels = c(0, 1), labels = c("no", "yes")
    )
  )
}

# Create training and hold-out test set.
ischemic_stroke_data_train <- ischemic_stroke_data[1L:99L, ]
ischemic_stroke_data <- ischemic_stroke_data[100L:126L, ]

# Call familiar to create models. We don't use summon_familiar here, but call
# train_familiar instead, which returns familiar models.
ischemic_stroke_model <- familiar::train_familiar(
  data = ischemic_stroke_data_train,
  outcome_column = "stroke",
  class_levels = c("no", "yes"),
  vimp_method = "none",
  learner = "random_forest_ranger_default",
  parallel = FALSE,
  verbose = FALSE
)

The second model is a survival model, and is trained on a colon chemotherapy clinical trial dataset (Moertel et al. 1995) to predict recurrence after treatment.

survival_data <- data.table::as.data.table(survival::colon)

# Select data corresponding to tumour recurrence.
survival_data <- survival_data[etype == 1L]

# Remove unnecessary columns.
survival_data[, ":=" ("id" = NULL, "study" = NULL, "etype" = NULL, "node4" = NULL)]

# Set categorical variables.
survival_data$sex <- factor(
  survival_data$sex, levels = c(0, 1), labels = c("female", "male")
)
survival_data$obstruct <- factor(
  survival_data$obstruct, levels = c(0, 1), labels = c("no", "yes")
)
survival_data$adhere <- factor(
  survival_data$adhere, levels = c(0, 1), labels = c("no", "yes")
)
survival_data$differ <- factor(
  survival_data$differ,
  levels = c(1, 2, 3),
  labels = c("well", "moderate", "poor"),
  ordered = TRUE
)
survival_data$extent <- factor(
  survival_data$extent,
  levels = c(1, 2, 3, 4),
  labels = c("submucosa", "muscle", "serosa", "contiguous structures"),
  ordered = TRUE
)
survival_data$surg <- factor(
  survival_data$surg, levels = c(0, 1), labels = c("short", "long")
)

# Create training and hold-out test set.
survival_data_train <- survival_data[1L:799L, ]
survival_data <- survival_data[800L:929L, ]

survival_model <- familiar::train_familiar(
  data = survival_data_train,
  outcome_column = c("status", "time"),
  vimp_method = "none",
  learner = "random_forest_ranger_default",
  parallel = FALSE,
  verbose = FALSE
)

Model performance

We assess model performance to evaluate how accurate a model is. Model accuracy is quantified using performance metrics, which are documented in the Performance metrics vignette, and can be set using the evaluation_metric argument of the summon_familiar function or the metric argument of plot and export functions. For example, this results in the following plot for the ischemic stroke dataset:

plots <- familiar::plot_model_performance(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  metric = c("auc", "brier", "f1_score"),
  x_axis_by = "metric"
)

Performance of the ischemic stroke model on hold-out test data. The model is a classifier, and was assessed using three metrics: area under the receiver-operating characteristic curve (AUC), the Brier score, and the F1-score. Higher AUC and F1-scores, and lower Brier score, indicate better models. The graph shows the distribution of performance metrics obtained from bootstraps of the dataset, along with 2.5th, 50.0th and 97.5th percentiles.

The plot shows distributions of the area under the receiver operating characteristic curve, the Brier score and the F1-score, based on bootstrap resampling of model predictions.

Model performance can also be determined directly from predictions, which makes it possible to assess predictions made by non-familiar models. This involves conversion of the predictions to a prediction table object using as_prediction_table, as shown below:

predictions <- predict(
  object = ischemic_stroke_model,
  newdata = ischemic_stroke_data
)

# Create a prediction table object using as_prediction_table. At minimum,
# you need to provide the predicted probabilities for the positive class
# (yes), the reference labels, the type of prediction (classification),
# and the class levels (no, yes).
predictions <- familiar::as_prediction_table(
  x = predictions$yes,
  y = ischemic_stroke_data$stroke,
  type = "classification",
  class_levels = c("no", "yes")
)

# The prediction table object is then provided as the object argument.
plots <- familiar::plot_model_performance(
  object = predictions,
  metric = c("auc", "brier", "f1_score"),
  x_axis_by = "metric"
)

Performance for the ischemic stroke model obtained using a prediction table. This results in the same figure as above, obtained using the model itself.

This produces the exact same plot. Model performance can also be exported to a table using export_model_performance.

results <- familiar::export_model_performance(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  metric = c("auc", "brier", "f1_score"),
  aggregate_results = TRUE
)
results[[1L]]@data
#>                       data_set vimp_method                      learner ensemble_model_name   metric     value    ci_low     ci_up
#>                         <fctr>      <fctr>                       <fctr>              <fctr>   <fctr>     <num>     <num>     <num>
#> 1: 173529_2wFvaYlrMM6XXSNOt9rE        none random_forest_ranger_default      ensemble.NA.NA      auc 0.7387626 0.5166193 0.9146674
#> 2: 173529_2wFvaYlrMM6XXSNOt9rE        none random_forest_ranger_default      ensemble.NA.NA    brier 0.2041370 0.1639073 0.2492457
#> 3: 173529_2wFvaYlrMM6XXSNOt9rE        none random_forest_ranger_default      ensemble.NA.NA f1_score 0.7407407 0.5152258 0.8967186

In the above table, data_set is randomly generated because we call this method on a (technically) new dataset. ensemble_model_name contains a placeholder for the same reason. vimp_method and learner are derived from the model, and metrics with 95% confidence intervals are shown.

Receiver-operating characteristic curve

The receiver-operating characteristic (ROC) curve visualises concordance between predicted class probabilities and the observed classes (Hand and Till 2001). It shows the trade-off between sensitivity and specificity of the model. The receiver-operating characteristic curve for the ischemic stroke dataset is as follows:

plots <- familiar::plot_auc_roc_curve(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data
)

Receiver-operating characteristic curve for the ischemic stroke model on hold-out test data. The 95% confidence interval is shown (grey).

Under ideal circumstances, the curve would lie in the top-left corner of the plot. If a model does not predict better than at random, the curve would lie along the diagonal, which corresponds to an AUC-ROC value of 0.50. The plot above indicates that the model can somewhat differentiate between patients with and without stroke.

Familiar creates an ROC curve by sorting instances in ascending order of the predicted probability of the positive class, and computing the number of true positive, true negative, false positive and false negative cases observed at each instance. These are then used to compute sensitivity and specificity.

The above plot shows the average of 400 single curves, with 95% confidence intervals. To achieve this, familiar interrogates each curve over the complete range of potential specificity values ([0.000, 1.000]) with a step size of 0.005. Linear interpolation is used for this purpose (Davis and Goadrich 2006). This allows for assessing the distribution of sensitivity at fixed specificity values.
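The interpolation step described above can be sketched as follows. This is an illustrative sketch with made-up curve points, not familiar's internal code:

```r
# A single (hypothetical) ROC curve from one bootstrap sample, expressed as
# (specificity, sensitivity) points.
roc_curve <- data.frame(
  specificity = c(0.0, 0.4, 0.7, 1.0),
  sensitivity = c(1.0, 0.9, 0.6, 0.0)
)

# Interrogate the curve over the full specificity range with step size 0.005.
specificity_grid <- seq(0.0, 1.0, by = 0.005)
sensitivity_interpolated <- approx(
  x = roc_curve$specificity,
  y = roc_curve$sensitivity,
  xout = specificity_grid,
  method = "linear"
)$y

# Sensitivity values at the same fixed specificities can now be averaged over
# bootstrap curves, with percentiles forming confidence intervals.
```

Because every bootstrap curve is evaluated at the same specificity grid, the curves can be summarised point-wise.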

The above approach is the general approach used by familiar. However, familiar will plot an exact ROC curve if the point estimate (estimation_type="point") of a single model or the complete ensemble of models is evaluated (detail_level="ensemble"):

plots <- familiar::plot_auc_roc_curve(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  estimation_type = "point"
)

Receiver-operating characteristic curve for the ischemic stroke model on hold-out test data. Whereas the previous figure shows the bootstrap estimate and confidence interval, the current curve is formed directly from the data.

Note that we did not specify the detail level to create the above plot because only a single model is evaluated.

Precision-recall curve

The precision-recall (PR) curve shows the trade-off between precision (positive predictive value) and recall (sensitivity). An important distinction between the PR curve and the ROC curve described above is that precision is based on false positives rather than true negatives. When the number of negative instances greatly exceeds the number of positive instances, a large change in the number of false positives alters the ROC curve only somewhat, but is far more easily observed in the PR curve (Davis and Goadrich 2006). Hence, the PR curve is commonly assessed if the positive class is more important and rarer than the negative class. For the ischemic stroke dataset, the precision-recall curve is as follows:

plots <- familiar::plot_auc_precision_recall_curve(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data
)

Precision-recall curve for the ischemic stroke model on hold-out test data. The 95% confidence interval is shown (grey).

Under ideal circumstances, the curve would lie in the top-right part of the plot. A random classifier would yield a curve that is mostly horizontal and located at the fraction of positive instances. This would equal a precision of 56% for the current dataset.

Like the ROC curve, familiar forms the PR curve by sorting instances in ascending order of the predicted probability of the positive class, and computing the number of true positive, false positive and false negative cases observed at each instance. These values are then used to compute recall and precision.
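Conceptually, the curve points can be derived as in the following sketch, which uses hypothetical probabilities and is not familiar's actual implementation:

```r
# Hypothetical predicted probabilities and observed classes (1 = positive).
probability <- c(0.9, 0.8, 0.7, 0.4, 0.3)
observed <- c(1, 1, 0, 1, 0)

# Use each predicted probability, in ascending order, as a cut-off, and count
# true positives, false positives and false negatives at that cut-off.
pr_points <- t(sapply(sort(probability), function(cut_off) {
  predicted <- as.integer(probability >= cut_off)
  tp <- sum(predicted == 1L & observed == 1L)
  fp <- sum(predicted == 1L & observed == 0L)
  fn <- sum(predicted == 0L & observed == 1L)
  c(recall = tp / (tp + fn), precision = tp / (tp + fp))
}))
```

At the lowest cut-off every instance is predicted positive, yielding a recall of 1 and a precision equal to the fraction of positive instances.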

The above plot shows the average of 400 single curves, with 95% confidence intervals. To achieve this, familiar interrogates each curve over the complete range of recall values ([0.000, 1.000]) with a step size of 0.005. We use linear interpolation for this purpose, treating the PR curve as a piecewise function to manage the situation identified by Davis and Goadrich (Davis and Goadrich 2006). Interpolating in this manner allows for assessing the distribution of precision values at fixed recall values.

Again, this is the general approach used by familiar. Familiar can plot an exact PR curve when evaluating the point estimate (estimation_type="point") of a single model or of the complete ensemble of models (detail_level="ensemble"). For the ischemic stroke dataset, this produces the following curve:

plots <- familiar::plot_auc_precision_recall_curve(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  estimation_type = "point"
)

Precision-recall curve for the ischemic stroke model on hold-out test data. Whereas the previous figure shows the bootstrap estimate and confidence interval, the current curve is formed directly from the data.

Confusion matrix

A confusion matrix schematically represents the number of predicted (expected) class instances and their actual (observed) values. This may help identify potential weaknesses of classifiers, i.e. false positives or negatives.

plots <- familiar::plot_confusion_matrix(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data
)

Confusion matrix for the ischemic stroke model on hold-out test data. The number of instances with predicted (expected) ischemic stroke, and actually observed stroke, is shown.

Familiar selects the class with the highest predicted class probability as the expected class. It is currently not possible in familiar to obtain confusion matrices for set probability thresholds, such as the threshold that maximises Youden's index.
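The selection of the expected class can be illustrated with a minimal sketch. The probabilities below are hypothetical, and this is not familiar's internal code:

```r
# Hypothetical predicted probabilities for the positive class, and the
# corresponding observed classes.
predicted_probability_yes <- c(0.81, 0.35, 0.62, 0.48)
observed_class <- factor(c("yes", "no", "no", "yes"), levels = c("no", "yes"))

# The class with the highest predicted probability becomes the expected
# class. For a binary outcome this reduces to a 0.5 cut-off.
expected_class <- factor(
  ifelse(predicted_probability_yes >= 0.5, "yes", "no"),
  levels = c("no", "yes")
)

# Cross-tabulate expected against observed classes.
confusion_matrix <- table(expected = expected_class, observed = observed_class)
```

Off-diagonal cells of confusion_matrix correspond to false positives and false negatives.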

Kaplan-Meier survival curves

Kaplan-Meier survival curves are a standard method for assessing survival over time in a population, or for assessing survival differences between groups.

Familiar applies one or more thresholds to stratify instances into risk groups. These thresholds are created during training to avoid bias. The number and values of these thresholds are determined by two parameters: stratification_method and stratification_threshold. Both parameters are set when calling summon_familiar or train_familiar. stratification_method selects one of three threshold-generating approaches.

The stratification_threshold parameter is only used when stratification_method = "fixed". It allows for specifying the sample quantiles that are used to determine the thresholds. For example, stratification_threshold = c(0.25, 0.75) creates two thresholds that stratify the development dataset into three groups: one with the 25% of instances at highest risk, one with the 50% of instances at medium risk, and a third with the 25% of instances at lowest risk.
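A sketch of how such quantile-based thresholds could be derived and applied follows. The risk values are simulated, and this is illustrative rather than familiar's internal implementation:

```r
# Hypothetical predicted risk values in a development dataset.
set.seed(19)
predicted_risk <- runif(200)

# stratification_threshold = c(0.25, 0.75) corresponds to these sample
# quantiles of the development data.
thresholds <- quantile(predicted_risk, probs = c(0.25, 0.75))

# Instances are then assigned to risk groups using the stored thresholds.
risk_group <- cut(
  predicted_risk,
  breaks = c(-Inf, thresholds, Inf),
  labels = c("low risk", "medium risk", "high risk")
)
table(risk_group)
```

Because the thresholds are fixed after training, new instances in a test set are assigned to risk groups with the same cut() call.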

Applying the survival model to the colon cancer dataset yields the following Kaplan-Meier plot:

plots <- familiar::plot_kaplan_meier(
  object = survival_model,
  data = survival_data
)

Kaplan-Meier plot for the colon cancer survival model on hold-out test data. The hold-out test data were split into two risk groups by applying a threshold that was defined during training. 95% confidence intervals are shown. The number of patients surviving without tumour recurrence is shown below the main plot.

In the plot above, familiar divides instances into two groups by the median threshold from the development dataset. The risk groups are shown with 95% confidence intervals. Familiar uses the survival::survfit function to create both the survival curves and the confidence intervals. The low-risk group has significantly better survival than the high-risk group. This is also indicated in the bottom-left of the main plot, which shows the result of a logrank test between the two groups. Crosses indicate right-censored patients, of which there were few. The numbers of surviving patients in both groups at the time points along the x-axis are shown below the main plot.

Model calibration

Evaluating model performance is important as it allows us to assess how accurate a model is. However, it often does not tell us how well we can rely on predictions for individual instances, e.g. how accurate the predicted class probability for an individual instance is. Model calibration is assessed for this purpose. The model for the ischemic stroke dataset produces the following calibration plot:

plots <- familiar::plot_calibration_data(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data
)

Calibration plot for the ischemic stroke model on hold-out test data. The plot shows the predicted (expected) probability of ischemic stroke against the observed occurrence within smaller groupings of data. The plot shows the ideal calibration (dashed line), the linear calibration fit (thin line), and the estimated calibration (thick line). 95% confidence intervals, based on bootstraps, are shown around the estimated calibration. The line above the plot shows the density of predicted probabilities. Calibration outside the main distribution is more uncertain. The inset describes the intercept and slope of the linear fit, and the results of the Hosmer-Lemeshow goodness-of-fit test.

The calibration curve shows the expected (predicted) probability of ischemic stroke versus the observed proportion of ischemic stroke, together with a 95% confidence interval. In an ideal calibration plot the calibration curve should lie on the diagonal (dashed) line. Low expected probabilities tend to overestimate the observed probability, whereas high expected probabilities tend to underestimate the observed probability. The density plot in the top panel indicates that predicted probabilities were distributed around 0.5, with relatively few low- and high-probability cases being predicted.

Familiar also performs several calibration tests to assess various degrees of calibration (Van Calster et al. 2016). First of all, we assess two aspects of the linear fit (solid line in the plot): the intercept (calibration-in-the-large) and slope. These are shown in the top-left of the main plot.

Ideally, the intercept is 0.00. In our example it lies below this value, but the ideal value is still contained within the 95% confidence interval, owing to the small number of samples in the test set.

In an ideal case, the slope of the linear fit is 1.00. In our example, the slope of the linear fit lies above this value.

One issue with assessing calibration through a linear fit is that the calibration curve can be decidedly non-linear and still produce good values for the intercept and slope. Therefore, we also assess calibration using statistical goodness-of-fit tests: the Hosmer-Lemeshow (HL) test for categorical and regression outcomes (Hosmer et al. 1997), and the Nam-D'Agostino (ND) (D'Agostino and Nam 2003) and Greenwood-Nam-D'Agostino (GND) (Demler et al. 2015) tests for survival outcomes.

Implementation details

Here we detail the implementation of model calibration in familiar. The statistical tests mentioned above, as well as the linear fit, rely on grouping instances. First, we order the instances in the dataset by their expected (predicted) values and assign them to groups. Within each group, the expected and observed values are then compared.

We can observe the points thus created in a calibration plot when estimation_type="point":

plots <- familiar::plot_calibration_data(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  estimation_type = "point"
)

Calibration plot for the ischemic stroke model on hold-out test data. Whereas in the earlier figure the calibration curve was the bootstrap estimate, here the expected probabilities and observed proportions in several (random) groupings are shown.

The plot with point estimates and the earlier calibration plot with bootstrap confidence intervals lead to the same assessment. However, you may not arrive at exactly the same plot, because familiar introduces randomness into the assessment of calibration. First, familiar randomises the number of groups drawn, because using a fixed number of 10 groups is arbitrary. Second, the actual group sizes are randomised by drawing group identifiers with replacement for all instances. The set of drawn identifiers is then sorted, and the instances, ordered by increasing expected value, are assigned to their respective groups.
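The randomised grouping can be sketched as follows. This is an illustrative sketch of the procedure, not the internal code:

```r
set.seed(19)
n_instances <- 27L

# 1. Randomise the number of groups, since a fixed number (e.g. 10) is
#    arbitrary.
n_groups <- sample(5L:10L, size = 1L)

# 2. Draw a group identifier with replacement for every instance, and sort
#    the identifiers. Instances, ordered by increasing expected value, are
#    then assigned in order, so group sizes vary between draws.
group_id <- sort(sample(seq_len(n_groups), size = n_instances, replace = TRUE))

table(group_id)
```

Repeating these two steps yields a different partition of the instances each time, which is what produces the spread of points in the plot above.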

The intercept and slope of the linear fit, as well as the goodness-of-fit test p-value, are determined for each single set of groups. Intercepts and slopes are then averaged (point and bias_correction) or assessed as a distribution (bootstrap_confidence_interval). The p-values from the goodness-of-fit tests need to be combined. These tests are not fully independent, because they derive from the same dataset. We therefore compute a harmonic mean p-value (Wilson 2019) from the individual p-values.
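With equal weights, the harmonic mean of a set of p-values reduces to a one-line computation, sketched below. Note that Wilson (2019) also describes an asymptotically exact test based on this statistic; the sketch only shows the combination step itself:

```r
# Harmonic mean p-value with equal weights: n / sum(1 / p).
harmonic_mean_p_value <- function(p) {
  length(p) / sum(1.0 / p)
}

harmonic_mean_p_value(c(0.02, 0.10, 0.40))
```

For p-values 0.02, 0.10 and 0.40 this yields 3 / 62.5 = 0.048, which is dominated by the smallest p-value rather than the largest.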

Decision curve analysis

Decision curve analysis is a method to assess the net (clinical) benefit of models, introduced by Vickers and Elkin (Vickers and Elkin 2006). The decision curve compares the model benefit for different threshold probabilities against two or more options. The two standard options are that all instances receive an intervention, and that none receive an intervention. What constitutes an intervention is model-dependent (Vickers et al. 2019). For our ischemic stroke cohort, this might be a comparison between treating all patients, treating none, and treating only those patients for whom the model predicts ischemic stroke. This produces the following figure:

plots <- familiar::plot_decision_curve(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  y_range = c(-0.2, 0.8)
)

The shaded curve is the decision curve produced by the model, with 95% confidence intervals. The declining curve represents the intervention-for-all strategy, which offers no benefit at a threshold probability of ischemic stroke of 56%. The horizontal line at a net benefit of 0.0 represents the no-intervention strategy. As may be observed, the model offers an expected net benefit compared to either strategy, but not always significantly.
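The net benefit underlying these curves weighs true positives against false positives at a given threshold probability (Vickers and Elkin 2006). A minimal sketch with hypothetical counts, not taken from the analysis above:

```r
# Net benefit = TP/n - FP/n * (threshold / (1 - threshold)).
net_benefit <- function(n_true_positive, n_false_positive, n, threshold) {
  n_true_positive / n -
    (n_false_positive / n) * (threshold / (1 - threshold))
}

# Hypothetical counts at a threshold probability of 0.2: 15 true positives
# and 12 false positives among 27 instances.
net_benefit(n_true_positive = 15, n_false_positive = 12, n = 27, threshold = 0.2)
```

The factor threshold / (1 - threshold) converts the threshold probability into the odds at which a false positive is traded off against a true positive; evaluating this over a grid of thresholds traces out a decision curve.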

For regression problems no definition of decision curves exists. However, familiar can perform decision curve analysis for survival outcomes, based on Vickers et al. (Vickers et al. 2008).

Variable importance

Not all features are strongly associated with the outcome of interest. One of the most critical steps in machine learning is selecting those features that are important enough to incorporate into a model. During training, familiar assesses variable importance using the methods detailed in the Variable importance methods vignette. Later, during evaluation, familiar employs both model-specific and model-agnostic methods to explain which features are considered important by the model itself.

Model-specific methods

Model-specific methods rely on aspects of the models themselves to determine variable importance. For instance, the coefficients of regression models are a measure of variable importance when features are normalised. In the case of the random forest model for the ischemic stroke dataset, permutation variable importance is determined. This produces the following plot:

plots <- familiar::plot_model_signature_variable_importance(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  rotate_x_tick_labels = TRUE
)

In such plots, we always order features so that the most important features are on the left, and the least important features on the right. Letters indicate features that are clustered together.

In case an ensemble of models is assessed, scores are aggregated using an aggregation method (see the variable importance methods vignette). This method can be set using the aggregation_method argument, or the eval_aggregation_method configuration parameter (summon_familiar).

The plot_model_signature_occurrence method plots the occurrence of features among the first 5 ranks across an ensemble of models (if available). The rank threshold can be specified using the rank_threshold argument or the eval_aggregation_rank_threshold parameter (summon_familiar).

Permutation variable importance

Permutation variable importance is a model-agnostic variable importance method. It assesses importance by measuring the decrease in model performance caused by shuffling values of a feature across the instances in a dataset.
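Conceptually, this can be sketched with a simple linear model. The sketch below is illustrative only, using correlation as a stand-in performance measure; familiar computes this internally with the configured metrics:

```r
set.seed(19)

# A toy dataset in which y depends strongly on x.
data <- data.frame(x = rnorm(100))
data$y <- data$x + rnorm(100, sd = 0.5)

model <- lm(y ~ x, data = data)

# Correlation between predictions and observations serves as the performance
# measure in this sketch.
performance <- function(model, data) cor(predict(model, newdata = data), data$y)

baseline <- performance(model, data)

# Shuffle the feature values across instances, destroying their information
# while preserving the marginal distribution.
shuffled_data <- data
shuffled_data$x <- sample(shuffled_data$x)

# The importance of x is the loss in performance caused by shuffling.
importance_x <- baseline - performance(model, shuffled_data)
```

An unimportant feature yields an importance near zero, because shuffling it barely changes the predictions.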

For the ischemic stroke dataset, this results in the following figure:

plots <- familiar::plot_permutation_variable_importance(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  estimation_type = "bias_correction"
)

Permutation variable importance plot for the ischemic stroke model on hold-out test data. Features are sorted from most important (top) to least important (bottom). Permutation variable importance measures the difference in model performance between normal and shuffled (informationless) input values for each feature. At the threshold of 1.00 every feature is shuffled individually, whereas at 0.23 groups of similar features are shuffled simultaneously.

The figure shows the features along the y-axis, and the loss in AUC-ROC caused by permuting the feature values along the x-axis. Accordingly, max_remodeling_ratio is considered to be the most important feature in the model.

The figure shows two bars for each feature. A well-known issue with permutation variable importance is that the presence of highly correlated features makes it unreliable (Hooker et al. 2019). If two features are both important and highly correlated, shuffling only one of them does not decrease model performance as much, because the model can still draw on the unshuffled feature. For this reason, familiar will permute features together depending on their mutual correlation and the provided threshold values.

By default, similarity between features is assessed using McFadden’s pseudo-\(R^2\). This metric can assess similarity between different types of features, i.e. numeric and categorical. The default thresholds are 0.23 and 1.0, which respectively cluster similar features (shuffling each cluster together) and permute every feature individually. The dataset contains many features that are highly similar and exceed the relatively stringent threshold of 0.23.
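The core idea can be illustrated with a minimal base-R sketch, outside of familiar: fit a simple logistic regression on simulated data, shuffle one feature at a time, and record the drop in AUC-ROC. The data, model, and helper functions below are purely illustrative:

```r
set.seed(19)

# Simulated data: x1 is informative, x2 is pure noise.
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(2 * x1))
data <- data.frame(x1 = x1, x2 = x2, y = y)

model <- glm(y ~ x1 + x2, family = binomial(), data = data)

# AUC-ROC via the rank-sum (Mann-Whitney) formulation.
auc <- function(probabilities, outcome) {
  r <- rank(probabilities)
  n_pos <- sum(outcome == 1)
  n_neg <- sum(outcome == 0)
  (sum(r[outcome == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

baseline_auc <- auc(predict(model, type = "response"), data$y)

# Permutation importance: shuffle one feature, measure the drop in AUC.
permutation_importance <- function(feature) {
  shuffled <- data
  shuffled[[feature]] <- sample(shuffled[[feature]])
  baseline_auc - auc(predict(model, newdata = shuffled, type = "response"), data$y)
}

importance <- sapply(c("x1", "x2"), permutation_importance)
```

Grouped permutation, as familiar performs for clustered features, would instead apply one shared permutation of row indices to all columns within a cluster.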

SHAP summary plots

A SHAP value is the marginal contribution of a feature value to the predicted value. For example, if you were to predict the number of daily ice cream sales, season could be a feature. The SHAP value for the summer season is likely positive, whereas the SHAP value for the winter season is likely negative. Important features have larger absolute SHAP values than less important features. Thus, a SHAP summary plot can be used to evaluate feature importance.

plots <- familiar::plot_shap_summary(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data
)

SHAP summary plot for the ischemic stroke model on hold-out test data. Features are sorted from most important (top) to least important (bottom), based on SHAP values. Each point represents the marginal contribution of a feature value to the predicted probability of the model, and is coloured according to its (normalised) feature value.

The SHAP summary plot shows several things: features are listed along the y-axis, with the most important feature on top, i.e. the feature with the highest mean absolute SHAP value. Each point corresponds to the feature value of an instance (a patient), and is coloured according to its (normalised) value. For the purpose of representation, all feature values are mapped to the [-1.0, 1.0] range.
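Ranking features by the mean absolute SHAP value is straightforward to sketch directly; the SHAP matrix below is made up for illustration, reusing the ice cream example:

```r
# Hypothetical SHAP value matrix: one row per instance, one column per feature.
shap_values <- matrix(
  c( 0.20, -0.05,  0.01,
    -0.15,  0.10, -0.02,
     0.25, -0.08,  0.03),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("season", "temperature", "weekday"))
)

# Feature importance as the mean absolute SHAP value, most important first.
importance <- sort(colMeans(abs(shap_values)), decreasing = TRUE)
```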

The default presentation might provide too much detail. It is therefore possible to change the plot_type or value_representation. For example, the above data could be presented as violin plots:

plots <- familiar::plot_shap_summary(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  plot_type = "violinplot"
)

SHAP summary plot for the ischemic stroke model on hold-out test data. Features are sorted from most important (top) to least important (bottom), based on SHAP values. The distribution of SHAP values for each feature is shown as a violin plot.

The violin plot shows the distribution of values more clearly, at the cost of losing information concerning underlying feature values.

If just variable importance is relevant, the SHAP summary plot can also be shown as a bar plot (plot_type = "barplot") showing the mean absolute SHAP value by default (value_representation = "abs_mean"):

plots <- familiar::plot_shap_summary(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  plot_type = "barplot",
  value_representation = "abs_mean"
)

SHAP summary plot for the ischemic stroke model on hold-out test data. Features are sorted from most important (top) to least important (bottom), based on the mean of the absolute SHAP values in the underlying data.

Feature effects

Variable importance does not explain how features influence the model, although the SHAP summary plot already provided some hints. Familiar offers multiple methods for assessing the effect of features.

Partial dependence and individual conditional expectation plots

Partial dependence (PD) (Friedman 2001) and individual conditional expectation (ICE) (Goldstein et al. 2015) plots can be used to visualise how features influence the model. As an example, we can create an ICE plot for the max_remodeling_ratio feature that was found to be important for predicting ischemic stroke.

plots <- familiar::plot_ice(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  features = "max_remodeling_ratio"
)

Individual conditional expectation plot for the max_remodeling_ratio feature in the ischemic stroke model using instances from the hold-out test data. Each curve follows the predicted probability for an instance where only the value of the max_remodeling_ratio is changed. The thick line corresponds to the average of all individual curves, i.e. partial dependence. High novelty scores signify values that are more likely to be outliers.

The ICE plot shows multiple curves for individual samples, as well as their average, the partial dependence curve, drawn as a somewhat thicker line. These curves are created by iteratively updating the value of max_remodeling_ratio and then observing the predicted outcome. To generate the sample points, familiar samples the feature value distribution, as observed in the training dataset, at regular percentile intervals. Also note that, by default, ICE plots only show a limited number of samples to prevent clutter; this number can be specified through n_max_samples_shown.
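Sampling a feature at regular percentile intervals can be sketched in base R; the 5% step and the exponential toy distribution below are assumptions, not necessarily the grid familiar uses:

```r
set.seed(19)

# Feature values as observed in a (hypothetical) training dataset.
observed_values <- rexp(200, rate = 1.0)

# Sample the distribution at regular percentile intervals; these values
# become the x-positions at which the ICE curves are evaluated.
grid <- quantile(observed_values, probs = seq(0.00, 1.00, by = 0.05))
```

Percentile-based grids place more evaluation points where the data are dense, unlike an equally spaced grid over the value range.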

The plot also shows novelty, which is a measure of how out-of-distribution a value is. Novelty is computed using extended isolation forests (Hariri et al. 2019) and is the average normalised height at which a sample is found in a terminal leaf node of the trees in the isolation forest (Liu et al. 2008). Novelty values range between 0.00 and 1.00 and should be interpreted with care: while increasing novelty values represent instances that lie further from the distribution, the values that represent in-distribution instances may vary. In the ICE plot above, novelty values range between 0.40 and 0.50; none of the points in the ICE plot can therefore be considered out-of-distribution. A move to a conformal prediction framework is planned for future versions.
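For background, the anomaly score originally proposed by Liu et al. (2008) normalises the expected path length \(E[h(x)]\) of an instance by the average path length \(c(n)\) of an unsuccessful search in a binary search tree with \(n\) instances:

\[s(x, n) = 2^{-E[h(x)]/c(n)}, \qquad c(n) = 2H(n-1) - \frac{2(n-1)}{n},\]

where \(H(i)\) denotes the \(i\)-th harmonic number. Scores close to 1.00 indicate likely anomalies. Note that familiar reports the average normalised height directly, rather than this exact score.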

It is possible to anchor the curves in the ICE plot to a specific feature value. This allows for assessing the model response for individual instances versus a fixed value. In effect this can reduce some of the offset caused by other (fixed) features in each instance, and may help to elucidate the observed behaviour:

plots <- familiar::plot_ice(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  features = "max_remodeling_ratio",
  anchor_values = list("max_remodeling_ratio" = 2.5)
)

Individual conditional expectation plot for the max_remodeling_ratio feature in the ischemic stroke model using instances from the hold-out test data. Each curve follows the predicted probability for an instance where only the value of the max_remodeling_ratio is changed. This curve was anchored at max_remodeling_ratio=2.5, i.e. at this value, every curve has a (relative) predicted probability of 0.0. The thick line corresponds to the average of all individual curves, i.e. partial dependence. High novelty scores signify values that are more likely to be outliers.

In the plot above, we have anchored the values at max_remodeling_ratio = 2.5. All curves are then drawn relative to the predicted probability found for this particular value.

It is also possible to show the partial dependence of two features at the same time:

plots <- familiar::plot_ice(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  features = c("max_remodeling_ratio", "coronary_artery_disease")
)

Partial dependence plot for the max_remodeling_ratio and coronary_artery_disease features in the ischemic stroke model using instances from the hold-out test data. Individual curves are not shown for 2D plots. High novelty scores, indicated by smaller points, signify values that are more likely to be outliers.

The partial dependence plot above shows max_remodeling_ratio along the x-axis and coronary_artery_disease along the y-axis. Higher intensities are associated with a higher predicted probability of ischemic stroke. The figure thus indicates that higher values of max_remodeling_ratio clearly increase the predicted probability of ischemic stroke. The presence of coronary_artery_disease can be seen to decrease the probability of ischemic stroke, probably because these patients arrived at the hospital due to this condition, not necessarily cerebral stroke. The average novelty score of the underlying instances determines the size of each point.

It is also possible to anchor 2D partial dependence plots:

plots <- familiar::plot_ice(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  features = c("max_remodeling_ratio", "coronary_artery_disease"),
  anchor_values = list(
    "max_remodeling_ratio" = 2.5,
    "coronary_artery_disease" = "no"
  ),
  show_novelty = FALSE
)

Partial dependence plot for the max_remodeling_ratio and coronary_artery_disease features in the ischemic stroke model using instances from the hold-out test data. Without novelty, elements are shown as rectangles to better show probability. Moreover, probabilities were anchored at max_remodeling_ratio = 2.5, and coronary_artery_disease = no. Individual curves are not shown for 2D plots.

We disabled novelty (show_novelty=FALSE) for plotting the 2D anchored plot. This changes the appearance of the plot, which now consists of coloured rectangles. The white dots show the position at which familiar samples the feature values.

Familiar automatically determines the values at which numerical features are sampled based on their distribution in the development dataset. Such values can also be manually set using the feature_x_range and feature_y_range arguments.

SHAP dependence plots

SHAP dependence plots function similarly to partial dependence plots. The SHAP dependence plot below shows how the SHAP values of max_remodeling_ratio relate to its feature values.

plots <- familiar::plot_shap_dependence(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  shap_feature = "max_remodeling_ratio"
)

SHAP dependence plot for the max_remodeling_ratio feature in the ischemic stroke model using instances from the hold-out test data.

The above plot shows that with increasing values of max_remodeling_ratio, the marginal contribution to the prediction changes from negative (low values decrease the predicted probability of ischemic stroke) to positive (high values increase the predicted probability of ischemic stroke).

SHAP dependence plots can also show interaction with other features. For example, we can show the interaction with the coronary_artery_disease feature.

plots <- familiar::plot_shap_dependence(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  shap_feature = "max_remodeling_ratio",
  interaction_feature = "coronary_artery_disease"
)

SHAP dependence plot for the max_remodeling_ratio feature in the ischemic stroke model using instances from the hold-out test data. Each point shows the corresponding value for coronary_artery_disease.

The addition of an interaction feature does not alter the shape of the plot, but individual points are now coloured by the value of the coronary_artery_disease feature for the same sample. It appears that coronary_artery_disease does not interact strongly with max_remodeling_ratio. We can instead select max_dilation_by_area as the interaction feature:

plots <- familiar::plot_shap_dependence(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  shap_feature = "max_remodeling_ratio",
  interaction_feature = "max_dilation_by_area"
)

SHAP dependence plot for the max_remodeling_ratio feature in the ischemic stroke model using instances from the hold-out test data. Each point shows the corresponding value for max_dilation_by_area.

The max_dilation_by_area feature, however, interacts much more strongly with max_remodeling_ratio, as shown in the SHAP dependence plot above.

SHAP force plots

SHAP force plots show the force exerted by features for individual samples. Samples are (by default) ordered by increasing predicted value, and the positive and negative marginal contributions are shown for each sample.

plots <- familiar::plot_shap_force(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data
)

SHAP force plot for the ischemic stroke model using all instances from the hold-out test data. The force plot shows the stacked positive and negative marginal contributions for all features.

On their own, force plots are not highly informative. However, one can highlight one or more features. For example, setting max_remodeling_ratio as a highlight yields the following figure:

plots <- familiar::plot_shap_force(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  highlight_feature = "max_remodeling_ratio"
)

SHAP force plot for the ischemic stroke model using all instances from the hold-out test data. The force plot shows the stacked positive and negative marginal contributions for all features. The contributions by the max_remodeling_ratio feature are highlighted.

SHAP waterfall plots

SHAP waterfall plots show how SHAP values for each feature value contribute to the predicted value for an individual sample.

plots <- familiar::plot_shap_waterfall(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data
)

SHAP waterfall plot for the ischemic stroke model for a single sample from the hold-out test data. The waterfall plot shows the marginal contributions of each feature for a sample. Features with the largest contributions are shown at the top.

The above plot shows a waterfall plot for the first sample, which has a predicted value of \(0.43\), where the average predicted value in ischemic_stroke_data (i.e. the hold-out test set that we initially defined) is \(0.54\). Feature values are shown on the left, and all features are ordered by the magnitude of their corresponding SHAP value.
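The waterfall plot is a direct visualisation of the additivity (local accuracy) property of SHAP values: for a sample \(x\) with \(p\) features, the prediction decomposes as

\[f(x) = \phi_0 + \sum_{j=1}^{p} \phi_j,\]

where \(\phi_0\) is the average prediction (here \(0.54\)) and the SHAP values \(\phi_j\) sum to the difference between the prediction and the average, i.e. \(0.43 - 0.54 = -0.11\) in this example.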

Feature and sample similarity

Highly similar features can carry redundant information for a model. This can be visualised using similarity heatmaps:

plots <- familiar::plot_feature_similarity(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  feature_similarity_metric = "spearman",
  rotate_x_tick_labels = TRUE
)

Feature similarity heatmap for features used in the ischemic stroke model. Similarity is computed using Spearman’s rank correlation coefficient. Features are clustered by similarity, with strongly (anti-)correlated features being grouped together more closely.

Because we specify feature_similarity_metric = "spearman" the heatmap displays Spearman’s correlation coefficient between the different features (even categorical ones). The features are also ordered by similarity, so that clusters of similar features can be identified. Note that the dendrograms depict cluster distance, i.e. \(1-|\rho|\) for Spearman’s correlation coefficient.
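The same kind of clustering can be reproduced outside of familiar in base R; the built-in mtcars dataset and the average linkage method below are illustrative assumptions, and familiar's defaults may differ:

```r
# Spearman's rank correlation between all (numeric) features.
correlation <- cor(mtcars, method = "spearman")

# Convert similarity to cluster distance, 1 - |rho|, and cluster features.
feature_distance <- as.dist(1 - abs(correlation))
feature_clustering <- hclust(feature_distance, method = "average")

# Dendrogram: strongly (anti-)correlated features merge at low heights.
plot(feature_clustering)
```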

It is also possible to view the sample clustering. This can be used to identify whether samples are readily stratified into clusters, which may or may not correlate with the endpoint of interest. For the ischemic stroke dataset, this yields the following heatmap:

plots <- familiar::plot_sample_clustering(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  feature_similarity_metric = "spearman",
  rotate_x_tick_labels = TRUE
)

Sample similarity heatmap of samples in the hold-out test set based on features used in the ischemic stroke model. Similarity between features is computed using Spearman’s rank correlation coefficient, whereas similarity between samples is computed using Gower’s distance. Strongly (anti-)correlated features and similar samples are each grouped together. Feature values are normalised.

The sample clustering figure shows no clear structure among the samples, nor a clear ordering of samples by their observed class label (ischemic stroke). The top dendrogram is the same as for feature clustering, as it should be. The right dendrogram is formed by computing Gower’s distance between samples (by default), and shows the clustering of samples. The heatmap itself shows normalised feature values, based on normalisation and transformation parameters obtained during model development.
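Gower's distance itself is available outside of familiar, for example through daisy() in the cluster package (shipped with R). The toy data and the average linkage method below are made up for illustration:

```r
library(cluster)  # provides daisy()

# Mixed-type example data: one numeric and one categorical feature.
samples <- data.frame(
  age = c(54, 61, 48, 70),
  smoker = factor(c("yes", "no", "no", "yes"))
)

# Gower's distance handles mixed feature types directly; each feature
# contributes a value in [0, 1] to the pairwise distance.
sample_distance <- daisy(samples, metric = "gower")
sample_clustering <- hclust(sample_distance, method = "average")
```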

Normalisation options can be altered through the show_normalised_data argument. However, features in the ischemic stroke dataset do not share the same value ranges, so without normalisation the figure is harder to interpret:

plots <- familiar::plot_sample_clustering(
  object = ischemic_stroke_model,
  data = ischemic_stroke_data,
  feature_similarity_metric = "spearman",
  show_normalised_data = "none",
  rotate_x_tick_labels = TRUE
)

Sample similarity heatmap of samples in the hold-out test set based on features used in the ischemic stroke model. Similarity between features is computed using Spearman’s rank correlation coefficient, whereas similarity between samples is computed using Gower’s distance. Strongly (anti-)correlated features and similar samples are each grouped together. Feature values were not normalised (for plotting), and are coloured according to their actual value.

References

D’Agostino, R B, and Byung-Ho Nam. 2003. “Evaluation of the Performance of Survival Analysis Models: Discrimination and Calibration Measures.” In Handbook of Statistics, vol. 23. Elsevier.
Davis, Jesse, and Mark Goadrich. 2006. “The Relationship Between Precision-Recall and ROC Curves.” Proceedings of the 23rd International Conference on Machine Learning (New York, NY, USA), ICML ’06, June, 233–40.
Demler, Olga V, Nina P Paynter, and Nancy R Cook. 2015. “Tests of Calibration and Goodness-of-Fit in the Survival Setting.” Stat. Med. 34 (10): 1659–80.
Efron, Bradley, and Trevor Hastie. 2016. Computer Age Statistical Inference. Cambridge University Press.
Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Ann. Stat. 29 (5): 1189–232.
Goldstein, Alex, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2015. “Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation.” J. Comput. Graph. Stat. 24 (1): 44–65.
Hand, David J, and Robert J Till. 2001. “A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems.” Mach. Learn. 45 (2): 171–86.
Hariri, S, M Carrasco Kind, and R J Brunner. 2019. “Extended Isolation Forest.” IEEE Trans. Knowl. Data Eng., 1–1.
Hooker, Giles, Lucas Mentch, and Siyu Zhou. 2019. Unrestricted Permutation Forces Extrapolation: Variable Importance Requires at Least One More Model, or There Is No Free Variable Importance. May. https://arxiv.org/abs/1905.03151.
Hosmer, D W, T Hosmer, S Le Cessie, and S Lemeshow. 1997. “A Comparison of Goodness-of-Fit Tests for the Logistic Regression Model.” Stat. Med. 16 (9): 965–80.
Hothorn, Torsten, and Berthold Lausen. 2003. “On the Exact Distribution of Maximally Selected Rank Statistics.” Comput. Stat. Data Anal. 43 (2): 121–37.
Lausen, Berthold, and Martin Schumacher. 1992. “Maximally Selected Rank Statistics.” Biometrics 48 (1): 73.
Liu, F T, K M Ting, and Z Zhou. 2008. “Isolation Forest.” 2008 Eighth IEEE International Conference on Data Mining, December, 413–22.
Moertel, C G, T R Fleming, J S Macdonald, et al. 1995. “Fluorouracil Plus Levamisole as Effective Adjuvant Therapy After Resection of Stage III Colon Carcinoma: A Final Report.” Ann. Intern. Med. 122 (5): 321–26.
Van Calster, Ben, Daan Nieboer, Yvonne Vergouwe, Bavo De Cock, Michael J Pencina, and Ewout W Steyerberg. 2016. “A Calibration Hierarchy for Risk Models Was Defined: From Utopia to Empirical Data.” J. Clin. Epidemiol. 74 (June): 167–76.
Vickers, Andrew J, Ben van Calster, and Ewout W Steyerberg. 2019. “A Simple, Step-by-Step Guide to Interpreting Decision Curve Analysis.” Diagn Progn Res 3 (1): 18.
Vickers, Andrew J, Angel M Cronin, Elena B Elkin, and Mithat Gonen. 2008. “Extensions to Decision Curve Analysis, a Novel Method for Evaluating Diagnostic Tests, Prediction Models and Molecular Markers.” BMC Med. Inform. Decis. Mak. 8 (November): 53.
Vickers, Andrew J, and Elena B Elkin. 2006. “Decision Curve Analysis: A Novel Method for Evaluating Prediction Models.” Med. Decis. Making 26 (6): 565–74.
Wilson, Daniel J. 2019. “The Harmonic Mean p-Value for Combining Dependent Tests.” Proc. Natl. Acad. Sci. U. S. A. 116 (4): 1195–200.