Consensus Clustering with Multiple Imputed Data using cclustr

Anhuar Duran Mendoza, Andres Montenegro Lemus, Mario Pacheco Lopez

2026-05-12

Introduction

Missing data is a pervasive challenge in applied statistics. When performing cluster analysis, the standard practice of imputing a single dataset and then clustering ignores the uncertainty introduced by the imputation process, potentially leading to unstable or misleading cluster assignments.

cclustr addresses this problem by applying clustering algorithms across all imputed datasets and synthesizing the results into a single, robust consensus partition. The core idea is the co-assignment matrix: a symmetric matrix \(C\) where each entry \(C_{ij}\) represents the proportion of imputed datasets in which observations \(i\) and \(j\) were assigned to the same cluster. High values indicate stable, robust groupings across imputations; low values indicate ambiguous assignments.

The main workflow of cclustr consists of four steps:

  1. Standardize imputed data with as_mild_list()
  2. Cluster each imputed dataset with cluster_imputations()
  3. Synthesize results into a consensus partition with consensus_clustering()
  4. Validate and select the optimal number of clusters with validate_clustering() and choose_best_clustering()

All four steps can also be executed in a single call using the convenience wrapper run_mi_clustering().


Dataset: Pima Indians Diabetes

We use the Pima Indians Diabetes Database (PimaIndiansDiabetes2) from the mlbench package, which contains clinical measurements for 768 women of Pima Indian heritage. The dataset presents 652 missing values distributed across several variables, strongly motivating the use of multiple imputation before clustering.

Variables

Variable Type Description
pregnant Numeric Number of times pregnant
glucose Numeric Plasma glucose concentration (2h oral test)
pressure Numeric Diastolic blood pressure (mm Hg)
triceps Numeric Triceps skin fold thickness (mm)
insulin Numeric 2-hour serum insulin (mu U/ml)
mass Numeric Body mass index (kg/m²)
pedigree Numeric Diabetes pedigree function
age Numeric Age in years
diabetes Factor Diabetes test result (positive/negative)

Note: The diabetes variable is excluded from the clustering analysis. This avoids circular reasoning and respects the unsupervised nature of the analysis.

Loading the data

library(cclustr)
library(mice)

if (!requireNamespace("mlbench", quietly = TRUE)) {
  stop("Package 'mlbench' is required for this vignette. ",
       "Install it with: install.packages('mlbench')")
}

data("PimaIndiansDiabetes2", package = "mlbench")

# Remove diabetes from the clustering dataset
df <- PimaIndiansDiabetes2[, -9]

cat("Dimensions:", nrow(df), "x", ncol(df), "\n")
#> Dimensions: 768 x 8
cat("\nMissing values per variable:\n")
#> 
#> Missing values per variable:
print(colSums(is.na(df)))
#> pregnant  glucose pressure  triceps  insulin     mass pedigree      age 
#>        0        5       35      227      374       11        0        0
cat("\nTotal missing values:", sum(is.na(df)),
    "(", round(mean(is.na(df)) * 100, 1), "%)\n")
#> 
#> Total missing values: 652 ( 10.6 %)

The high proportion of missing values — particularly in insulin (48.7%) and triceps (29.6%) — makes this dataset a compelling real-world case study for multiple-imputation clustering.


Step 1: Multiple Imputation with mice

We generate 20 imputed datasets using mice with predictive mean matching (pmm). The as_mild_list() function from cclustr then converts the mids object into a standardized named list of completed data frames.


imp <- mice(df, m = 20, maxit = 20, method = "pmm", printFlag = FALSE, seed = 202604)

# Convert to cclustr standard format
mi_pima <- as_mild_list(imp)

cat("Number of imputed datasets :", length(mi_pima), "\n")
#> Number of imputed datasets : 20
cat("Dimensions of each dataset :", nrow(mi_pima[[1]]),
    "x", ncol(mi_pima[[1]]), "\n")
#> Dimensions of each dataset : 768 x 8
cat("Any missing values remaining:", any(sapply(mi_pima, anyNA)), "\n")
#> Any missing values remaining: FALSE

as_mild_list() performs strict consistency checks across all imputed datasets, ensuring identical dimensions, column names, and the absence of NA, NaN, Inf, or constant-valued columns.


Step 2: Clustering Each Imputed Dataset

Since all clustering variables are numeric, we use k-means clustering and scale_data = "global". The global scaling option computes the pooled mean and standard deviation across all imputations without stacking them in memory, ensuring that distances are on the same scale across all imputed datasets — a key requirement for a meaningful co-assignment matrix.

We evaluate \(k = 2, 3, 4\) clusters simultaneously.

set.seed(202604)
partitions <- cluster_imputations(
  imp_list   = mi_pima,
  method     = "kmeans",
  k          = 2:4,
  scale_data = "global"
)

cat("Available k values:", names(partitions), "\n")
#> Available k values: k2 k3 k4
cat("Number of partitions per k:", length(partitions$k2), "\n")
#> Number of partitions per k: 20

The result is a named list of lists: one element per \(k\) value (k2, k3, k4), each containing one integer vector of cluster assignments per imputed dataset.


Step 3: Consensus Clustering

consensus_clustering() builds the co-assignment matrix \(C\) and derives a final partition from it. This process involves two conceptually distinct steps:

  1. How the co-assignment matrix is constructed (consensus_method)
  2. How the final clusters are extracted (cluster_method)

The consensus_method parameter

The consensus_method argument controls how each partition contributes to the co-assignment matrix \(C\).

Method Description
"classic" Unweighted co-assignment. All partitions contribute equally to \(C\). This is the standard approach in consensus clustering literature.
"weighted_ari" Partitions are weighted according to their centrality within the ensemble, measured by their mean Adjusted Rand Index (ARI) against all other partitions. More consistent partitions have greater influence on \(C\). Requires the mclust package.

The cluster_method parameter

The cluster_method argument controls the agglomeration method used in this second-stage hierarchical clustering step. It is important to distinguish it from the method argument in cluster_imputations():

The available options for cluster_method are the standard agglomeration methods from stats::hclust():

Method Description
"ward.D2" (default) Minimizes total within-cluster variance. Tends to produce compact, roughly equal-sized clusters. Recommended for most cases.
"ward.D" Earlier Ward formulation; similar behavior to "ward.D2" but uses unsquared distances internally.
"complete" Maximum linkage. Tends to produce compact clusters of similar diameter.
"average" UPGMA. Compromise between single and complete; less sensitive to outliers than complete linkage.
"single" Minimum linkage. Sensitive to chaining effects; not recommended for consensus matrices.
"centroid" Uses the centroid of each cluster. Can produce inversions in the dendrogram.
"median" Weighted centroid. Also susceptible to inversions.
"mcquitty" WPGMA. Similar to average but uses equal weighting.

For consensus matrices, "ward.D2" and "complete" are generally the most appropriate choices, as they favor compact, well-separated groups that match the block-diagonal structure of a stable co-assignment matrix.

consensus_results <- consensus_clustering(
  partitions     = partitions,
  cluster_method = "ward.D2",
  consensus_method = "classic"
)

cat("Consensus solutions available:", names(consensus_results), "\n")
#> Consensus solutions available: k2 k3 k4

Step 4: Validation and Optimal k Selection

validate_clustering() computes a comprehensive set of internal and stability metrics for each candidate \(k\). All metrics are computed on the consensus dissimilarity \(1 - C\), making them robust to any data type.

val_table <- validate_clustering(
  partitions        = partitions,
  consensus_results = consensus_results
)

print(val_table)
#>   k       pac silhouette_mean ari_mean_between_imputations ari_consensus_mean
#> 1 2 0.2156841       0.8687290                    0.7867553          0.8138855
#> 2 3 0.2421807       0.8036448                    0.7318288          0.7933822
#> 3 4 0.2294383       0.7635850                    0.7117037          0.7807573
#>   calinski_harabasz_mean davies_bouldin_mean dunn_index
#> 1               7596.212           0.1053665  0.1176471
#> 2               3520.778           0.2050010  0.1000000
#> 3               2090.054           0.2125424  0.1052632

Metric Reference

Metric Better when Description
pac Lower ↓ Proportion of Ambiguous Clusterings
silhouette_mean Higher ↑ Mean silhouette width
ari_mean_between_imputations Higher ↑ Stability across imputed datasets
ari_consensus_mean Higher ↑ Agreement between consensus and each partition
calinski_harabasz_mean Higher ↑ Cluster compactness and separation
davies_bouldin_mean Lower ↓ Average cluster similarity
dunn_index Higher ↑ Min inter-cluster / max intra-cluster distance

choose_best_clustering() aggregates all metrics using weighted rank aggregation: each metric is independently ranked and combined into a single score. The \(k\) with the lowest weighted rank score is selected as optimal.

best <- choose_best_clustering(
  validation_table  = val_table,
  consensus_results = consensus_results,
  prefer_stability  = TRUE,
  tie_breaker       = "silhouette"
)

cat("Optimal k selected:", best$best_k, "\n")
#> Optimal k selected: 2
cat("\nConsensus cluster sizes:\n")
#> 
#> Consensus cluster sizes:
print(table(best$best_consensus))
#> 
#>   1   2 
#> 349 419
cat("\nWeighted rank scores:\n")
#> 
#> Weighted rank scores:
print(best$scores_table[, c("k", "pac", "silhouette_mean",
                             "ari_mean_between_imputations", "score")])
#>   k       pac silhouette_mean ari_mean_between_imputations    score
#> 1 2 0.2156841       0.8687290                    0.7867553 1.000000
#> 2 3 0.2421807       0.8036448                    0.7318288 2.285714
#> 3 4 0.2294383       0.7635850                    0.7117037 2.714286

Plotting Validation Metrics

plot_validation_metrics(best)
Validation metrics across k = 2, 3, 4. The optimal value per metric is highlighted in red.
Validation metrics across k = 2, 3, 4. The optimal value per metric is highlighted in red.

Visualizing the Consensus Solution

All visualizations are based on the optimal k selected in the previous step, accessed through best$best_consensus_result.

Co-assignment Matrix

A well-defined consensus solution produces a block-diagonal pattern: bright blocks along the diagonal indicate observations consistently assigned to the same cluster across all imputed datasets, while dark off-diagonal regions indicate observations rarely grouped together.

plot_consensus_matrix(
  consensus_result = best,
  reorder          = TRUE,
  viridis_option   = "D",
  show_labels = FALSE
)
Co-assignment matrix for the optimal k. A clear block-diagonal structure indicates a stable consensus.
Co-assignment matrix for the optimal k. A clear block-diagonal structure indicates a stable consensus.

Consensus Dendrogram

The dendrogram is built from the hierarchical clustering of the consensus dissimilarity \(1 - C\). Colored rectangles highlight the groups obtained by cutting the tree at the optimal \(k\).

plot_consensus_dendrogram(
  consensus_result = best,
  rect             = TRUE,
  labels           = FALSE
)
Consensus dendrogram for the optimal k, derived from the dissimilarity matrix 1 - C.
Consensus dendrogram for the optimal k, derived from the dissimilarity matrix 1 - C.

Interpreting the Final Clusters

We attach the consensus labels to the first imputed dataset for a descriptive summary of each cluster’s clinical profile.

pima_clustered          <- mi_pima[[1]]
pima_clustered$cluster  <- factor(best$best_consensus)

# Mean values per cluster
num_vars <- c("pregnant", "glucose", "pressure",
              "triceps", "insulin", "mass", "pedigree", "age")

cat("Mean clinical values by cluster:\n")
#> Mean clinical values by cluster:
print(
  aggregate(
    pima_clustered[, num_vars],
    by  = list(Cluster = pima_clustered$cluster),
    FUN = function(x) round(mean(x, na.rm = TRUE), 2)
  )
)
#>   Cluster pregnant glucose pressure triceps insulin  mass pedigree   age
#> 1       1     5.42  139.72    78.48   33.87  204.56 36.02     0.52 40.36
#> 2       2     2.53  106.51    66.96   24.51  111.37 29.56     0.43 27.32

Key Variables by Cluster

par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))

boxplot(
  glucose ~ cluster,
  data = pima_clustered,
  main = "Plasma Glucose by Cluster",
  xlab = "Cluster",
  ylab = "Glucose (mg/dl)",
  col  = c("#4E79A7", "#F28E2B", "#E15759")
)

boxplot(
  mass ~ cluster,
  data = pima_clustered,
  main = "Body Mass Index by Cluster",
  xlab = "Cluster",
  ylab = "BMI (kg/m²)",
  col  = c("#4E79A7", "#F28E2B", "#E15759")
)

boxplot(
  insulin ~ cluster,
  data = pima_clustered,
  main = "Insulin",
  xlab = "Cluster",
  ylab = "Insulin (mu U/ml)",
  col  = c("#4E79A7", "#F28E2B", "#E15759")
)
Distribution of glucose and BMI by consensus cluster.
Distribution of glucose and BMI by consensus cluster.
par(mfrow = c(1, 1))

Complete Pipeline with run_mi_clustering()

All previous steps can be executed in a single call using run_mi_clustering(). This is the recommended approach for most users.

set.seed(202604)

result <- run_mi_clustering(
  data             = imp,
  method           = "kmeans",
  k                = 2:4,
  scale_data       = "global",
  consensus_method = "classic",
  prefer_stability = TRUE,
  tie_breaker      = "silhouette"
)

cat("Best k selected:", result$best_k, "\n")
#> Best k selected: 2
cat("\nCluster sizes:\n")
#> 
#> Cluster sizes:
print(table(result$best_consensus))
#> 
#>   1   2 
#> 349 419

Advanced: ARI-Weighted Consensus

By default all imputed partitions contribute equally to the co-assignment matrix (consensus_method = "classic"). When partitions vary in quality, consensus_method = "weighted_ari" weights each partition by its mean Adjusted Rand Index (ARI) against all others — so more consistent partitions contribute more to the final consensus.

set.seed(202604)

result_w <- run_mi_clustering(
  data             = imp,
  method           = "kmeans",
  k                = 2:4,
  scale_data       = "global",
  consensus_method = "weighted_ari",
  prefer_stability = TRUE,
  tie_breaker      = "silhouette"
)

cat("Best k (ARI-weighted consensus):", result_w$best_k, "\n")
#> Best k (ARI-weighted consensus): 2
print(table(result_w$best_consensus))
#> 
#>   1   2 
#> 348 420