Consensus Clustering with Multiple Imputed Data using cclustr

Introduction

Missing data is a pervasive challenge in applied statistics. When performing cluster analysis, the standard practice of imputing a single dataset and then clustering ignores the uncertainty introduced by the imputation process, potentially leading to unstable or misleading cluster assignments.

cclustr addresses this problem by applying clustering algorithms across all imputed datasets and synthesizing the results into a single, robust consensus partition. The core idea is the co-assignment matrix: a symmetric matrix $C$ where each entry $C_{ij}$ represents the proportion of imputed datasets in which observations $i$ and $j$ were assigned to the same cluster. High values indicate stable, robust groupings across imputations; low values indicate ambiguous assignments.

The main workflow of cclustr consists of four steps:

Standardize imputed data with as_mild_list()
Cluster each imputed dataset with cluster_imputations()
Synthesize results into a consensus partition with consensus_clustering()
Validate and select the optimal number of clusters with validate_clustering() and choose_best_clustering()

All four steps can also be executed in a single call using the convenience wrapper run_mi_clustering().

Dataset: Pima Indians Diabetes

We use the Pima Indians Diabetes Database (PimaIndiansDiabetes2) from the mlbench package, which contains clinical measurements for 768 women of Pima Indian heritage. The dataset presents 652 missing values distributed across several variables, strongly motivating the use of multiple imputation before clustering.

Variables

Variable	Type	Description
`pregnant`	Numeric	Number of times pregnant
`glucose`	Numeric	Plasma glucose concentration (2h oral test)
`pressure`	Numeric	Diastolic blood pressure (mm Hg)
`triceps`	Numeric	Triceps skin fold thickness (mm)
`insulin`	Numeric	2-hour serum insulin (mu U/ml)
`mass`	Numeric	Body mass index (kg/m²)
`pedigree`	Numeric	Diabetes pedigree function
`age`	Numeric	Age in years
`diabetes`	Factor	Diabetes test result (positive/negative)

Note: The diabetes variable is excluded from the clustering analysis. This avoids circular reasoning and respects the unsupervised nature of the analysis.

Loading the data

library(cclustr)
library(mice)

if (!requireNamespace("mlbench", quietly = TRUE)) {
  stop("Package 'mlbench' is required for this vignette. ",
       "Install it with: install.packages('mlbench')")
}

data("PimaIndiansDiabetes2", package = "mlbench")

# Remove diabetes from the clustering dataset
df <- PimaIndiansDiabetes2[, -9]

cat("Dimensions:", nrow(df), "x", ncol(df), "\n")
#> Dimensions: 768 x 8
cat("\nMissing values per variable:\n")
#> 
#> Missing values per variable:
print(colSums(is.na(df)))
#> pregnant  glucose pressure  triceps  insulin     mass pedigree      age 
#>        0        5       35      227      374       11        0        0
cat("\nTotal missing values:", sum(is.na(df)),
    "(", round(mean(is.na(df)) * 100, 1), "%)\n")
#> 
#> Total missing values: 652 ( 10.6 %)

The high proportion of missing values — particularly in insulin (48.7%) and triceps (29.6%) — makes this dataset a compelling real-world case study for multiple-imputation clustering.

Step 1: Multiple Imputation with mice

We generate 20 imputed datasets using mice with predictive mean matching (pmm). The as_mild_list() function from cclustr then converts the mids object into a standardized named list of completed data frames.


imp <- mice(df, m = 20, maxit = 20, method = "pmm", printFlag = FALSE, seed = 202604)

# Convert to cclustr standard format
mi_pima <- as_mild_list(imp)

cat("Number of imputed datasets :", length(mi_pima), "\n")
#> Number of imputed datasets : 20
cat("Dimensions of each dataset :", nrow(mi_pima[[1]]),
    "x", ncol(mi_pima[[1]]), "\n")
#> Dimensions of each dataset : 768 x 8
cat("Any missing values remaining:", any(sapply(mi_pima, anyNA)), "\n")
#> Any missing values remaining: FALSE

as_mild_list() performs strict consistency checks across all imputed datasets, ensuring identical dimensions, column names, and the absence of NA, NaN, Inf, or constant-valued columns.

Step 2: Clustering Each Imputed Dataset

Since all clustering variables are numeric, we use k-means clustering and scale_data = "global". The global scaling option computes the pooled mean and standard deviation across all imputations without stacking them in memory, ensuring that distances are on the same scale across all imputed datasets — a key requirement for a meaningful co-assignment matrix.

We evaluate $k = 2, 3, 4$ clusters simultaneously.

set.seed(202604)
partitions <- cluster_imputations(
  imp_list   = mi_pima,
  method     = "kmeans",
  k          = 2:4,
  scale_data = "global"
)

cat("Available k values:", names(partitions), "\n")
#> Available k values: k2 k3 k4
cat("Number of partitions per k:", length(partitions$k2), "\n")
#> Number of partitions per k: 20

The result is a named list of lists: one element per $k$ value (k2, k3, k4), each containing one integer vector of cluster assignments per imputed dataset.

Step 3: Consensus Clustering

consensus_clustering() builds the co-assignment matrix $C$ and derives a final partition from it. This process involves two conceptually distinct steps:

How the co-assignment matrix is constructed (consensus_method)
How the final clusters are extracted (cluster_method)

The `consensus_method` parameter

The consensus_method argument controls how each partition contributes to the co-assignment matrix $C$.

Method	Description
`"classic"`	Unweighted co-assignment. All partitions contribute equally to $C$. This is the standard approach in consensus clustering literature.
`"weighted_ari"`	Partitions are weighted according to their centrality within the ensemble, measured by their mean Adjusted Rand Index (ARI) against all other partitions. More consistent partitions have greater influence on $C$. Requires the `mclust` package.

The `cluster_method` parameter

The cluster_method argument controls the agglomeration method used in this second-stage hierarchical clustering step. It is important to distinguish it from the method argument in cluster_imputations():

method (in cluster_imputations): the algorithm used to cluster each imputed dataset (e.g., "kmeans", "ward.D2", "pam").
cluster_method (in consensus_clustering): the linkage criterion used to hierarchically cluster the co-assignment dissimilarity $1 - C$ in order to derive the final consensus partition.

The available options for cluster_method are the standard agglomeration methods from stats::hclust():

Method	Description
`"ward.D2"`	(default) Minimizes total within-cluster variance. Tends to produce compact, roughly equal-sized clusters. Recommended for most cases.
`"ward.D"`	Earlier Ward formulation; similar behavior to `"ward.D2"` but uses unsquared distances internally.
`"complete"`	Maximum linkage. Tends to produce compact clusters of similar diameter.
`"average"`	UPGMA. Compromise between single and complete; less sensitive to outliers than complete linkage.
`"single"`	Minimum linkage. Sensitive to chaining effects; not recommended for consensus matrices.
`"centroid"`	Uses the centroid of each cluster. Can produce inversions in the dendrogram.
`"median"`	Weighted centroid. Also susceptible to inversions.
`"mcquitty"`	WPGMA. Similar to average but uses equal weighting.

For consensus matrices, "ward.D2" and "complete" are generally the most appropriate choices, as they favor compact, well-separated groups that match the block-diagonal structure of a stable co-assignment matrix.

consensus_results <- consensus_clustering(
  partitions     = partitions,
  cluster_method = "ward.D2",
  consensus_method = "classic"
)

cat("Consensus solutions available:", names(consensus_results), "\n")
#> Consensus solutions available: k2 k3 k4

Step 4: Validation and Optimal k Selection

validate_clustering() computes a comprehensive set of internal and stability metrics for each candidate $k$. All metrics are computed on the consensus dissimilarity $1 - C$, making them robust to any data type.

val_table <- validate_clustering(
  partitions        = partitions,
  consensus_results = consensus_results
)

print(val_table)
#>   k       pac silhouette_mean ari_mean_between_imputations ari_consensus_mean
#> 1 2 0.2156841       0.8687290                    0.7867553          0.8138855
#> 2 3 0.2421807       0.8036448                    0.7318288          0.7933822
#> 3 4 0.2294383       0.7635850                    0.7117037          0.7807573
#>   calinski_harabasz_mean davies_bouldin_mean dunn_index
#> 1               7596.212           0.1053665  0.1176471
#> 2               3520.778           0.2050010  0.1000000
#> 3               2090.054           0.2125424  0.1052632

Metric Reference

Metric	Better when	Description
`pac`	Lower ↓	Proportion of Ambiguous Clusterings
`silhouette_mean`	Higher ↑	Mean silhouette width
`ari_mean_between_imputations`	Higher ↑	Stability across imputed datasets
`ari_consensus_mean`	Higher ↑	Agreement between consensus and each partition
`calinski_harabasz_mean`	Higher ↑	Cluster compactness and separation
`davies_bouldin_mean`	Lower ↓	Average cluster similarity
`dunn_index`	Higher ↑	Min inter-cluster / max intra-cluster distance

choose_best_clustering() aggregates all metrics using weighted rank aggregation: each metric is independently ranked and combined into a single score. The $k$ with the lowest weighted rank score is selected as optimal.

best <- choose_best_clustering(
  validation_table  = val_table,
  consensus_results = consensus_results,
  prefer_stability  = TRUE,
  tie_breaker       = "silhouette"
)

cat("Optimal k selected:", best$best_k, "\n")
#> Optimal k selected: 2
cat("\nConsensus cluster sizes:\n")
#> 
#> Consensus cluster sizes:
print(table(best$best_consensus))
#> 
#>   1   2 
#> 349 419
cat("\nWeighted rank scores:\n")
#> 
#> Weighted rank scores:
print(best$scores_table[, c("k", "pac", "silhouette_mean",
                             "ari_mean_between_imputations", "score")])
#>   k       pac silhouette_mean ari_mean_between_imputations    score
#> 1 2 0.2156841       0.8687290                    0.7867553 1.000000
#> 2 3 0.2421807       0.8036448                    0.7318288 2.285714
#> 3 4 0.2294383       0.7635850                    0.7117037 2.714286

Plotting Validation Metrics

plot_validation_metrics(best)

Validation metrics across k = 2, 3, 4. The optimal value per metric is highlighted in red.

Visualizing the Consensus Solution

All visualizations are based on the optimal k selected in the previous step, accessed through best$best_consensus_result.

Co-assignment Matrix

A well-defined consensus solution produces a block-diagonal pattern: bright blocks along the diagonal indicate observations consistently assigned to the same cluster across all imputed datasets, while dark off-diagonal regions indicate observations rarely grouped together.

plot_consensus_matrix(
  consensus_result = best,
  reorder          = TRUE,
  viridis_option   = "D",
  show_labels = FALSE
)

Co-assignment matrix for the optimal k. A clear block-diagonal structure indicates a stable consensus.

Consensus Dendrogram

The dendrogram is built from the hierarchical clustering of the consensus dissimilarity $1 - C$. Colored rectangles highlight the groups obtained by cutting the tree at the optimal $k$.

plot_consensus_dendrogram(
  consensus_result = best,
  rect             = TRUE,
  labels           = FALSE
)

Consensus dendrogram for the optimal k, derived from the dissimilarity matrix 1 - C.

Interpreting the Final Clusters

We attach the consensus labels to the first imputed dataset for a descriptive summary of each cluster’s clinical profile.

pima_clustered          <- mi_pima[[1]]
pima_clustered$cluster  <- factor(best$best_consensus)

# Mean values per cluster
num_vars <- c("pregnant", "glucose", "pressure",
              "triceps", "insulin", "mass", "pedigree", "age")

cat("Mean clinical values by cluster:\n")
#> Mean clinical values by cluster:
print(
  aggregate(
    pima_clustered[, num_vars],
    by  = list(Cluster = pima_clustered$cluster),
    FUN = function(x) round(mean(x, na.rm = TRUE), 2)
  )
)
#>   Cluster pregnant glucose pressure triceps insulin  mass pedigree   age
#> 1       1     5.42  139.72    78.48   33.87  204.56 36.02     0.52 40.36
#> 2       2     2.53  106.51    66.96   24.51  111.37 29.56     0.43 27.32

Key Variables by Cluster

par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))

boxplot(
  glucose ~ cluster,
  data = pima_clustered,
  main = "Plasma Glucose by Cluster",
  xlab = "Cluster",
  ylab = "Glucose (mg/dl)",
  col  = c("#4E79A7", "#F28E2B", "#E15759")
)

boxplot(
  mass ~ cluster,
  data = pima_clustered,
  main = "Body Mass Index by Cluster",
  xlab = "Cluster",
  ylab = "BMI (kg/m²)",
  col  = c("#4E79A7", "#F28E2B", "#E15759")
)

boxplot(
  insulin ~ cluster,
  data = pima_clustered,
  main = "Insulin",
  xlab = "Cluster",
  ylab = "Insulin (mu U/ml)",
  col  = c("#4E79A7", "#F28E2B", "#E15759")
)

Distribution of glucose and BMI by consensus cluster.

par(mfrow = c(1, 1))

Complete Pipeline with run_mi_clustering()

All previous steps can be executed in a single call using run_mi_clustering(). This is the recommended approach for most users.

set.seed(202604)

result <- run_mi_clustering(
  data             = imp,
  method           = "kmeans",
  k                = 2:4,
  scale_data       = "global",
  consensus_method = "classic",
  prefer_stability = TRUE,
  tie_breaker      = "silhouette"
)

cat("Best k selected:", result$best_k, "\n")
#> Best k selected: 2
cat("\nCluster sizes:\n")
#> 
#> Cluster sizes:
print(table(result$best_consensus))
#> 
#>   1   2 
#> 349 419

Advanced: ARI-Weighted Consensus

By default all imputed partitions contribute equally to the co-assignment matrix (consensus_method = "classic"). When partitions vary in quality, consensus_method = "weighted_ari" weights each partition by its mean Adjusted Rand Index (ARI) against all others — so more consistent partitions contribute more to the final consensus.

set.seed(202604)

result_w <- run_mi_clustering(
  data             = imp,
  method           = "kmeans",
  k                = 2:4,
  scale_data       = "global",
  consensus_method = "weighted_ari",
  prefer_stability = TRUE,
  tie_breaker      = "silhouette"
)

cat("Best k (ARI-weighted consensus):", result_w$best_k, "\n")
#> Best k (ARI-weighted consensus): 2
print(table(result_w$best_consensus))
#> 
#>   1   2 
#> 348 420

Consensus Clustering with Multiple Imputed Data using cclustr

Anhuar Duran Mendoza, Andres Montenegro Lemus, Mario Pacheco Lopez

2026-05-12

Introduction

Dataset: Pima Indians Diabetes

Variables

Loading the data

Step 1: Multiple Imputation with mice

Step 2: Clustering Each Imputed Dataset

Step 3: Consensus Clustering

The `consensus_method` parameter

The `cluster_method` parameter

Step 4: Validation and Optimal k Selection

Metric Reference

Plotting Validation Metrics

Visualizing the Consensus Solution

Co-assignment Matrix

Consensus Dendrogram

Interpreting the Final Clusters

Key Variables by Cluster

Complete Pipeline with run_mi_clustering()

Advanced: ARI-Weighted Consensus

Consensus Clustering with Multiple Imputed Data using cclustr

Anhuar Duran Mendoza, Andres Montenegro Lemus, Mario Pacheco Lopez

2026-05-12

Introduction

Dataset: Pima Indians Diabetes

Variables

Loading the data

Step 1: Multiple Imputation with mice

Step 2: Clustering Each Imputed Dataset

Step 3: Consensus Clustering

The consensus_method parameter

The cluster_method parameter

Step 4: Validation and Optimal k Selection

Metric Reference

Plotting Validation Metrics

Visualizing the Consensus Solution

Co-assignment Matrix

Consensus Dendrogram

Interpreting the Final Clusters

Key Variables by Cluster

Complete Pipeline with run_mi_clustering()

Advanced: ARI-Weighted Consensus

The `consensus_method` parameter

The `cluster_method` parameter