Missing data is a pervasive challenge in applied statistics. When performing cluster analysis, the standard practice of imputing a single dataset and then clustering ignores the uncertainty introduced by the imputation process, potentially leading to unstable or misleading cluster assignments.
cclustr addresses this problem by applying clustering algorithms across all imputed datasets and synthesizing the results into a single, robust consensus partition. The core idea is the co-assignment matrix: a symmetric matrix \(C\) where each entry \(C_{ij}\) represents the proportion of imputed datasets in which observations \(i\) and \(j\) were assigned to the same cluster. High values indicate stable, robust groupings across imputations; low values indicate ambiguous assignments.
The main workflow of cclustr consists of four steps:
as_mild_list()cluster_imputations()consensus_clustering()validate_clustering() and
choose_best_clustering()All four steps can also be executed in a single call using the
convenience wrapper run_mi_clustering().
We use the Pima Indians Diabetes Database
(PimaIndiansDiabetes2) from the mlbench
package, which contains clinical measurements for 768 women of Pima
Indian heritage. The dataset presents 652 missing
values distributed across several variables, strongly
motivating the use of multiple imputation before clustering.
| Variable | Type | Description |
|---|---|---|
pregnant |
Numeric | Number of times pregnant |
glucose |
Numeric | Plasma glucose concentration (2h oral test) |
pressure |
Numeric | Diastolic blood pressure (mm Hg) |
triceps |
Numeric | Triceps skin fold thickness (mm) |
insulin |
Numeric | 2-hour serum insulin (mu U/ml) |
mass |
Numeric | Body mass index (kg/m²) |
pedigree |
Numeric | Diabetes pedigree function |
age |
Numeric | Age in years |
diabetes |
Factor | Diabetes test result (positive/negative) |
Note: The
diabetesvariable is excluded from the clustering analysis. This avoids circular reasoning and respects the unsupervised nature of the analysis.
library(cclustr)
library(mice)
if (!requireNamespace("mlbench", quietly = TRUE)) {
stop("Package 'mlbench' is required for this vignette. ",
"Install it with: install.packages('mlbench')")
}
data("PimaIndiansDiabetes2", package = "mlbench")
# Remove diabetes from the clustering dataset
df <- PimaIndiansDiabetes2[, -9]
cat("Dimensions:", nrow(df), "x", ncol(df), "\n")
#> Dimensions: 768 x 8
cat("\nMissing values per variable:\n")
#>
#> Missing values per variable:
print(colSums(is.na(df)))
#> pregnant glucose pressure triceps insulin mass pedigree age
#> 0 5 35 227 374 11 0 0
cat("\nTotal missing values:", sum(is.na(df)),
"(", round(mean(is.na(df)) * 100, 1), "%)\n")
#>
#> Total missing values: 652 ( 10.6 %)The high proportion of missing values — particularly in
insulin (48.7%) and triceps (29.6%) — makes
this dataset a compelling real-world case study for multiple-imputation
clustering.
We generate 20 imputed datasets using mice with
predictive mean matching (pmm). The
as_mild_list() function from cclustr then
converts the mids object into a standardized named list of
completed data frames.
imp <- mice(df, m = 20, maxit = 20, method = "pmm", printFlag = FALSE, seed = 202604)
# Convert to cclustr standard format
mi_pima <- as_mild_list(imp)
cat("Number of imputed datasets :", length(mi_pima), "\n")
#> Number of imputed datasets : 20
cat("Dimensions of each dataset :", nrow(mi_pima[[1]]),
"x", ncol(mi_pima[[1]]), "\n")
#> Dimensions of each dataset : 768 x 8
cat("Any missing values remaining:", any(sapply(mi_pima, anyNA)), "\n")
#> Any missing values remaining: FALSEas_mild_list() performs strict consistency checks across
all imputed datasets, ensuring identical dimensions, column names, and
the absence of NA, NaN, Inf, or
constant-valued columns.
Since all clustering variables are numeric, we use k-means
clustering and scale_data = "global". The global
scaling option computes the pooled mean and standard deviation across
all imputations without stacking them in memory, ensuring that distances
are on the same scale across all imputed datasets — a key requirement
for a meaningful co-assignment matrix.
We evaluate \(k = 2, 3, 4\) clusters simultaneously.
set.seed(202604)
partitions <- cluster_imputations(
imp_list = mi_pima,
method = "kmeans",
k = 2:4,
scale_data = "global"
)
cat("Available k values:", names(partitions), "\n")
#> Available k values: k2 k3 k4
cat("Number of partitions per k:", length(partitions$k2), "\n")
#> Number of partitions per k: 20The result is a named list of lists: one element per \(k\) value (k2,
k3, k4), each containing one integer vector of
cluster assignments per imputed dataset.
consensus_clustering() builds the co-assignment
matrix \(C\) and derives a
final partition from it. This process involves two conceptually
distinct steps:
consensus_method)cluster_method)consensus_method parameterThe consensus_method argument controls how each
partition contributes to the co-assignment matrix \(C\).
| Method | Description |
|---|---|
"classic" |
Unweighted co-assignment. All partitions contribute equally to \(C\). This is the standard approach in consensus clustering literature. |
"weighted_ari" |
Partitions are weighted according to their
centrality within the ensemble, measured by their mean
Adjusted Rand Index (ARI) against all other partitions. More consistent
partitions have greater influence on \(C\). Requires the mclust
package. |
cluster_method parameterThe cluster_method argument controls the
agglomeration method used in this second-stage
hierarchical clustering step. It is important to distinguish it from the
method argument in cluster_imputations():
method (in cluster_imputations): the
algorithm used to cluster each imputed dataset (e.g.,
"kmeans", "ward.D2", "pam").cluster_method (in consensus_clustering):
the linkage criterion used to hierarchically cluster the
co-assignment dissimilarity \(1 - C\) in order to derive the final
consensus partition.The available options for cluster_method are the
standard agglomeration methods from stats::hclust():
| Method | Description |
|---|---|
"ward.D2" |
(default) Minimizes total within-cluster variance. Tends to produce compact, roughly equal-sized clusters. Recommended for most cases. |
"ward.D" |
Earlier Ward formulation; similar behavior to "ward.D2"
but uses unsquared distances internally. |
"complete" |
Maximum linkage. Tends to produce compact clusters of similar diameter. |
"average" |
UPGMA. Compromise between single and complete; less sensitive to outliers than complete linkage. |
"single" |
Minimum linkage. Sensitive to chaining effects; not recommended for consensus matrices. |
"centroid" |
Uses the centroid of each cluster. Can produce inversions in the dendrogram. |
"median" |
Weighted centroid. Also susceptible to inversions. |
"mcquitty" |
WPGMA. Similar to average but uses equal weighting. |
For consensus matrices, "ward.D2" and
"complete" are generally the most appropriate choices, as
they favor compact, well-separated groups that match the block-diagonal
structure of a stable co-assignment matrix.
consensus_results <- consensus_clustering(
partitions = partitions,
cluster_method = "ward.D2",
consensus_method = "classic"
)
cat("Consensus solutions available:", names(consensus_results), "\n")
#> Consensus solutions available: k2 k3 k4validate_clustering() computes a comprehensive set of
internal and stability metrics for each candidate \(k\). All metrics are computed on the
consensus dissimilarity \(1 - C\),
making them robust to any data type.
val_table <- validate_clustering(
partitions = partitions,
consensus_results = consensus_results
)
print(val_table)
#> k pac silhouette_mean ari_mean_between_imputations ari_consensus_mean
#> 1 2 0.2156841 0.8687290 0.7867553 0.8138855
#> 2 3 0.2421807 0.8036448 0.7318288 0.7933822
#> 3 4 0.2294383 0.7635850 0.7117037 0.7807573
#> calinski_harabasz_mean davies_bouldin_mean dunn_index
#> 1 7596.212 0.1053665 0.1176471
#> 2 3520.778 0.2050010 0.1000000
#> 3 2090.054 0.2125424 0.1052632| Metric | Better when | Description |
|---|---|---|
pac |
Lower ↓ | Proportion of Ambiguous Clusterings |
silhouette_mean |
Higher ↑ | Mean silhouette width |
ari_mean_between_imputations |
Higher ↑ | Stability across imputed datasets |
ari_consensus_mean |
Higher ↑ | Agreement between consensus and each partition |
calinski_harabasz_mean |
Higher ↑ | Cluster compactness and separation |
davies_bouldin_mean |
Lower ↓ | Average cluster similarity |
dunn_index |
Higher ↑ | Min inter-cluster / max intra-cluster distance |
choose_best_clustering() aggregates all metrics using
weighted rank aggregation: each metric is independently
ranked and combined into a single score. The \(k\) with the lowest weighted rank score is
selected as optimal.
best <- choose_best_clustering(
validation_table = val_table,
consensus_results = consensus_results,
prefer_stability = TRUE,
tie_breaker = "silhouette"
)
cat("Optimal k selected:", best$best_k, "\n")
#> Optimal k selected: 2
cat("\nConsensus cluster sizes:\n")
#>
#> Consensus cluster sizes:
print(table(best$best_consensus))
#>
#> 1 2
#> 349 419
cat("\nWeighted rank scores:\n")
#>
#> Weighted rank scores:
print(best$scores_table[, c("k", "pac", "silhouette_mean",
"ari_mean_between_imputations", "score")])
#> k pac silhouette_mean ari_mean_between_imputations score
#> 1 2 0.2156841 0.8687290 0.7867553 1.000000
#> 2 3 0.2421807 0.8036448 0.7318288 2.285714
#> 3 4 0.2294383 0.7635850 0.7117037 2.714286All visualizations are based on the optimal k
selected in the previous step, accessed through
best$best_consensus_result.
A well-defined consensus solution produces a block-diagonal pattern: bright blocks along the diagonal indicate observations consistently assigned to the same cluster across all imputed datasets, while dark off-diagonal regions indicate observations rarely grouped together.
plot_consensus_matrix(
consensus_result = best,
reorder = TRUE,
viridis_option = "D",
show_labels = FALSE
)The dendrogram is built from the hierarchical clustering of the consensus dissimilarity \(1 - C\). Colored rectangles highlight the groups obtained by cutting the tree at the optimal \(k\).
We attach the consensus labels to the first imputed dataset for a descriptive summary of each cluster’s clinical profile.
pima_clustered <- mi_pima[[1]]
pima_clustered$cluster <- factor(best$best_consensus)
# Mean values per cluster
num_vars <- c("pregnant", "glucose", "pressure",
"triceps", "insulin", "mass", "pedigree", "age")
cat("Mean clinical values by cluster:\n")
#> Mean clinical values by cluster:
print(
aggregate(
pima_clustered[, num_vars],
by = list(Cluster = pima_clustered$cluster),
FUN = function(x) round(mean(x, na.rm = TRUE), 2)
)
)
#> Cluster pregnant glucose pressure triceps insulin mass pedigree age
#> 1 1 5.42 139.72 78.48 33.87 204.56 36.02 0.52 40.36
#> 2 2 2.53 106.51 66.96 24.51 111.37 29.56 0.43 27.32par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))
boxplot(
glucose ~ cluster,
data = pima_clustered,
main = "Plasma Glucose by Cluster",
xlab = "Cluster",
ylab = "Glucose (mg/dl)",
col = c("#4E79A7", "#F28E2B", "#E15759")
)
boxplot(
mass ~ cluster,
data = pima_clustered,
main = "Body Mass Index by Cluster",
xlab = "Cluster",
ylab = "BMI (kg/m²)",
col = c("#4E79A7", "#F28E2B", "#E15759")
)
boxplot(
insulin ~ cluster,
data = pima_clustered,
main = "Insulin",
xlab = "Cluster",
ylab = "Insulin (mu U/ml)",
col = c("#4E79A7", "#F28E2B", "#E15759")
)All previous steps can be executed in a single call using
run_mi_clustering(). This is the recommended approach for
most users.
set.seed(202604)
result <- run_mi_clustering(
data = imp,
method = "kmeans",
k = 2:4,
scale_data = "global",
consensus_method = "classic",
prefer_stability = TRUE,
tie_breaker = "silhouette"
)
cat("Best k selected:", result$best_k, "\n")
#> Best k selected: 2
cat("\nCluster sizes:\n")
#>
#> Cluster sizes:
print(table(result$best_consensus))
#>
#> 1 2
#> 349 419By default all imputed partitions contribute equally to the
co-assignment matrix (consensus_method = "classic"). When
partitions vary in quality,
consensus_method = "weighted_ari" weights each partition by
its mean Adjusted Rand Index (ARI) against all others — so more
consistent partitions contribute more to the final consensus.
set.seed(202604)
result_w <- run_mi_clustering(
data = imp,
method = "kmeans",
k = 2:4,
scale_data = "global",
consensus_method = "weighted_ari",
prefer_stability = TRUE,
tie_breaker = "silhouette"
)
cat("Best k (ARI-weighted consensus):", result_w$best_k, "\n")
#> Best k (ARI-weighted consensus): 2
print(table(result_w$best_consensus))
#>
#> 1 2
#> 348 420