---
title: "Consensus Clustering with Multiple Imputed Data using cclustr"
author: "Anhuar Duran Mendoza, Andres Montenegro Lemus, Mario Pacheco Lopez"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Consensus Clustering with Multiple Imputed Data using cclustr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  fig.width = 7,
  fig.height = 5,
  warning   = FALSE,
  message   = FALSE
)
```

## Introduction

Missing data is a pervasive challenge in applied statistics. When performing
cluster analysis, the standard practice of imputing a single dataset and then
clustering ignores the uncertainty introduced by the imputation process,
potentially leading to unstable or misleading cluster assignments.

**cclustr** addresses this problem by applying clustering algorithms across
all imputed datasets and synthesizing the results into a single, robust
consensus partition. The core idea is the **co-assignment matrix**: a
symmetric matrix $C$ where each entry $C_{ij}$ represents the proportion of
imputed datasets in which observations $i$ and $j$ were assigned to the same
cluster. High values indicate stable, robust groupings across imputations;
low values indicate ambiguous assignments.

The main workflow of `cclustr` consists of four steps:

1. **Standardize** imputed data with `as_mild_list()`
2. **Cluster** each imputed dataset with `cluster_imputations()`
3. **Synthesize** results into a consensus partition with `consensus_clustering()`
4. **Validate and select** the optimal number of clusters with
   `validate_clustering()` and `choose_best_clustering()`

All four steps can also be executed in a single call using the convenience
wrapper `run_mi_clustering()`.

---

## Dataset: Pima Indians Diabetes

We use the **Pima Indians Diabetes Database** (`PimaIndiansDiabetes2`) from
the `mlbench` package, which contains clinical measurements for 768 women of
Pima Indian heritage. The dataset presents **652 missing values** distributed
across several variables, strongly motivating the use of multiple imputation
before clustering.

### Variables

| Variable   | Type    | Description                                      |
|------------|---------|--------------------------------------------------|
| `pregnant` | Numeric | Number of times pregnant                         |
| `glucose`  | Numeric | Plasma glucose concentration (2h oral test)      |
| `pressure` | Numeric | Diastolic blood pressure (mm Hg)                 |
| `triceps`  | Numeric | Triceps skin fold thickness (mm)                 |
| `insulin`  | Numeric | 2-hour serum insulin (mu U/ml)                   |
| `mass`     | Numeric | Body mass index (kg/m²)                          |
| `pedigree` | Numeric | Diabetes pedigree function                       |
| `age`      | Numeric | Age in years                                     |
| `diabetes` | Factor  | Diabetes test result (positive/negative)         |

> **Note:** The `diabetes` variable is excluded from the clustering analysis.
> This avoids circular reasoning and respects the unsupervised nature of the
> analysis.

### Loading the data

```{r load-data}
library(cclustr)
library(mice)

if (!requireNamespace("mlbench", quietly = TRUE)) {
  stop("Package 'mlbench' is required for this vignette. ",
       "Install it with: install.packages('mlbench')")
}

data("PimaIndiansDiabetes2", package = "mlbench")

# Remove diabetes from the clustering dataset
df <- PimaIndiansDiabetes2[, -9]

cat("Dimensions:", nrow(df), "x", ncol(df), "\n")
cat("\nMissing values per variable:\n")
print(colSums(is.na(df)))
cat("\nTotal missing values:", sum(is.na(df)),
    "(", round(mean(is.na(df)) * 100, 1), "%)\n")
```

The high proportion of missing values — particularly in `insulin` (48.7%) and
`triceps` (29.6%) — makes this dataset a compelling real-world case study for
multiple-imputation clustering.

---

## Step 1: Multiple Imputation with mice

We generate 20 imputed datasets using `mice` with predictive mean matching
(`pmm`). The `as_mild_list()` function from `cclustr` then converts the
`mids` object into a standardized named list of completed data frames.

```{r imputation}

imp <- mice(df, m = 20, maxit = 20, method = "pmm", printFlag = FALSE, seed = 202604)

# Convert to cclustr standard format
mi_pima <- as_mild_list(imp)

cat("Number of imputed datasets :", length(mi_pima), "\n")
cat("Dimensions of each dataset :", nrow(mi_pima[[1]]),
    "x", ncol(mi_pima[[1]]), "\n")
cat("Any missing values remaining:", any(sapply(mi_pima, anyNA)), "\n")
```

`as_mild_list()` performs strict consistency checks across all imputed
datasets, ensuring identical dimensions, column names, and the absence of
`NA`, `NaN`, `Inf`, or constant-valued columns.

---

## Step 2: Clustering Each Imputed Dataset

Since all clustering variables are numeric, we use **k-means
clustering** and `scale_data = "global"`. The
global scaling option computes the pooled mean and standard deviation across
all imputations without stacking them in memory, ensuring that distances are
on the same scale across all imputed datasets — a key requirement for a
meaningful co-assignment matrix.

We evaluate $k = 2, 3, 4$ clusters simultaneously.

```{r clustering}
set.seed(202604)
partitions <- cluster_imputations(
  imp_list   = mi_pima,
  method     = "kmeans",
  k          = 2:4,
  scale_data = "global"
)

cat("Available k values:", names(partitions), "\n")
cat("Number of partitions per k:", length(partitions$k2), "\n")
```

The result is a named list of lists: one element per $k$ value (`k2`, `k3`,
`k4`), each containing one integer vector of cluster assignments per imputed
dataset.

---

## Step 3: Consensus Clustering

`consensus_clustering()` builds the **co-assignment matrix** $C$ and derives
a final partition from it. This process involves **two conceptually distinct
steps**:

1. **How the co-assignment matrix is constructed** (`consensus_method`)
2. **How the final clusters are extracted** (`cluster_method`)

### The `consensus_method` parameter

The `consensus_method` argument controls **how each partition contributes
to the co-assignment matrix** $C$.

| Method            | Description |
|------------------|------------|
| `"classic"`      | Unweighted co-assignment. All partitions contribute equally to $C$. This is the standard approach in consensus clustering literature. |
| `"weighted_ari"` | Partitions are weighted according to their **centrality** within the ensemble, measured by their mean Adjusted Rand Index (ARI) against all other partitions. More consistent partitions have greater influence on $C$. Requires the `mclust` package. |

### The `cluster_method` parameter

The `cluster_method` argument controls the **agglomeration method** used
in this second-stage hierarchical clustering step. It is important to
distinguish it from the `method` argument in `cluster_imputations()`:

- `method` (in `cluster_imputations`): the algorithm used to cluster each
  imputed dataset (e.g., `"kmeans"`, `"ward.D2"`, `"pam"`).
- `cluster_method` (in `consensus_clustering`): the linkage criterion used
  to hierarchically cluster the **co-assignment dissimilarity** $1 - C$ in
  order to derive the final consensus partition.

The available options for `cluster_method` are the standard agglomeration
methods from `stats::hclust()`:

| Method        | Description                                                         |
|---------------|---------------------------------------------------------------------|
| `"ward.D2"`   | (default) Minimizes total within-cluster variance. Tends to produce compact, roughly equal-sized clusters. Recommended for most cases. |
| `"ward.D"`    | Earlier Ward formulation; similar behavior to `"ward.D2"` but uses unsquared distances internally. |
| `"complete"`  | Maximum linkage. Tends to produce compact clusters of similar diameter. |
| `"average"`   | UPGMA. Compromise between single and complete; less sensitive to outliers than complete linkage. |
| `"single"`    | Minimum linkage. Sensitive to chaining effects; not recommended for consensus matrices. |
| `"centroid"`  | Uses the centroid of each cluster. Can produce inversions in the dendrogram. |
| `"median"`    | Weighted centroid. Also susceptible to inversions. |
| `"mcquitty"`  | WPGMA. Similar to average but uses equal weighting. |

For consensus matrices, `"ward.D2"` and `"complete"` are generally the
most appropriate choices, as they favor compact, well-separated groups
that match the block-diagonal structure of a stable co-assignment matrix.

```{r consensus}
consensus_results <- consensus_clustering(
  partitions     = partitions,
  cluster_method = "ward.D2",
  consensus_method = "classic"
)

cat("Consensus solutions available:", names(consensus_results), "\n")
```

---

## Step 4: Validation and Optimal k Selection

`validate_clustering()` computes a comprehensive set of internal and stability
metrics for each candidate $k$. All metrics are computed on the consensus
dissimilarity $1 - C$, making them robust to any data type.

```{r validation}
val_table <- validate_clustering(
  partitions        = partitions,
  consensus_results = consensus_results
)

print(val_table)
```

### Metric Reference

| Metric                          | Better when | Description                                    |
|---------------------------------|-------------|------------------------------------------------|
| `pac`                           | Lower ↓     | Proportion of Ambiguous Clusterings            |
| `silhouette_mean`               | Higher ↑    | Mean silhouette width                          |
| `ari_mean_between_imputations`  | Higher ↑    | Stability across imputed datasets              |
| `ari_consensus_mean`            | Higher ↑    | Agreement between consensus and each partition |
| `calinski_harabasz_mean`        | Higher ↑    | Cluster compactness and separation             |
| `davies_bouldin_mean`           | Lower ↓     | Average cluster similarity                     |
| `dunn_index`                    | Higher ↑    | Min inter-cluster / max intra-cluster distance |

`choose_best_clustering()` aggregates all metrics using **weighted rank
aggregation**: each metric is independently ranked and combined into a single
score. The $k$ with the lowest weighted rank score is selected as optimal.

```{r selection}
best <- choose_best_clustering(
  validation_table  = val_table,
  consensus_results = consensus_results,
  prefer_stability  = TRUE,
  tie_breaker       = "silhouette"
)

cat("Optimal k selected:", best$best_k, "\n")
cat("\nConsensus cluster sizes:\n")
print(table(best$best_consensus))
cat("\nWeighted rank scores:\n")
print(best$scores_table[, c("k", "pac", "silhouette_mean",
                             "ari_mean_between_imputations", "score")])
```

### Plotting Validation Metrics

```{r validation-plot, fig.cap = "Validation metrics across k = 2, 3, 4. The optimal value per metric is highlighted in red."}
plot_validation_metrics(best)
```

---

## Visualizing the Consensus Solution

All visualizations are based on the **optimal k** selected in the previous
step, accessed through `best$best_consensus_result`.

### Co-assignment Matrix

A well-defined consensus solution produces a **block-diagonal pattern**:
bright blocks along the diagonal indicate observations consistently assigned
to the same cluster across all imputed datasets, while dark off-diagonal
regions indicate observations rarely grouped together.

```{r heatmap, fig.cap = "Co-assignment matrix for the optimal k. A clear block-diagonal structure indicates a stable consensus."}
plot_consensus_matrix(
  consensus_result = best,
  reorder          = TRUE,
  viridis_option   = "D",
  show_labels = FALSE
)
```

### Consensus Dendrogram

The dendrogram is built from the hierarchical clustering of the consensus
dissimilarity $1 - C$. Colored rectangles highlight the groups obtained by
cutting the tree at the optimal $k$.

```{r dendrogram, fig.cap = "Consensus dendrogram for the optimal k, derived from the dissimilarity matrix 1 - C."}
plot_consensus_dendrogram(
  consensus_result = best,
  rect             = TRUE,
  labels           = FALSE
)
```

---

## Interpreting the Final Clusters

We attach the consensus labels to the first imputed dataset for a descriptive summary of each cluster's clinical profile.

```{r cluster-summary}
pima_clustered          <- mi_pima[[1]]
pima_clustered$cluster  <- factor(best$best_consensus)

# Mean values per cluster
num_vars <- c("pregnant", "glucose", "pressure",
              "triceps", "insulin", "mass", "pedigree", "age")

cat("Mean clinical values by cluster:\n")
print(
  aggregate(
    pima_clustered[, num_vars],
    by  = list(Cluster = pima_clustered$cluster),
    FUN = function(x) round(mean(x, na.rm = TRUE), 2)
  )
)
```

### Key Variables by Cluster

```{r boxplots, fig.cap = "Distribution of glucose and BMI by consensus cluster."}
par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))

boxplot(
  glucose ~ cluster,
  data = pima_clustered,
  main = "Plasma Glucose by Cluster",
  xlab = "Cluster",
  ylab = "Glucose (mg/dl)",
  col  = c("#4E79A7", "#F28E2B", "#E15759")
)

boxplot(
  mass ~ cluster,
  data = pima_clustered,
  main = "Body Mass Index by Cluster",
  xlab = "Cluster",
  ylab = "BMI (kg/m²)",
  col  = c("#4E79A7", "#F28E2B", "#E15759")
)

boxplot(
  insulin ~ cluster,
  data = pima_clustered,
  main = "Insulin",
  xlab = "Cluster",
  ylab = "Insulin (mu U/ml)",
  col  = c("#4E79A7", "#F28E2B", "#E15759")
)
par(mfrow = c(1, 1))
```

---

## Complete Pipeline with run_mi_clustering()

All previous steps can be executed in a single call using `run_mi_clustering()`.
This is the recommended approach for most users.

```{r full-pipeline}
set.seed(202604)

result <- run_mi_clustering(
  data             = imp,
  method           = "kmeans",
  k                = 2:4,
  scale_data       = "global",
  consensus_method = "classic",
  prefer_stability = TRUE,
  tie_breaker      = "silhouette"
)

cat("Best k selected:", result$best_k, "\n")
cat("\nCluster sizes:\n")
print(table(result$best_consensus))
```

---

## Advanced: ARI-Weighted Consensus

By default all imputed partitions contribute equally to the co-assignment
matrix (`consensus_method = "classic"`). When partitions vary in quality,
`consensus_method = "weighted_ari"` weights each partition by its mean
Adjusted Rand Index (ARI) against all others — so more consistent partitions
contribute more to the final consensus.

```{r weighted-ari, eval = requireNamespace("mclust", quietly = TRUE)}
set.seed(202604)

result_w <- run_mi_clustering(
  data             = imp,
  method           = "kmeans",
  k                = 2:4,
  scale_data       = "global",
  consensus_method = "weighted_ari",
  prefer_stability = TRUE,
  tie_breaker      = "silhouette"
)

cat("Best k (ARI-weighted consensus):", result_w$best_k, "\n")
print(table(result_w$best_consensus))
```