Type: | Package |
Title: | Proximity Measure Based Diagnostics for Standard, Soft, and Multi-Way Clustering |
Version: | 0.9.4 |
Language: | en-US |
Maintainer: | Shrikrishna Bhat K <skbhat.in@gmail.com> |
Description: | Quantifies clustering quality by measuring both cohesion within clusters and separation between clusters. Implements advanced silhouette width computations for diverse clustering structures, including: simplified silhouette (Van der Laan et al., 2003) <doi:10.1080/0094965031000136012>, Probability of Alternative Cluster normalization methods (Raymaekers & Rousseeuw, 2022) <doi:10.1080/10618600.2022.2050249>, fuzzy clustering and silhouette diagnostics using membership probabilities (Campello & Hruschka, 2006; Bhat & Kiruthika, 2024) <doi:10.1016/j.fss.2006.07.006>, <doi:10.1080/23737484.2024.2408534>, and multi-way clustering extensions such as block and tensor clustering (Schepers et al., 2008; Bhat & Kiruthika, 2025) <doi:10.1007/s00357-008-9005-9>, <doi:10.21203/rs.3.rs-6973596/v1>. Provides tools for computation and visualization (Rousseeuw, 1987) <doi:10.1016/0377-0427(87)90125-7> to support robust and reproducible cluster diagnostics across standard, soft, and multi-way clustering settings. |
License: | GPL-2 |
URL: | https://kskbhat.github.io/Silhouette/ |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Imports: | dplyr, ggplot2, ggpubr, methods |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0), ppclust, blockcluster, cluster, factoextra, proxy |
VignetteBuilder: | knitr |
Config/testthat/edition: | 3 |
NeedsCompilation: | no |
Packaged: | 2025-07-28 09:02:31 UTC; Gubbi |
Author: | Shrikrishna Bhat K
|
Repository: | CRAN |
Date/Publication: | 2025-07-30 08:20:02 UTC |
Calculate Silhouette Widths, Summary, and Plot for Clustering Results
Description
Computes the silhouette width for each observation based on clustering results, measuring how similar an observation is to its own cluster compared to nearest neighbor cluster. The silhouette width ranges from -1 to 1, where higher values indicate better cluster cohesion and separation.
Usage
Silhouette(
prox_matrix,
proximity_type = c("dissimilarity", "similarity"),
method = c("medoid", "pac"),
prob_matrix = NULL,
a = 2,
sort = FALSE,
print.summary = FALSE,
clust_fun = NULL,
...
)
## S3 method for class 'Silhouette'
plot(x, label = FALSE, summary.legend = TRUE, grayscale = FALSE, ...)
## S3 method for class 'Silhouette'
summary(object, print.summary = TRUE, ...)
Arguments
prox_matrix |
A numeric matrix where rows represent observations and columns represent proximity measures (e.g., distances or similarities) to clusters. Typically, this is a membership or dissimilarity matrix from clustering results. If |
proximity_type |
Character string specifying the type of proximity measure in |
method |
Character string specifying the silhouette calculation method. Options are |
prob_matrix |
A numeric matrix where rows represent observations and columns represent cluster membership probabilities, depending on |
a |
Numeric value controlling the fuzzifier or weight scaling in fuzzy silhouette averaging. Higher values increase the emphasis on strong membership differences. Must be positive. Defaults to |
sort |
Logical; if |
print.summary |
Logical; if |
clust_fun |
Optional S3 or S4 function object or function as character string specifying a clustering function that produces the proximity measure matrix. For example, |
... |
Additional arguments passed to |
x |
An object of class |
label |
Logical; if |
summary.legend |
Logical; if |
grayscale |
Logical; if |
object |
An object of class |
Details
The Silhouette
function implements the Simplified Silhouette method introduced by Van der Laan, Pollard, & Bryan (2003), which adapts and generalizes the classic silhouette method of Rousseeuw (1987).
Clustering quality is evaluated using a proximity matrix, denoted as
\Delta = [\delta_{ik}]_{n \times K}
for dissimilarity measures or
\Delta' = [\delta'_{ik}]_{n \times K}
for similarity measures.
Here, i = 1, \ldots, n
indexes observations, and k = 1, \ldots, K
indexes clusters.
\delta_{ik}
represents the dissimilarity (e.g., distance) between observation i
and cluster k
,
while \delta'_{ik}
represents similarity values.
The silhouette width S(x_i)
for observation i
depends on the proximity type:
For dissimilarity measures:
S(x_i) = \frac{ \min_{k' \neq k} \delta_{ik'} - \delta_{ik} }{ N(x_i) }
For similarity measures:
S(x_i) = \frac{ \delta'_{ik} - \max_{k' \neq k} \delta'_{ik'} }{ N(x_i) }
where N(x_i)
is a normalizing factor defined by the method.
Choice of method:
The normalizer N(x_i)
is selected according to the method
argument. The method names reference their origins but may be used with any proximity matrix, not exclusively certain clustering algorithms:
For
medoid
(Van der Laan et al., 2003):Dissimilarity:
\max(\delta_{ik}, \min_{k' \neq k} \delta_{ik'})
Similarity:
\max(\delta'_{ik}, \max_{k' \neq k} \delta'_{ik'})
For
pac
(Raymaekers & Rousseeuw, 2022):Dissimilarity:
\delta_{ik} + \min_{k' \neq k} \delta_{ik'}
Similarity:
\delta'_{ik} + \max_{k' \neq k} \delta'_{ik'}
Note:
The "medoid"
and "pac"
options reflect the normalization formula—not a requirement to use the PAM algorithm or posterior/ensemble methods—and are general scoring approaches. These methods can be applied to any suitable proximity matrix, including proximity, similarity, or dissimilarity matrices derived from classification algorithms. This flexibility means silhouette indices may be computed to assess group separation when clusters or groups are formed from classification-derived proximities, not only from unsupervised clustering.
If prob_matrix
is NULL
, the crisp silhouette index (CS
) is:
CS = \frac{1}{n} \sum_{i=1}^{n} S(x_i)
summarizing overall clustering quality.
If prob_matrix
is provided, denoted as \Gamma = [\gamma_{ik}]_{n \times K}
,
with \gamma_{ik}
representing the probability of observation i
belonging to cluster k
,
the fuzzy silhouette index (FS
) is used:
FS = \frac{ \sum_{i=1}^{n} \left( \gamma_{ik} - \max_{k' \neq k} \gamma_{ik'} \right)^{\alpha} S(x_i) }{ \sum_{i=1}^{n} \left( \gamma_{ik} - \max_{k' \neq k} \gamma_{ik'} \right)^{\alpha} }
where \alpha
(the a
argument) controls the emphasis on confident assignments.
Value
A data frame of class "Silhouette"
containing cluster assignments, nearest neighbor clusters, silhouette widths for each observation, and weights (for fuzzy clustering). The object includes the following attributes:
- proximity_type
The proximity type used (
"similarity"
or"dissimilarity"
).- method
The silhouette calculation method used (
"medoid"
or"pac"
).
Further, summary
returns a list containing:
-
clus.avg.widths
: A named numeric vector of average silhouette widths per cluster. -
avg.width
: The overall average silhouette width. -
sil.sum
: A data frame with columnscluster
,size
, andavg.sil.width
summarizing cluster sizes and average silhouette widths.
References
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. doi:10.1016/0377-0427(87)90125-7
Van der Laan, M., Pollard, K., & Bryan, J. (2003). A new partitioning around medoids algorithm. Journal of Statistical Computation and Simulation, 73(8), 575–584. doi:10.1080/0094965031000136012
Campello, R. J., & Hruschka, E. R. (2006). A fuzzy extension of the silhouette width criterion for cluster analysis. Fuzzy Sets and Systems, 157(21), 2858–2875. doi:10.1016/j.fss.2006.07.006
Raymaekers, J., & Rousseeuw, P. J. (2022). Silhouettes and quasi residual plots for neural nets and tree-based classifiers. Journal of Computational and Graphical Statistics, 31(4), 1332–1343. doi:10.1080/10618600.2022.2050249
Bhat Kapu, S., & Kiruthika. (2024). Some density-based silhouette diagnostics for soft clustering algorithms. Communications in Statistics: Case Studies, Data Analysis and Applications, 10(3-4), 221-238. doi:10.1080/23737484.2024.2408534
See Also
softSilhouette
, plotSilhouette
Examples
# Standard silhouette with k-means on iris dataset
data(iris)
# Crisp Silhouette with k-means
out <- kmeans(iris[, -5], 3)
if (requireNamespace("ppclust", quietly = TRUE)) {
library(proxy)
dist <- proxy::dist(iris[, -5], out$centers)
silh_out <- Silhouette(dist,print.summary = TRUE)
plot(silh_out)
} else {
message("Install 'ppclust': install.packages('ppclust')")
}
# Scree plot for optimal clusters (2 to 7)
if (requireNamespace("ppclust", quietly = TRUE)) {
library(ppclust)
avg_sil_width <- numeric(6)
for (k in 2:7) {
out <- Silhouette(
prox_matrix = "d",
proximity_type = "dissimilarity",
prob_matrix = "u",
clust_fun = ppclust::fcm,
x = iris[, 1:4],
centers = k,
sort = TRUE
)
# Compute average silhouette width from widths
avg_sil_width[k - 1] <- summary(out, print.summary = FALSE)$avg.width
}
plot(avg_sil_width,
type = "o",
ylab = "Overall Silhouette Width",
xlab = "Number of Clusters",
main = "Scree Plot"
)
} else {
message("Install 'ppclust': install.packages('ppclust')")
}
Calculate Extended Silhouette Width for Multi-Way Clustering
Description
Computes an extended silhouette width for multi-way clustering (e.g., biclustering, triclustering, or n-mode tensor clustering) by combining silhouette widths from a list of Silhouette objects, each representing one mode of clustering. The extended silhouette width is the weighted average of the average silhouette widths from each mode, weighted by the number of observations in each mode's silhouette analysis. The output is an object of class extSilhouette
.
Usage
extSilhouette(sil_list, dim_names = NULL, print.summary = FALSE)
Arguments
sil_list |
A list of objects of class |
dim_names |
An optional character vector of dimension names (e.g., |
print.summary |
Logical; if |
Details
The extended silhouette width is computed as:
ExS = \frac{ \sum (n_i \cdot w_i) }{ \sum n_i }
where n_i
is the number of observations in mode i
(derived from nrow(x$widths)
), and w_i
is the average silhouette width for that mode (from x$avg.width
).
Each Silhouette
object in sil_list
must contain a non-empty widths
data frame and a numeric avg.width
value. Modes with zero observations (n_i = 0
) are not allowed, as they would result in an undefined weighted average. For consistency make sure all Silhouette
objects derived from same method
and arguments.
Value
A list of class "extSilhouette"
with the following components:
- ext_sil_width
A numeric scalar representing the extended silhouette width.
- dim_table
A data frame with columns
dimension
(e.g., "Mode 1", "Mode 2"),n_obs
(number of observations), andavg_sil_width
(average silhouette width for each mode).
References
Schepers, J., Ceulemans, E., & Van Mechelen, I. (2008). Selecting among multi-mode partitioning models of different complexities: A comparison of four model selection criteria. Journal of Classification, 25(1), 67–85. doi:10.1007/s00357-008-9005-9
Bhat Kapu, S., & Kiruthika, C. (2025). Block Probabilistic Distance Clustering: A Unified Framework and Evaluation. PREPRINT (Version 1) available at Research Square. doi:10.21203/rs.3.rs-6973596/v1
See Also
Examples
# Example using iris dataset with two modes
data(iris)
if (requireNamespace("blockcluster", quietly = TRUE)) {
library(blockcluster)
result <- coclusterContinuous(
as.matrix(iris[, -5]),
nbcocluster = c(3, 2)
)
} else {
message("Install 'blockcluster': install.packages('blockcluster')")
}
if (requireNamespace("blockcluster", quietly = TRUE)) {
sil_mode1 <- softSilhouette(
prob_matrix = result@rowposteriorprob,
method = "pac")
sil_mode2 <- softSilhouette(
prob_matrix = result@colposteriorprob,
method = "pac"
)
# Extended silhouette
ext_sil <- extSilhouette(list(sil_mode1, sil_mode2),print.summary = TRUE)
}
Plot Silhouette Analysis Results
Description
Creates a silhouette plot for visualizing the silhouette widths of clustering results, with bars colored by cluster and an optional summary of cluster statistics in legend.
Usage
plotSilhouette(x, label = FALSE, summary.legend = TRUE, grayscale = FALSE, ...)
Arguments
x |
An object of class |
label |
Logical; if |
summary.legend |
Logical; if |
grayscale |
Logical; if |
... |
Additional arguments passed to |
Details
The Silhouette plot displays the silhouette width (sil_width
) for each observation, grouped by cluster, with bars sorted by cluster and descending silhouette width. The summary.legend
option adds cluster sizes and average silhouette widths to the legend.
This function replica of S3 method for objects of class "Silhouette"
, typically produced by the Silhouette
or softSilhouette
functions in this package. It also supports objects of the following classes, with silhouette information extracted from their respective component:
-
"eclust"
: Produced byeclust
from the factoextra package. -
"hcut"
: Produced byhcut
from the factoextra package. -
"pam"
: Produced bypam
from the cluster package. -
"clara"
: Produced byclara
from the cluster package. -
"fanny"
: Produced byfanny
from the cluster package. -
"silhouette"
: Produced bysilhouette
from the cluster package.
For these classes ("eclust"
, "hcut"
, "pam"
, "clara"
, "fanny"
, "silhouette"
), users should explicitly call plotSilhouette()
(e.g., plotSilhouette(pam_result)
) to ensure the correct method is used, as the generic plot()
may not dispatch to this function for these objects.
Value
A ggplot2
object representing the Silhouette plot.
References
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. doi:10.1016/0377-0427(87)90125-7
See Also
Silhouette
, softSilhouette
, eclust
, hcut
, pam
, clara
, fanny
, silhouette
Examples
data(iris)
# Crisp Silhouette with k-means
out <- kmeans(iris[, -5], 3)
if (requireNamespace("proxy", quietly = TRUE)) {
library(proxy)
dist <- dist(iris[, -5], out$centers)
plot(Silhouette(dist))
}
#' # Fuzzy Silhouette with ppclust::fcm
if (requireNamespace("ppclust", quietly = TRUE)) {
library(ppclust)
out_fuzzy <- Silhouette(
prox_matrix = "d",
proximity_type = "dissimilarity",
prob_matrix = "u",
clust_fun = ppclust::fcm,
x = iris[, 1:4],
centers = 3,
sort = TRUE
)
plot(out_fuzzy, summary.legend = FALSE, grayscale = TRUE)
} else {
message("Install 'ppclust': install.packages('ppclust')")
}
# Silhouette plot for pam clustering
if (requireNamespace("cluster", quietly = TRUE)) {
library(cluster)
pam_result <- pam(iris[, 1:4], k = 3)
plotSilhouette(pam_result)
}
# Silhouette plot for clara clustering
if (requireNamespace("cluster", quietly = TRUE)) {
clara_result <- clara(iris[, 1:4], k = 3)
plotSilhouette(clara_result)
}
# Silhouette plot for fanny clustering
if (requireNamespace("cluster", quietly = TRUE)) {
fanny_result <- fanny(iris[, 1:4], k = 3)
plotSilhouette(fanny_result)
}
# Example using base silhouette() object
if (requireNamespace("cluster", quietly = TRUE)) {
sil <- silhouette(pam_result)
plotSilhouette(sil)
}
# Silhouette plot for eclust clustering
if (requireNamespace("factoextra", quietly = TRUE)) {
library(factoextra)
eclust_result <- eclust(iris[, 1:4], "kmeans", k = 3, graph = FALSE)
plotSilhouette(eclust_result)
}
# Silhouette plot for hcut clustering
if (requireNamespace("factoextra", quietly = TRUE)) {
hcut_result <- hcut(iris[, 1:4], k = 3)
plotSilhouette(hcut_result)
}
Calculate Silhouette Width for Soft Clustering Algorithms
Description
Computes silhouette widths for soft clustering results by interpreting cluster membership probabilities (or their transformations) as proximity measures. Although originally designed for evaluating clustering quality within a method, this adaptation allows heuristic comparison across soft clustering algorithms using average silhouette widths.
Usage
softSilhouette(
prob_matrix,
prob_type = c("pp", "nlpp", "pd"),
method = c("pac", "medoid"),
average = c("crisp", "fuzzy"),
a = 2,
print.summary = FALSE,
clust_fun = NULL,
...
)
Arguments
prob_matrix |
A numeric matrix where rows represent observations and columns represent cluster membership probabilities (or transformed probabilities, depending on |
prob_type |
Character string specifying the type transformation of membership matrix considered as proximity matrix in
Defaults to |
method |
Character string specifying the silhouette calculation method. Options are |
average |
Character string specifying the type of average silhouette width calculation. Options are |
a |
Numeric value controlling the fuzzifier or weight scaling in fuzzy silhouette averaging. Higher values increase the emphasis on strong membership differences. Must be positive. Defaults to |
print.summary |
Logical; if |
clust_fun |
Optional S3 or S4 function object or function as character string specifying a clustering function that produces the proximity measure matrix. For example, |
... |
Additional arguments passed to |
Details
Although the silhouette method was originally developed for evaluating clustering structure within a single result, this implementation allows leveraging cluster membership probabilities from soft clustering methods to construct proximity-based silhouettes. These silhouette widths can be compared heuristically across different algorithms to assess clustering quality.
See doi:10.1080/23737484.2024.2408534 for more details.
Value
A data frame of class "Silhouette"
containing cluster assignments, nearest neighbor clusters, silhouette widths for each observation, and weights (for fuzzy clustering). The object includes the following attributes:
- proximity_type
The proximity type used (
"similarity"
or"dissimilarity"
).- method
The silhouette calculation method used (
"medoid"
or"pac"
).
References
Raymaekers, J., & Rousseeuw, P. J. (2022). Silhouettes and quasi residual plots for neural nets and tree-based classifiers. Journal of Computational and Graphical Statistics, 31(4), 1332–1343. doi:10.1080/10618600.2022.2050249
Bhat Kapu, S., & Kiruthika. (2024). Some density-based silhouette diagnostics for soft clustering algorithms. Communications in Statistics: Case Studies, Data Analysis and Applications, 10(3-4), 221-238. doi:10.1080/23737484.2024.2408534
See Also
Examples
# Compare two soft clustering algorithms using softSilhouett
# Example: FCM vs. FCM2 on iris data, using average silhouette width as a criterion
data(iris)
if (requireNamespace("ppclust", quietly = TRUE)) {
fcm_result <- ppclust::fcm(iris[, 1:4], 3)
out_fcm <- softSilhouette(prob_matrix = fcm_result$u,print.summary = TRUE)
plot(out_fcm)
sfcm <- summary(out_fcm, print.summary = FALSE)
} else {
message("Install 'ppclust' to run this example: install.packages('ppclust')")
}
if (requireNamespace("ppclust", quietly = TRUE)) {
fcm2_result <- ppclust::fcm2(iris[, 1:4], 3)
out_fcm2 <- softSilhouette(prob_matrix = fcm2_result$u,print.summary = TRUE)
plot(out_fcm2)
sfcm2 <- summary(out_fcm2, print.summary = FALSE)
} else {
message("Install 'ppclust' to run this example: install.packages('ppclust')")
}
# Compare average silhouette widths of fcm and fcm2
if (requireNamespace("ppclust", quietly = TRUE)) {
cat("FCM average silhouette width:", sfcm$avg.width, "\n")
cat("FCM2 average silhouette width:", sfcm2$avg.width, "\n")
}