\documentclass[fleqn]{article}
\usepackage[round,longnamesfirst]{natbib}
\usepackage{graphicx,keyval,hyperref,doi}
\newcommand\argmin{\mathop{\mathrm{arg min}}}
\newcommand\trace{\mathop{\mathrm{tr}}}
\newcommand\R{{\mathbb{R}}}
\newcommand{\pkg}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\newcommand{\sQuote}[1]{`{#1}'}
\newcommand{\dQuote}[1]{``{#1}''}
\let\code=\texttt
\newcommand{\file}[1]{\sQuote{\textsf{#1}}}
\newcommand{\class}[1]{\code{"#1"}}
\SweaveOpts{strip.white=true}
\AtBeginDocument{\setkeys{Gin}{width=0.6\textwidth}}
\date{2007-06-28}
\title{A CLUE for CLUster Ensembles}
\author{Kurt Hornik}
%% \VignetteIndexEntry{CLUster Ensembles}
\sloppy{}

\begin{document}
\maketitle

\begin{abstract}
  Cluster ensembles are collections of individual solutions to a given clustering problem which are useful or necessary to consider in a wide range of applications.  The R package~\pkg{clue} provides an extensible computational environment for creating and analyzing cluster ensembles, with basic data structures for representing partitions and hierarchies, and facilities for computing on these, including methods for measuring proximity and obtaining consensus and ``secondary'' clusterings.
\end{abstract}

<<echo=FALSE>>=
options(width = 60)
library("clue")
@ %

\section{Introduction}
\label{sec:introduction}

\emph{Cluster ensembles} are collections of clusterings, which are all of the same ``kind'' (e.g., collections of partitions, or collections of hierarchies), of a set of objects.  Such ensembles can be obtained, for example, by varying the (hyper)parameters of a ``base'' clustering algorithm, by resampling or reweighting the set of objects, or by employing several different base clusterers.

Questions of ``agreement'' in cluster ensembles, and obtaining ``consensus'' clusterings from them, have been studied in several scientific communities for quite some time now.  A special issue of the Journal of Classification was devoted to ``Comparison and Consensus of Classifications'' \citep{cluster:Day:1986} almost two decades ago.  The recent popularization of ensemble methods such as Bayesian model averaging \citep{cluster:Hoeting+Madigan+Raftery:1999}, bagging \citep{cluster:Breiman:1996} and boosting \citep{cluster:Friedman+Hastie+Tibshirani:2000}, typically in a supervised learning context, has also furthered the research interest in using ensemble methods to improve the quality and robustness of cluster solutions.  Cluster ensembles can also be utilized to aggregate base results over conditioning or grouping variables in multi-way data, to reuse existing knowledge, and to accommodate the needs of distributed computing; see e.g.\ \cite{cluster:Hornik:2005a} and \cite{cluster:Strehl+Ghosh:2003a} for more information.

Package~\pkg{clue} is an extension package for R~\citep{cluster:R:2005} providing a computational environment for creating and analyzing cluster ensembles.  In Section~\ref{sec:structures+algorithms}, we describe the underlying data structures, and the functionality for measuring proximity, obtaining consensus clusterings, and ``secondary'' clusterings.  Four examples are discussed in Section~\ref{sec:examples}.  Section~\ref{sec:outlook} concludes the paper.  A previous version of this manuscript was published in the \emph{Journal of Statistical Software} \citep{cluster:Hornik:2005b}.
\section{Data structures and algorithms} \label{sec:structures+algorithms} \subsection{Partitions and hierarchies} Representations of clusterings of objects greatly vary across the multitude of methods available in R packages. For example, the class ids (``cluster labels'') for the results of \code{kmeans()} in base package~\pkg{stats}, \code{pam()} in recommended package~\pkg{cluster}~\citep{cluster:Rousseeuw+Struyf+Hubert:2005, cluster:Struyf+Hubert+Rousseeuw:1996}, and \code{Mclust()} in package~\pkg{mclust}~\citep{cluster:Fraley+Raftery+Wehrens:2005, cluster:Fraley+Raftery:2003}, are available as components named \code{cluster}, \code{clustering}, and \code{classification}, respectively, of the R objects returned by these functions. In many cases, the representations inherit from suitable classes. (We note that for versions of R prior to 2.1.0, \code{kmeans()} only returned a ``raw'' (unclassed) result, which was changed alongside the development of \pkg{clue}.) We deal with this heterogeneity of representations by providing getters for the key underlying data, such as the number of objects from which a clustering was obtained, and predicates, e.g.\ for determining whether an R object represents a partition of objects or not. These getters, such as \code{n\_of\_objects()}, and predicates are implemented as S3 generics, so that there is a \emph{conceptual}, but no formal class system underlying the predicates. Support for classed representations can easily be added by providing S3 methods. \subsubsection{Partitions} The partitions considered in \pkg{clue} are possibly soft (``fuzzy'') partitions, where for each object~$i$ and class~$j$ there is a non-negative number~$\mu_{ij}$ quantifying the ``belongingness'' or \emph{membership} of object~$i$ to class~$j$, with $\sum_j \mu_{ij} = 1$. For hard (``crisp'') partitions, all $\mu_{ij}$ are in $\{0, 1\}$. We can gather the $\mu_{ij}$ into the \emph{membership matrix} $M = [\mu_{ij}]$, where rows correspond to objects and columns to classes. The \emph{number of classes} of a partition, computed by function \code{n\_of\_classes()}, is the number of $j$ for which $\mu_{ij} > 0$ for at least one object~$i$. This may be less than the number of ``available'' classes, corresponding to the number of columns in a membership matrix representing the partition. The predicate functions \code{is.cl\_partition()}, \code{is.cl\_hard\_partition()}, and \code{is.cl\_soft\_partition()} are used to indicate whether R objects represent partitions of objects of the respective kind, with hard partitions as characterized above (all memberships in $\{0, 1\}$). (Hence, ``fuzzy clustering'' algorithms can in principle also give a hard partition.) \code{is.cl\_partition()} and \code{is.cl\_hard\_partition()} are generic functions; \code{is.cl\_soft\_partition()} gives true iff \code{is.cl\_partition()} is true and \code{is.cl\_hard\_partition()} is false. For R objects representing partitions, function \code{cl\_membership()} computes an R object with the membership values, currently always as a dense membership matrix with additional attributes. This is obviously rather inefficient for computations on hard partitions; we are planning to add ``canned'' sparse representations (using the vector of class ids) in future versions. Function \code{as.cl\_membership()} can be used for coercing \dQuote{raw} class ids (given as atomic vectors) or membership values (given as numeric matrices) to membership objects. 
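
As a brief illustration (the chunk is not evaluated when building the vignette), the following sketch applies some of these getters and predicates to a hard $k$-means partition; the use of the \code{iris} measurements and of three classes is arbitrary and purely for exposition.
<<eval=FALSE>>=
## Not run: a minimal sketch of the getters and predicates.
p <- kmeans(iris[, 1:4], 3)  ## a hard partition from base R
is.cl_partition(p)           ## TRUE
is.cl_hard_partition(p)      ## TRUE
n_of_objects(p)              ## 150
n_of_classes(p)              ## 3
M <- cl_membership(p)        ## dense membership matrix with 0/1 entries
as.cl_membership(c(1, 1, 2, 3, 3))  ## memberships from raw class ids
@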
Function \code{cl\_class\_ids()} determines the class ids of a partition.  For soft partitions, the class ids returned are those of the \dQuote{nearest} hard partition obtained by taking the class ids of the (first) maximal membership values.  Note that the cardinality of the set of the class ids may be less than the number of classes in the (soft) partition.

Many partitioning methods are based on \emph{prototypes} (``centers'').  In typical cases, these are points~$p_j$ in the same feature space as the measurements~$x_i$ on the objects~$i$ to be partitioned, so that one can measure the distance between objects and prototypes, and e.g.\ classify objects to their closest prototype.  Such partitioning methods can also induce partitions of the entire feature space (rather than ``just'' the set of objects to be partitioned).  Currently, package \pkg{clue} has only minimal support for this ``additional'' structure, providing a \code{cl\_prototypes()} generic for extracting the prototypes, and is mostly focused on computations on partitions which are based on their memberships.

Many algorithms resulting in partitions of a given set of objects can be taken to induce a partition of the underlying feature space for the measurements on the objects, so that class memberships for ``new'' objects can be obtained from the induced partition.  Examples include partitions based on assigning objects to their ``closest'' prototypes, or providing mixture models for the distribution of objects in feature space.  Package~\pkg{clue} provides a \code{cl\_predict()} generic for predicting the class memberships of new objects (if possible).

Function \code{cl\_fuzziness()} computes softness (fuzziness) measures for (ensembles of) partitions.  Built-in measures are the partition coefficient \label{PC} and partition entropy \citep[e.g.,][]{cluster:Bezdek:1981}, with an option to normalize in a way that hard partitions and the ``fuzziest'' possible partition (where all memberships are the same) get fuzziness values of zero and one, respectively.  Note that this normalization differs from ``standard'' ones in the literature.

In the sequel, we shall also use the concept of the \emph{co-membership matrix} $C(M) = M M'$ of a partition, where $'$ denotes matrix transposition.  For hard partitions, an entry $c_{ij}$ of $C(M)$ is 1 iff the corresponding objects $i$ and $j$ are in the same class, and 0 otherwise.

\subsubsection{Hierarchies}

The hierarchies considered in \pkg{clue} are \emph{total indexed hierarchies}, also known as \emph{$n$-valued trees}, and hence correspond in a one-to-one manner to \emph{ultrametrics} (distances $u_{ij}$ between pairs of objects $i$ and $j$ which satisfy the ultrametric constraint $u_{ij} \le \max(u_{ik}, u_{jk})$ for all triples $i$, $j$, and $k$).  See e.g.~\citet[Pages~69--71]{cluster:Gordon:1999}.

Function \code{cl\_ultrametric(x)} computes the associated ultrametric from an R object \code{x} representing a hierarchy of objects.  If \code{x} is not an ultrametric, function \code{cophenetic()} in base package~\pkg{stats} is used to obtain the ultrametric (also known as cophenetic) distances from the hierarchy, which in turn by default calls the S3 generic \code{as.hclust()} (also in \pkg{stats}) on the hierarchy.  Support for classes which represent hierarchies can thus be added by providing \code{as.hclust()} methods for these classes.
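
For illustration (again not evaluated here), an ultrametric can be obtained from a standard \code{hclust()} result as follows; the artificial data serve only to have something to cluster.
<<eval=FALSE>>=
## Not run: ultrametric (cophenetic) distances of a single linkage hierarchy.
d <- dist(matrix(rnorm(50), ncol = 5))
h <- hclust(d, method = "single")
is.cl_hierarchy(h)      ## TRUE
u <- cl_ultrametric(h)  ## the associated ultrametric
@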
In R~2.1.0 or better (again as part of the work on \pkg{clue}), \code{cophenetic()} is an S3 generic as well, and one can also more directly provide methods for this if necessary.

In addition, there is a generic function \code{as.cl\_ultrametric()} which can be used for coercing \emph{raw} (non-classed) ultrametrics, represented as numeric vectors (of the lower-half entries) or numeric matrices, to ultrametric objects.  Finally, the generic predicate function \code{is.cl\_hierarchy()} is used to determine whether an R object represents a hierarchy or not.

Ultrametric objects can also be coerced to classes~\class{dendrogram} and \class{hclust} (from base package~\pkg{stats}), so that in particular the \code{plot()} methods for these classes can be used.  By default, plotting an ultrametric object uses the plot method for dendrograms.

Obtaining a hierarchy on a given set of objects can be thought of as transforming the pairwise dissimilarities between the objects (which typically do not yet satisfy the ultrametric constraints) into an ultrametric.  Ideally, this ultrametric should be as close as possible to the dissimilarities.  In some important cases, explicit solutions are possible (e.g., ``standard'' hierarchical clustering with single or complete linkage gives the optimal ultrametric dominated by or dominating the dissimilarities, respectively).  On the other hand, the problem of finding the closest ultrametric in the least squares sense is known to be NP-hard \citep{cluster:Krivanek+Moravek:1986,cluster:Krivanek:1986}.  One important class of heuristics for finding least squares fits is based on iterative projection on convex sets of constraints \citep{cluster:Hubert+Arabie:1995}.

\label{SUMT}
Function \code{ls\_fit\_ultrametric()} follows \cite{cluster:DeSoete:1986} to use an SUMT \citep[Sequential Unconstrained Minimization Technique;][]{cluster:Fiacco+McCormick:1968} approach, in turn simplifying the suggestions in \cite{cluster:Carroll+Pruzansky:1980}.  Let $L(u)$ be the function to be minimized over all $u$ in some constrained set $\mathcal{U}$---in our case, $L(u) = \sum (d_{ij}-u_{ij})^2$ is the least squares criterion, and $\mathcal{U}$ is the set of all ultrametrics $u$.  One iteratively minimizes $L(u) + \rho_k P(u)$, where $P(u)$ is a non-negative function penalizing violations of the constraints such that $P(u)$ is zero iff $u \in \mathcal{U}$.  The $\rho$ values are increased according to the rule $\rho_{k+1} = q \rho_k$ for some constant $q > 1$, until convergence is obtained in the sense that e.g.\ the Euclidean distance between successive solutions $u_k$ and $u_{k+1}$ is small enough.  Optionally, the final $u_k$ is then suitably projected onto $\mathcal{U}$.

For \code{ls\_fit\_ultrametric()}, we obtain the starting value $u_0$ by \dQuote{random shaking} of the given dissimilarity object, and use the penalty function $P(u) = \sum_{\Omega} (u_{ij} - u_{jk}) ^ 2$, where $\Omega$ contains all triples $i, j, k$ for which $u_{ij} \le \min(u_{ik}, u_{jk})$ and $u_{ik} \ne u_{jk}$, i.e., for which $u$ violates the ultrametric constraints.  The unconstrained minimizations are carried out using either \code{optim()} or \code{nlm()} in base package~\pkg{stats}, with analytic gradients given in \cite{cluster:Carroll+Pruzansky:1980}.  This ``works'', even though we note that $P$ is not even a continuous function, which seems to have gone unnoticed in the literature!
(Consider an ultrametric $u$ for which $u_{ij} = u_{ik} < u_{jk}$ for some $i, j, k$ and define $u(\delta)$ by changing the $u_{ij}$ to $u_{ij} + \delta$.  For $u$, both $(i,j,k)$ and $(j,i,k)$ are in the violation set $\Omega$, whereas for all $\delta$ sufficiently small, only $(j,i,k)$ is in the violation set for $u(\delta)$.  Hence, $\lim_{\delta\to 0} P(u(\delta)) = P(u) + (u_{ij} - u_{jk})^2 \ne P(u)$.  This shows that $P$ is discontinuous at all non-constant $u$ with duplicated entries.  On the other hand, it is continuously differentiable at all $u$ with unique entries.)  Hence, we need to turn off checking analytical gradients when using \code{nlm()} for minimization.

The default optimization using conjugate gradients should work reasonably well for medium to large size problems.  For \dQuote{small} ones, using \code{nlm()} is usually faster.  Note that the number of ultrametric constraints is of the order $n^3$, suggesting the use of the SUMT approach in favor of \code{constrOptim()} in \pkg{stats}.  It should be noted that the SUMT approach is a heuristic which cannot be guaranteed to find the global minimum.  Standard practice would recommend using the best solution found in \dQuote{sufficiently many} replications of the base algorithm.

\subsubsection{Extensibility}

The methods provided in package~\pkg{clue} handle the partitions and hierarchies obtained from clustering functions in the base R distribution, as well as packages \pkg{RWeka}~\citep{cluster:Hornik+Hothorn+Karatzoglou:2006}, \pkg{cba}~\citep{cluster:Buchta+Hahsler:2005}, \pkg{cclust}~\citep{cluster:Dimitriadou:2005}, \pkg{cluster}, \pkg{e1071}~\citep{cluster:Dimitriadou+Hornik+Leisch:2005}, \pkg{flexclust}~\citep{cluster:Leisch:2006a}, \pkg{flexmix}~\citep{cluster:Leisch:2004}, \pkg{kernlab}~\citep{cluster:Karatzoglou+Smola+Hornik:2004}, and \pkg{mclust} (and of course, \pkg{clue} itself).

Extending support to other packages is straightforward, provided that clusterings are instances of classes.  Suppose e.g.\ that a package has a function \code{glvq()} for ``generalized'' (i.e., non-Euclidean) Learning Vector Quantization which returns an object of class~\class{glvq}, in turn being a list with component \code{class\_ids} containing the class ids.  To integrate this into the \pkg{clue} framework, all that is necessary is to provide the following methods.
<<>>=
cl_class_ids.glvq <- function(x) as.cl_class_ids(x$class_ids)
is.cl_partition.glvq <- function(x) TRUE
is.cl_hard_partition.glvq <- function(x) TRUE
@ % $

\subsection{Cluster ensembles}

Cluster ensembles are realized as lists of clusterings with additional class information.  All clusterings in an ensemble must be of the same ``kind'' (i.e., either all partitions as known to \code{is.cl\_partition()}, or all hierarchies as known to \code{is.cl\_hierarchy()}, respectively), and have the same number of objects.  If all clusterings are partitions, the list realizing the ensemble has class~\class{cl\_partition\_ensemble} and inherits from \class{cl\_ensemble}; if all clusterings are hierarchies, it has class~\class{cl\_hierarchy\_ensemble} and inherits from \class{cl\_ensemble}.  Empty ensembles cannot be categorized according to the kind of clusterings they contain, and hence only have class~\class{cl\_ensemble}.

Function \code{cl\_ensemble()} creates a cluster ensemble object from clusterings given either one-by-one, or as a list passed to the \code{list} argument.
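
A minimal sketch (not evaluated here) of creating such an ensemble from repeated runs of a base clusterer on artificial data:
<<eval=FALSE>>=
## Not run: an ensemble of k-means partitions obtained from different
## random initializations.
x <- matrix(rnorm(100), ncol = 2)
lst <- replicate(5, kmeans(x, 3), simplify = FALSE)
kens <- cl_ensemble(list = lst)
kens
@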
As unclassed lists could be used to represent single clusterings (in particular for results from \code{kmeans()} in versions of R prior to 2.1.0), we prefer not to assume that an unnamed given list is a list of clusterings. \code{cl\_ensemble()} verifies that all given clusterings are of the same kind, and all have the same number of objects. (By the notion of cluster ensembles, we should in principle verify that the clusterings come from the \emph{same} objects, which of course is not always possible.) The list representation makes it possible to use \code{lapply()} for computations on the individual clusterings in (i.e., the components of) a cluster ensemble. Available methods for cluster ensembles include those for subscripting, \code{c()}, \code{rep()}, \code{print()}, and \code{unique()}, where the last is based on a \code{unique()} method for lists added in R~2.1.1, and makes it possible to find unique and duplicated elements in cluster ensembles. The elements of the ensemble can be tabulated using \code{cl\_tabulate()}. Function \code{cl\_boot()} generates cluster ensembles with bootstrap replicates of the results of applying a \dQuote{base} clustering algorithm to a given data set. Currently, this is a rather simple-minded function with limited applicability, and mostly useful for studying the effect of (uncontrolled) random initializations of fixed-point partitioning algorithms such as \code{kmeans()} or \code{cmeans()} in package~\pkg{e1071}. To study the effect of varying control parameters or explicitly providing random starting values, the respective cluster ensemble has to be generated explicitly (most conveniently by using \code{replicate()} to create a list \code{lst} of suitable instances of clusterings obtained by the base algorithm, and using \code{cl\_ensemble(list = lst)} to create the ensemble). Resampling the training data is possible for base algorithms which can predict the class memberships of new data using \code{cl\_predict} (e.g., by classifying the out-of-bag data to their closest prototype). In fact, we believe that for unsupervised learning methods such as clustering, \emph{reweighting} is conceptually superior to resampling, and have therefore recently enhanced package~\pkg{e1071} to provide an implementation of weighted fuzzy $c$-means, and package~\pkg{flexclust} contains an implementation of weighted $k$-means. We are currently experimenting with interfaces for providing ``direct'' support for reweighting via \code{cl\_boot()}. \subsection{Cluster proximities} \subsubsection{Principles} Computing dissimilarities and similarities (``agreements'') between clusterings of the same objects is a key ingredient in the analysis of cluster ensembles. The ``standard'' data structures available for such proximity data (measures of similarity or dissimilarity) are classes~\class{dist} and \class{dissimilarity} in package~\pkg{cluster} (which basically, but not strictly, extends \class{dist}), and are both not entirely suited to our needs. First, they are confined to \emph{symmetric} dissimilarity data. Second, they do not provide enough reflectance. We also note that the Bioconductor package~\pkg{graph}~\citep{cluster:Gentleman+Whalen:2005} contains an efficient subscript method for objects of class~\class{dist}, but returns a ``raw'' matrix for row/column subscripting. For package~\pkg{clue}, we use the following approach. 
There are classes for symmetric and (possibly) non-symmetric proximity data (\class{cl\_proximity} and \class{cl\_cross\_proximity}), which, in addition to holding the numeric data, also contain a description ``slot'' (attribute), currently a character string, as a first approximation to providing more reflectance.  Internally, symmetric proximity objects store the lower diagonal proximity values in a numeric vector (in row-major order), i.e., in the same way as objects of class~\class{dist}; a \code{self} attribute can be used for diagonal values (in case some of these are non-zero).  Symmetric proximity objects can be coerced to dense matrices using \code{as.matrix()}.  It is possible to use 2-index matrix-style subscripting for symmetric proximity objects; unless this uses identical row and column indices, it results in a non-symmetric proximity object.

This approach ``propagates'' to classes for symmetric and (possibly) non-symmetric cluster dissimilarity and agreement data (e.g., \class{cl\_dissimilarity} and \class{cl\_cross\_dissimilarity} for dissimilarity data), which extend the respective proximity classes.  Ultrametric objects are implemented as symmetric proximity objects with a dissimilarity interpretation so that self-proximities are zero, and inherit from classes~\class{cl\_dissimilarity} and \class{cl\_proximity}.

The current way of providing reflectance is far from optimal.  For example, if \code{s} is a similarity object (with cluster agreements), \code{1 - s} is a dissimilarity one, but the description is preserved unchanged.  This issue could be addressed by providing high-level functions for transforming proximities.

\label{synopsis}
Cluster dissimilarities are computed via \code{cl\_dissimilarity()} with synopsis \code{cl\_dissimilarity(x, y = NULL, method = "euclidean")}, where \code{x} and \code{y} are cluster ensemble objects or coercible to such, or \code{NULL} (\code{y} only).  If \code{y} is \code{NULL}, the return value is an object of class~\class{cl\_dissimilarity} which contains the dissimilarities between all pairs of clusterings in \code{x}.  Otherwise, it is an object of class~\class{cl\_cross\_dissimilarity} with the dissimilarities between the clusterings in \code{x} and the clusterings in \code{y}.  Formal argument \code{method} is either a character string specifying one of the built-in methods for computing dissimilarity, or a function to be taken as a user-defined method, making it reasonably straightforward to add methods.

Function \code{cl\_agreement()} has the same interface as \code{cl\_dissimilarity()}, returning cluster similarity objects with respective classes~\class{cl\_agreement} and \class{cl\_cross\_agreement}.  The built-in dissimilarity and agreement methods may correspond to each other (in the sense of being transforms of each other), but do not necessarily do so, as there typically are no canonical transformations.  E.g., according to needs and scientific community, agreements might be transformed to dissimilarities via $d = - \log(s)$ or the square root thereof \citep[e.g.,][]{cluster:Strehl+Ghosh:2003b}, or via $d = 1 - s$.

\subsubsection{Partition proximities}

When assessing agreement or dissimilarity of partitions, one needs to consider that the class ids may be permuted arbitrarily without changing the underlying partitions.  For membership matrices~$M$, permuting class ids amounts to replacing $M$ by $M \Pi$, where $\Pi$ is a suitable permutation matrix.
We note that the co-membership matrix $C(M) = MM'$ is unchanged by these transformations; hence, proximity measures based on co-occurrences, such as the Katz-Powell \citep{cluster:Katz+Powell:1953} or Rand \citep{cluster:Rand:1971} indices, do not explicitly need to adjust for possible re-labeling. The same is true for measures based on the ``confusion matrix'' $M' \tilde{M}$ of two membership matrices $M$ and $\tilde{M}$ which are invariant under permutations of rows and columns, such as the Normalized Mutual Information (NMI) measure introduced in \cite{cluster:Strehl+Ghosh:2003a}. Other proximity measures need to find permutations so that the classes are optimally matched, which of course in general requires exhaustive search through all $k!$ possible permutations, where $k$ is the (common) number of classes in the partitions, and thus will typically be prohibitively expensive. Fortunately, in some important cases, optimal matchings can be determined very efficiently. We explain this in detail for ``Euclidean'' partition dissimilarity and agreement (which in fact is the default measure used by \code{cl\_dissimilarity()} and \code{cl\_agreement()}). Euclidean partition dissimilarity \citep{cluster:Dimitriadou+Weingessel+Hornik:2002} is defined as \begin{displaymath} d(M, \tilde{M}) = \min\nolimits_\Pi \| M - \tilde{M} \Pi \| \end{displaymath} where the minimum is taken over all permutation matrices~$\Pi$, $\|\cdot\|$ is the Frobenius norm (so that $\|Y\|^2 = \trace(Y'Y)$), and $n$ is the (common) number of objects in the partitions. As $\| M - \tilde{M} \Pi \|^2 = \trace(M'M) - 2 \trace(M'\tilde{M}\Pi) + \trace(\Pi'\tilde{M}'\tilde{M}\Pi) = \trace(M'M) - 2 \trace(M'\tilde{M}\Pi) + \trace(\tilde{M}'\tilde{M})$, we see that minimizing $\| M - \tilde{M} \Pi \|^2$ is equivalent to maximizing $\trace(M'\tilde{M}\Pi) = \sum_{i,k}{\mu_{ik}\tilde{\mu}}_{i,\pi(k)}$, which for hard partitions is the number of objects with the same label in the partitions given by $M$ and $\tilde{M}\Pi$. Finding the optimal $\Pi$ is thus recognized as an instance of the \emph{linear sum assignment problem} (LSAP, also known as the weighted bipartite graph matching problem). The LSAP can be solved by linear programming, e.g., using Simplex-style primal algorithms as done by function~\code{lp.assign()} in package~\pkg{lpSolve}~\citep{cluster:Buttrey:2005}, but primal-dual algorithms such as the so-called Hungarian method can be shown to find the optimum in time $O(k^3)$ \citep[e.g.,][]{cluster:Papadimitriou+Steiglitz:1982}. Available published implementations include TOMS 548 \citep{cluster:Carpaneto+Toth:1980}, which however is restricted to integer weights and $k < 131$. One can also transform the LSAP into a network flow problem, and use e.g.~RELAX-IV \citep{cluster:Bertsekas+Tseng:1994} for solving this, as is done in package~\pkg{optmatch}~\citep{cluster:Hansen:2005}. In package~\pkg{clue}, we use an efficient C implementation of the Hungarian algorithm kindly provided to us by Walter B\"ohm, which has been found to perform very well across a wide range of problem sizes. \cite{cluster:Gordon+Vichi:2001} use a variant of Euclidean dissimilarity (``GV1 dissimilarity'') which is based on the sum of the squared difference of the memberships of matched (non-empty) classes only, discarding the unmatched ones (see their Example~2). This results in a measure which is discontinuous over the space of soft partitions with arbitrary numbers of classes. 
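
As a small illustration (not evaluated here), the following sketch computes the (default) Euclidean dissimilarity between two hard partitions of six objects and, for comparison, solves the corresponding LSAP directly using \code{solve\_LSAP()}, the linear sum assignment solver exported by \pkg{clue}; the class id vectors are arbitrary.
<<eval=FALSE>>=
## Not run: Euclidean partition dissimilarity and the underlying optimal
## matching of the class labels.
ids1 <- c(1, 1, 2, 2, 3, 3)
ids2 <- c(2, 2, 3, 3, 1, 1)
cl_dissimilarity(as.cl_partition(ids1), as.cl_partition(ids2))
M <- diag(3)[ids1, ]  ## hard membership matrices
N <- diag(3)[ids2, ]
solve_LSAP(crossprod(M, N), maximum = TRUE)  ## optimal column permutation
@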
The partition agreement measures ``angle'' and ``diag'' (maximal cosine of the angle between the memberships, and maximal co-classification rate, where both maxima are taken over all column permutations of the membership matrices) are based on solving the same LSAP as for Euclidean dissimilarity.  Finally, Manhattan partition dissimilarity is defined as the minimal sum of the absolute differences of $M$ and all column permutations of $\tilde{M}$, and can again be computed efficiently by solving an LSAP.

For hard partitions, both Manhattan and squared Euclidean dissimilarity give twice the \emph{transfer distance} \citep{cluster:Charon+Denoeud+Guenoche:2006}, which is the minimum number of objects that must be removed so that the implied partitions (restrictions to the remaining objects) are identical.  This is also known as the \emph{$R$-metric} in \cite{cluster:Day:1981}, i.e., the number of augmentations and removals of single objects needed to transform one partition into the other, and the \emph{partition-distance} in \cite{cluster:Gusfield:2002}.

Note when assessing proximity that agreements for soft partitions are always (and quite often considerably) lower than the agreements for the corresponding nearest hard partitions, unless the agreement measures are based on the latter anyway (as currently done for Rand, Katz-Powell, and NMI).

Package~\pkg{clue} provides additional agreement measures, such as the Jaccard and Fowlkes-Mallows \citep[quite often incorrectly attributed to \cite{cluster:Wallace:1983}]{cluster:Fowlkes+Mallows:1983a} indices, and dissimilarity measures such as the ``symdiff'' and Rand distances (the latter is proportional to the metric of \cite{cluster:Mirkin:1996}) and the metrics discussed in \cite{cluster:Boorman+Arabie:1972}.  One could easily add more proximity measures, such as the ``Variation of Information'' \citep{cluster:Meila:2003}.  However, all these measures are rigorously defined for hard partitions only.

To see why extensions to soft partitions are far from straightforward, consider e.g.\ measures based on the confusion matrix.  Its entries count the cardinality of certain intersections of sets.  \label{fuzzy} In a fuzzy context for soft partitions, a natural generalization would be using fuzzy cardinalities (i.e., sums of membership values) of fuzzy intersections instead.  There are many possible choices for the latter, one of them being the product of the membership values (which corresponds to employing the confusion matrix also in the fuzzy case), even though the minimum, rather than the product, is the ``usual'' choice for fuzzy intersections.  A similar point can be made for co-occurrences of soft memberships.  We are not aware of systematic investigations of these extension issues.

\subsubsection{Hierarchy proximities}

Available built-in dissimilarity measures for hierarchies include \emph{Euclidean} (again, the default measure used by \code{cl\_dissimilarity()}) and Manhattan dissimilarity, which are simply the Euclidean (square root of the sum of squared differences) and Manhattan (sum of the absolute differences) dissimilarities between the associated ultrametrics.  Cophenetic dissimilarity is defined as $1 - c^2$, where $c$ is the cophenetic correlation coefficient \citep{cluster:Sokal+Rohlf:1962}, i.e., the Pearson product-moment correlation between the ultrametrics.  Gamma dissimilarity is the rate of inversions between the associated ultrametrics $u$ and $v$ (i.e., the rate of pairs $(i,j)$ and $(k,l)$ for which $u_{ij} < u_{kl}$ and $v_{ij} > v_{kl}$).
This measure is a linear transformation of Kruskal's~$\gamma$. Finally, symdiff dissimilarity is the cardinality of the symmetric set difference of the sets of classes (hierarchies in the strict sense) induced by the dendrograms. Associated agreement measures are obtained by suitable transformations of the dissimilarities~$d$; for Euclidean proximities, we prefer to use $1 / (1 + d)$ rather than e.g.\ $\exp(-d)$. One should note that whereas cophenetic and gamma dissimilarities are invariant to linear transformations, Euclidean and Manhattan ones are not. Hence, if only the relative ``structure'' of the dendrograms is of interest, these dissimilarities should only be used after transforming the ultrametrics to a common range of values (e.g., to $[0,1]$). \subsection{Consensus clusterings} Consensus clusterings ``synthesize'' the information in the elements of a cluster ensemble into a single clustering. There are three main approaches to obtaining consensus clusterings \citep{cluster:Hornik:2005a,cluster:Gordon+Vichi:2001}: in the \emph{constructive} approach, one specifies a way to construct a consensus clustering. In the \emph{axiomatic} approach, emphasis is on the investigation of existence and uniqueness of consensus clusterings characterized axiomatically. The \emph{optimization} approach formalizes the natural idea of describing consensus clusterings as the ones which ``optimally represent the ensemble'' by providing a criterion to be optimized over a suitable set $\mathcal{C}$ of possible consensus clusterings. If $d$ is a dissimilarity measure and $C_1, \ldots, C_B$ are the elements of the ensemble, one can e.g.\ look for solutions of the problem \begin{displaymath} \sum\nolimits_{b=1}^B w_b d(C, C_b) ^ p \Rightarrow \min\nolimits_{C \in \mathcal{C}}, \end{displaymath} for some $p \ge 0$, i.e., as clusterings~$C^*$ minimizing weighted average dissimilarity powers of order~$p$. Analogously, if a similarity measure is given, one can look for clusterings maximizing weighted average similarity powers. Following \cite{cluster:Gordon+Vichi:1998}, an above $C^*$ is referred to as (weighted) \emph{median} or \emph{medoid} clustering if $p = 1$ and the optimum is sought over the set of all possible base clusterings, or the set $\{ C_1, \ldots, C_B \}$ of the base clusterings, respectively. For $p = 2$, we have \emph{least squares} consensus clusterings (generalized means). For computing consensus clusterings, package~\pkg{clue} provides function \code{cl\_consensus()} with synopsis \code{cl\_consensus(x, method = NULL, weights = 1, control = list())}. This allows (similar to the functions for computing cluster proximities, see Section~\ref{synopsis} on Page~\pageref{synopsis}) argument \code{method} to be a character string specifying one of the built-in methods discussed below, or a function to be taken as a user-defined method (taking an ensemble, the case weights, and a list of control parameters as its arguments), again making it reasonably straightforward to add methods. In addition, function~\code{cl\_medoid()} can be used for obtaining medoid partitions (using, in principle, arbitrary dissimilarities). Modulo possible differences in the case of ties, this gives the same results as (the medoid obtained by) \code{pam()} in package~\pkg{cluster}. If all elements of the ensemble are partitions, package~\pkg{clue} provides algorithms for computing soft least squares consensus partitions for weighted Euclidean, GV1 and co-membership dissimilarities. 
Let $M_1, \ldots, M_B$ and $M$ denote the membership matrices of the elements of the ensemble and their sought least squares consensus partition, respectively. For Euclidean dissimilarity, we need to find \begin{displaymath} \sum_b w_b \min\nolimits_{\Pi_b} \| M - M_b \Pi_b \|^2 \Rightarrow \min\nolimits_M \end{displaymath} over all membership matrices (i.e., stochastic matrices) $M$, or equivalently, \begin{displaymath} \sum_b w_b \| M - M_b \Pi_b \|^2 \Rightarrow \min\nolimits_{M, \Pi_1, \ldots, \Pi_B} \end{displaymath} over all $M$ and permutation matrices $\Pi_1, \ldots, \Pi_B$. Now fix the $\Pi_b$ and let $\bar{M} = s^{-1} \sum_b w_b M_b \Pi_b$ be the weighted average of the $M_b \Pi_b$, where $s = \sum_b w_b$. Then \begin{eqnarray*} \lefteqn{\sum_b w_b \| M - M_b \Pi_b \|^2} \\ &=& \sum_b w_b (\|M\|^2 - 2 \trace(M' M_b \Pi_b) + \|M_b\Pi_b\|^2) \\ &=& s \|M\|^2 - 2 s \trace(M' \bar{M}) + \sum_b w_b \|M_b\|^2 \\ &=& s (\|M - \bar{M}\|^2) + \sum_b w_b \|M_b\|^2 - s \|\bar{M}\|^2 \end{eqnarray*} Thus, as already observed in \cite{cluster:Dimitriadou+Weingessel+Hornik:2002} and \cite{cluster:Gordon+Vichi:2001}, for fixed permutations $\Pi_b$ the optimal soft $M$ is given by $\bar{M}$. The optimal permutations can be found by minimizing $- s \|\bar{M}\|^2$, or equivalently, by maximizing \begin{displaymath} s^2 \|\bar{M}\|^2 = \sum_{\beta, b} w_\beta w_b \trace(\Pi_\beta'M_\beta'M_b\Pi_b). \end{displaymath} With $U_{\beta,b} = w_\beta w_b M_\beta' M_b$ we can rewrite the above as \begin{displaymath} \sum_{\beta, b} w_\beta w_b \trace(\Pi_\beta'M_\beta'M_b\Pi_b) = \sum_{\beta,b} \sum_{j=1}^k [U_{\beta,b}]_{\pi_\beta(j), \pi_b(j)} =: \sum_{j=1}^k c_{\pi_1(j), \ldots, \pi_B(j)} \end{displaymath} This is an instance of the \emph{multi-dimensional assignment problem} (MAP), which, contrary to the LSAP, is known to be NP-hard \citep[e.g., via reduction to 3-DIMENSIONAL MATCHING,][]{cluster:Garey+Johnson:1979}, and can e.g.\ be approached using randomized parallel algorithms \citep{cluster:Oliveira+Pardalos:2004}. Branch-and-bound approaches suggested in the literature \citep[e.g.,][]{cluster:Grundel+Oliveira+Pardalos:2005} are unfortunately computationally infeasible for ``typical'' sizes of cluster ensembles ($B \ge 20$, maybe even in the hundreds). Package~\pkg{clue} provides two heuristics for (approximately) finding the soft least squares consensus partition for Euclidean dissimilarity. Method \code{"DWH"} of function \code{cl\_consensus()} is an extension of the greedy algorithm in \cite{cluster:Dimitriadou+Weingessel+Hornik:2002} which is based on a single forward pass through the ensemble which in each step chooses the ``locally'' optimal $\Pi$. Starting with $\tilde{M}_1 = M_1$, $\tilde{M}_b$ is obtained from $\tilde{M}_{b-1}$ by optimally matching $M_b \Pi_b$ to this, and taking a weighted average of $\tilde{M}_{b-1}$ and $M_b \Pi_b$ in a way that $\tilde{M}_b$ is the weighted average of the first~$b$ $M_\beta \Pi_\beta$. This simple approach could be further enhanced via back-fitting or several passes, in essence resulting in an ``on-line'' version of method \code{"SE"}. This, in turn, is a fixed-point algorithm, which iterates between updating $M$ as the weighted average of the current $M_b \Pi_b$, and determining the $\Pi_b$ by optimally matching the current $M$ to the individual $M_b$. Finally, method \code{"GV1"} implements the fixed-point algorithm for the ``first model'' in \cite{cluster:Gordon+Vichi:2001}, which gives least squares consensus partitions for GV1 dissimilarity. 
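
The following sketch (not evaluated here) indicates how the heuristics could be compared on a given ensemble; the artificial data, the number of base partitions, and the number of classes are arbitrary.
<<eval=FALSE>>=
## Not run: compare the "DWH", "SE" and "GV1" consensus heuristics on a
## small ensemble of k-means partitions.
x <- matrix(rnorm(200), ncol = 2)
ens <- cl_ensemble(list = replicate(10, kmeans(x, 3), simplify = FALSE))
cons <- lapply(c("DWH", "SE", "GV1"),
               function(m) cl_consensus(ens, method = m))
## average squared Euclidean dissimilarity to the ensemble:
sapply(cons, function(co) mean(cl_dissimilarity(ens, co) ^ 2))
@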
In the above, we implicitly assumed that all partitions in the ensemble as well as the sought consensus partition have the same number of classes.  The more general case can be dealt with through suitable ``projection'' devices.

When using co-membership dissimilarity, the least squares consensus partition is determined by minimizing
\begin{eqnarray*}
  \lefteqn{\sum_b w_b \|MM' - M_bM_b'\|^2} \\
  &=& s \|MM' - \bar{C}\|^2 + \sum_b w_b \|M_bM_b'\|^2 - s \|\bar{C}\|^2
\end{eqnarray*}
over all membership matrices~$M$, where now $\bar{C} = s^{-1} \sum_b w_b C(M_b) = s^{-1} \sum_b w_b M_bM_b'$ is the weighted average co-membership matrix of the ensemble.  This corresponds to the ``third model'' in \cite{cluster:Gordon+Vichi:2001}.  Method \code{"GV3"} of function \code{cl\_consensus()} provides a SUMT approach (see Section~\ref{SUMT} on Page~\pageref{SUMT}) for finding the minimum.  We note that this strategy could more generally be applied to consensus problems of the form
\begin{displaymath}
  \sum_b w_b \|\Phi(M) - \Phi(M_b)\|^2 \Rightarrow \min\nolimits_M,
\end{displaymath}
which are equivalent to minimizing $\|\Phi(M) - \bar{\Phi}\|^2$, with $\bar{\Phi}$ the weighted average of the $\Phi(M_b)$.  This includes e.g.\ the case where generalized co-memberships are defined by taking the ``standard'' fuzzy intersection of co-incidences, as discussed in Section~\ref{fuzzy} on Page~\pageref{fuzzy}.

Package~\pkg{clue} currently does not provide algorithms for obtaining \emph{hard} consensus partitions, as e.g.\ done in \cite{cluster:Krieger+Green:1999} using Rand proximity.  It seems ``natural'' to extend the methods discussed above to include a constraint on softness, e.g., on the partition coefficient PC (see Section~\ref{PC} on Page~\pageref{PC}).  For Euclidean dissimilarity, straightforward Lagrangian computations show that the constrained minima are of the form $\bar{M}(\alpha) = \alpha \bar{M} + (1 - \alpha) E$, where $E$ is the ``maximally soft'' membership with all entries equal to $1/k$, $\bar{M}$ is again the weighted average of the $M_b\Pi_b$ with the $\Pi_b$ solving the underlying MAP, and $\alpha$ is chosen such that $PC(\bar{M}(\alpha))$ equals a prescribed value.  As $\alpha$ increases (even beyond one), softness of the $\bar{M}(\alpha)$ decreases.  However, for $\alpha > \alpha^* = 1 / (1 - k\mu^*)$, where $\mu^*$ is the minimum of the entries of $\bar{M}$, the $\bar{M}(\alpha)$ have negative entries, and are no longer feasible membership matrices.  Obviously, the non-negativity constraints for the $\bar{M}(\alpha)$ eventually put restrictions on the admissible $\Pi_b$ in the underlying MAP.  Thus, such a simple relaxation approach to obtaining optimal hard partitions is not feasible.

For ensembles of hierarchies, \code{cl\_consensus()} provides a built-in method (\code{"cophenetic"}) for approximately minimizing average weighted squared Euclidean dissimilarity
\begin{displaymath}
  \sum_b w_b \| U - U_b \|^2 \Rightarrow \min\nolimits_U
\end{displaymath}
over all ultrametrics~$U$, where $U_1, \ldots, U_B$ are the ultrametrics corresponding to the elements of the ensemble.  This is of course equivalent to minimizing $\| U - \bar{U} \|^2$, where $\bar{U} = s^{-1} \sum_b w_b U_b$ is the weighted average of the $U_b$.  The SUMT approach provided by function \code{ls\_fit\_ultrametric()} (see Section~\ref{SUMT} on Page~\pageref{SUMT}) is employed for finding the sought weighted least squares consensus hierarchy.
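
A sketch (not evaluated here) of obtaining such a consensus hierarchy for a small ensemble of hierarchies; the data and the choice of linkage methods are arbitrary.
<<eval=FALSE>>=
## Not run: least squares consensus of hierarchies obtained by different
## hclust() linkage methods.
d <- dist(matrix(rnorm(60), ncol = 3))
hens <- cl_ensemble(list = lapply(c("single", "complete", "average"),
                                  function(m) hclust(d, m)))
hcons <- cl_consensus(hens, method = "cophenetic")
mean(cl_dissimilarity(hens, hcons) ^ 2)
@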
In addition, method \code{"majority"} obtains a consensus hierarchy from an extension of the majority consensus tree of \cite{cluster:Margush+McMorris:1981}, which minimizes $L(U) = \sum_b w_b d(U_b, U)$ over all ultrametrics~$U$, where $d$ is the symmetric difference dissimilarity. Clearly, the available methods use heuristics for solving hard optimization problems, and cannot be guaranteed to find a global optimum. Standard practice would recommend to use the best solution found in ``sufficiently many'' replications of the methods. Alternative recent approaches to obtaining consensus partitions include ``Bagged Clustering'' \citep[provided by \code{bclust()} in package~\pkg{e1071}]{cluster:Leisch:1999}, the ``evidence accumulation'' framework of \cite{cluster:Fred+Jain:2002}, the NMI optimization and graph-partitioning methods in \cite{cluster:Strehl+Ghosh:2003a}, ``Bagged Clustering'' as in \cite{cluster:Dudoit+Fridlyand:2003}, and the hybrid bipartite graph formulation of \cite{cluster:Fern+Brodley:2004}. Typically, these approaches are constructive, and can easily be implemented based on the infrastructure provided by package~\pkg{clue}. Evidence accumulation amounts to standard hierarchical clustering of the average co-membership matrix. Procedure~BagClust1 of \cite{cluster:Dudoit+Fridlyand:2003} amounts to computing $B^{-1} \sum_b M_b\Pi_b$, where each $\Pi_b$ is determined by optimal Euclidean matching of $M_b$ to a fixed reference membership $M_0$. In the corresponding ``Bagged Clustering'' framework, $M_0$ and the $M_b$ are obtained by applying the base clusterer to the original data set and bootstrap samples from it, respectively. This is implemented as method \code{"DFBC1"} of \code{cl\_bag()} in package~\pkg{clue}. Finally, the approach of \cite{cluster:Fern+Brodley:2004} solves an LSAP for an asymmetric cost matrix based on object-by-all-classes incidences. \subsection{Cluster partitions} To investigate the ``structure'' in a cluster ensemble, an obvious idea is to start clustering the clusterings in the ensemble, resulting in ``secondary'' clusterings \citep{cluster:Gordon+Vichi:1998, cluster:Gordon:1999}. This can e.g.\ be performed by using \code{cl\_dissimilarity()} (or \code{cl\_agreement()}) to compute a dissimilarity matrix for the ensemble, and feed this into a dissimilarity-based clustering algorithm (such as \code{pam()} in package~\pkg{cluster} or \code{hclust()} in package~\pkg{stats}). (One can even use \code{cutree()} to obtain hard partitions from hierarchies thus obtained.) If prototypes (``typical clusterings'') are desired for partitions of clusterings, they can be determined post-hoc by finding suitable consensus clusterings in the classes of the partition, e.g., using \code{cl\_consensus()} or \code{cl\_medoid()}. Package~\pkg{clue} additionally provides \code{cl\_pclust()} for direct prototype-based partitioning based on minimizing criterion functions of the form $\sum w_b u_{bj}^m d(x_b, p_j)^e$, the sum of the case-weighted membership-weighted $e$-th powers of the dissimilarities between the elements~$x_b$ of the ensemble and the prototypes~$p_j$, for suitable dissimilarities~$d$ and exponents~$e$. (The underlying feature spaces are that of membership matrices and ultrametrics, respectively, for partitions and hierarchies.) Parameter~$m$ must not be less than one and controls the softness of the obtained partitions, corresponding to the \dQuote{fuzzification parameter} of the fuzzy $c$-means algorithm. 
For $m = 1$, a generalization of the Lloyd-Forgy variant \citep{cluster:Lloyd:1957, cluster:Forgy:1965, cluster:Lloyd:1982} of the $k$-means algorithm is used, which iterates between reclassifying objects to their closest prototypes, and computing new prototypes as consensus clusterings for the classes.  \citet{cluster:Gaul+Schader:1988} introduced this procedure for \dQuote{Clusterwise Aggregation of Relations} (with the same domains), containing equivalence relations, i.e., hard partitions, as a special case.  For $m > 1$, a generalization of the fuzzy $c$-means recipe \citep[e.g.,][]{cluster:Bezdek:1981} is used, which alternates between computing optimal memberships for fixed prototypes, and computing new prototypes as the suitably weighted consensus clusterings for the classes.

This procedure is repeated until convergence occurs, or the maximal number of iterations is reached.  Consensus clusterings are computed using (one of the methods provided by) \code{cl\_consensus()}, with dissimilarities~$d$ and exponent~$e$ implied by the method employed, and obtained via a registration mechanism.  The default methods compute least squares Euclidean consensus clusterings, i.e., use Euclidean dissimilarity~$d$ and $e = 2$.

\section{Examples}
\label{sec:examples}

\subsection{Cassini data}

\cite{cluster:Dimitriadou+Weingessel+Hornik:2002} and \cite{cluster:Leisch:1999} use Cassini data sets to illustrate how e.g.\ suitable aggregation of base $k$-means results can reveal underlying non-convex structure which cannot be found by the base algorithm.  Such data sets contain points in 2-dimensional space drawn from the uniform distribution on 3 structures, with the two ``outer'' ones banana-shaped and the ``middle'' one a circle, and can be obtained by function~\code{mlbench.cassini()} in package~\pkg{mlbench}~\citep{cluster:Leisch+Dimitriadou:2005}.  Package~\pkg{clue} contains the data sets \code{Cassini} and \code{CKME}, which are an instance of a 1000-point Cassini data set, and a cluster ensemble of 50 $k$-means partitions of the data set into three classes, respectively.

The data set is shown in Figure~\ref{fig:Cassini}.
<<Cassini-data,eval=FALSE>>=
data("Cassini")
plot(Cassini$x, col = as.integer(Cassini$classes),
     xlab = "", ylab = "")
@ % $
\begin{figure}
  \centering
<<fig=TRUE,echo=FALSE>>=
<<Cassini-data>>
@ %
  \caption{The Cassini data set.}
  \label{fig:Cassini}
\end{figure}

Figure~\ref{fig:CKME} gives a dendrogram of the Euclidean dissimilarities of the elements of the $k$-means ensemble.
<<CKME-dendrogram,eval=FALSE>>=
data("CKME")
plot(hclust(cl_dissimilarity(CKME)), labels = FALSE)
@ %
\begin{figure}
  \centering
<<fig=TRUE,echo=FALSE>>=
<<CKME-dendrogram>>
@ %
  \caption{A dendrogram of the Euclidean dissimilarities of 50 $k$-means partitions of the Cassini data into 3 classes.}
  \label{fig:CKME}
\end{figure}

We can see that there are large groups of essentially identical $k$-means solutions.  We can gain more insight by inspecting representatives of these three groups, or by computing the medoid of the ensemble
<<>>=
m1 <- cl_medoid(CKME)
table(Medoid = cl_class_ids(m1), "True Classes" = Cassini$classes)
@ % $
and inspecting it (Figure~\ref{fig:Cassini-medoid}):
<<Cassini-medoid,eval=FALSE>>=
plot(Cassini$x, col = cl_class_ids(m1), xlab = "", ylab = "")
@ % $
\begin{figure}
  \centering
<<fig=TRUE,echo=FALSE>>=
<<Cassini-medoid>>
@ %
  \caption{Medoid of the Cassini $k$-means ensemble.}
  \label{fig:Cassini-medoid}
\end{figure}

Flipping this solution top-down gives a second ``typical'' partition.  We see that the $k$-means base clusterers cannot resolve the underlying non-convex structure.
For the least squares consensus of the ensemble, we obtain
<<>>=
set.seed(1234)
m2 <- cl_consensus(CKME)
@ %
where here and below we set the random seed for reproducibility, noting that one should really use several replicates of the consensus heuristic.  This consensus partition has confusion matrix
<<>>=
table(Consensus = cl_class_ids(m2), "True Classes" = Cassini$classes)
@ % $
and class details as displayed in Figure~\ref{fig:Cassini-mean}:
<<Cassini-mean,eval=FALSE>>=
plot(Cassini$x, col = cl_class_ids(m2), xlab = "", ylab = "")
@ % $
\begin{figure}
  \centering
<<fig=TRUE,echo=FALSE>>=
<<Cassini-mean>>
@ %
  \caption{Least Squares Consensus of the Cassini $k$-means ensemble.}
  \label{fig:Cassini-mean}
\end{figure}

This has drastically improved performance, and almost perfect recovery of the two outer shapes.  In fact, \cite{cluster:Dimitriadou+Weingessel+Hornik:2002} show that almost perfect classification can be obtained by suitable combinations of different base clusterers ($k$-means, fuzzy $c$-means, and unsupervised fuzzy competitive learning).

\subsection{Gordon-Vichi macroeconomic data}

\citet[Table~1]{cluster:Gordon+Vichi:2001} provide soft partitions of 21 countries based on macroeconomic data for the years 1975, 1980, 1985, 1990, and 1995.  These partitions were obtained using fuzzy $c$-means on measurements of the following variables: the annual per capita gross domestic product (GDP) in USD (converted to 1987 prices); the percentage of GDP provided by agriculture; the percentage of employees who worked in agriculture; and gross domestic investment, expressed as a percentage of the GDP.  Table~5 in \cite{cluster:Gordon+Vichi:2001} gives 3-class consensus partitions obtained by applying their models 1, 2, and 3 and the approach in \cite{cluster:Sato+Sato:1994}.

The partitions and consensus partitions are available in data sets \code{GVME} and \code{GVME\_Consensus}, respectively.  We compare the results of \cite{cluster:Gordon+Vichi:2001} using GV1 dissimilarities (model 1) to ours as obtained by \code{cl\_consensus()} with method \code{"GV1"}.
<<>>=
data("GVME")
GVME
set.seed(1)
m1 <- cl_consensus(GVME, method = "GV1",
                   control = list(k = 3, verbose = TRUE))
@ %
This results in a soft partition with average squared GV1 dissimilarity (the criterion function to be optimized by the consensus partition) of
<<>>=
mean(cl_dissimilarity(GVME, m1, "GV1") ^ 2)
@ %
We compare this to the consensus solution given in \cite{cluster:Gordon+Vichi:2001}:
<<>>=
data("GVME_Consensus")
m2 <- GVME_Consensus[["MF1/3"]]
mean(cl_dissimilarity(GVME, m2, "GV1") ^ 2)
table(CLUE = cl_class_ids(m1), GV2001 = cl_class_ids(m2))
@ %
Interestingly, we are able to obtain a ``better'' solution, which however agrees with the one reported in the literature with respect to their nearest hard partitions.
For the 2-class consensus partition, we obtain
<<>>=
set.seed(1)
m1 <- cl_consensus(GVME, method = "GV1",
                   control = list(k = 2, verbose = TRUE))
@
which is slightly better than the solution reported in \cite{cluster:Gordon+Vichi:2001}
<<>>=
mean(cl_dissimilarity(GVME, m1, "GV1") ^ 2)
m2 <- GVME_Consensus[["MF1/2"]]
mean(cl_dissimilarity(GVME, m2, "GV1") ^ 2)
@
but in fact agrees with it apart from rounding errors:
<<>>=
max(abs(cl_membership(m1) - cl_membership(m2)))
@
It is interesting to compare these solutions to the Euclidean 2-class consensus partition for the GVME ensemble:
<<>>=
set.seed(1)
m3 <- cl_consensus(GVME, method = "SE",
                   control = list(k = 2, verbose = TRUE))
@
This is markedly different from the GV1 consensus partition
<<>>=
table(GV1 = cl_class_ids(m1), Euclidean = cl_class_ids(m3))
@
with countries
<<>>=
rownames(m1)[cl_class_ids(m1) != cl_class_ids(m3)]
@ %
classified differently, being in the ``richer'' class for the GV1 and in the ``poorer'' one for the Euclidean consensus partition.  (In fact, all these countries end up in the ``middle'' class for the 3-class GV1 consensus partition.)

\subsection{Rosenberg-Kim kinship terms data}

\cite{cluster:Rosenberg+Kim:1975} describe an experiment where perceived similarities of the kinship terms were obtained from six different ``sorting'' experiments.  In one of these, 85 female undergraduates at Rutgers University were asked to sort 15 English terms into classes ``on the basis of some aspect of meaning''.  These partitions were printed in \citet[Table~7.1]{cluster:Rosenberg:1982}.  Comparison with the original data indicates that the partition data have the ``nephew'' and ``niece'' columns interchanged, which is corrected in data set \code{Kinship82}.

\citet[Table~6]{cluster:Gordon+Vichi:2001} provide consensus partitions for these data based on their models 1--3 (available in data set \code{Kinship82\_Consensus}).  We compare their results using co-membership dissimilarities (model 3) to ours as obtained by \code{cl\_consensus()} with method \code{"GV3"}.
<<>>=
data("Kinship82")
Kinship82
set.seed(1)
m1 <- cl_consensus(Kinship82, method = "GV3",
                   control = list(k = 3, verbose = TRUE))
@ %
This results in a soft partition with average co-membership dissimilarity (the criterion function to be optimized by the consensus partition) of
<<>>=
mean(cl_dissimilarity(Kinship82, m1, "comem") ^ 2)
@ %
Again, we compare this to the corresponding consensus solution given in \cite{cluster:Gordon+Vichi:2001}:
<<>>=
data("Kinship82_Consensus")
m2 <- Kinship82_Consensus[["JMF"]]
mean(cl_dissimilarity(Kinship82, m2, "comem") ^ 2)
@ %
Interestingly, again we obtain a (this time only ``slightly'') better solution, with
<<>>=
cl_dissimilarity(m1, m2, "comem")
table(CLUE = cl_class_ids(m1), GV2001 = cl_class_ids(m2))
@ %
indicating that the two solutions are reasonably close, even though
<<>>=
cl_fuzziness(cl_ensemble(m1, m2))
@ %
shows that the solution found by \pkg{clue} is ``softer''.

\subsection{Miller-Nicely consonant phoneme confusion data}

\cite{cluster:Miller+Nicely:1955} obtained the data on the auditory confusions of 16 English consonant phonemes by exposing female subjects to a series of syllables consisting of one of the consonants followed by the vowel `a' under 17 different experimental conditions.
Data set \code{Phonemes} provides consonant misclassification probabilities (i.e., similarities) obtained by aggregating the six so-called flat-noise conditions, in which only the speech-to-noise ratio was varied, into a single matrix of misclassification frequencies.

These data are used in \cite{cluster:DeSoete:1986} as an illustration of the SUMT approach for finding least squares optimal fits to dissimilarities by ultrametrics.  We can reproduce this analysis as follows.
<<>>=
data("Phonemes")
d <- as.dist(1 - Phonemes)
@ %
(Note that the data set has the consonant misclassification probabilities, i.e., the similarities between the phonemes.)
<<>>=
u <- ls_fit_ultrametric(d, control = list(verbose = TRUE))
@ %
This gives an ultrametric~$u$ for which Figure~\ref{fig:Phonemes} plots the corresponding dendrogram, ``basically'' reproducing Figure~1 in \cite{cluster:DeSoete:1986}.
<<Phonemes-dendrogram,eval=FALSE>>=
plot(u)
@ %
\begin{figure}
  \centering
<<fig=TRUE,echo=FALSE>>=
<<Phonemes-dendrogram>>
@ %
  \caption{Dendrogram for least squares fit to the Miller-Nicely consonant phoneme confusion data.}
  \label{fig:Phonemes}
\end{figure}

We can also compare the least squares fit obtained to that of other hierarchical clusterings of $d$, e.g.\ those obtained by \code{hclust()}.  The ``optimal''~$u$ has Euclidean dissimilarity
<<>>=
round(cl_dissimilarity(d, u), 4)
@ %
to $d$.  For the \code{hclust()} results, we get
<<>>=
hclust_methods <- c("ward", "single", "complete", "average", "mcquitty")
hens <- cl_ensemble(list = lapply(hclust_methods,
                                  function(m) hclust(d, m)))
names(hens) <- hclust_methods
round(sapply(hens, cl_dissimilarity, d), 4)
@ %
which all exhibit greater Euclidean dissimilarity to $d$ than $u$.  (We exclude methods \code{"median"} and \code{"centroid"} as these do not yield valid hierarchies.)  We can also compare the ``structure'' of the different hierarchies, e.g.\ by looking at the rate of inversions between them:
<<>>=
ahens <- c(L2opt = cl_ensemble(u), hens)
round(cl_dissimilarity(ahens, method = "gamma"), 2)
@ %

\section{Outlook}
\label{sec:outlook}

Package~\pkg{clue} was designed as an \emph{extensible} environment for computing on cluster ensembles.  It currently provides basic data structures for representing partitions and hierarchies, and facilities for computing on these, including methods for measuring proximity and obtaining consensus and ``secondary'' clusterings.

Many extensions to the available functionality are possible and in fact planned (some of these enhancements were already discussed in more detail in the course of this paper).
\begin{itemize}
\item Provide mechanisms to generate cluster ensembles based on reweighting the data set (assuming base clusterers allowing for case weights).
\item Explore recent advances (e.g., parallelized random search) in heuristics for solving the multi-dimensional assignment problem.
\item Add support for \emph{additive trees} \citep[e.g.,][]{cluster:Barthelemy+Guenoche:1991}.
\item Add heuristics for finding least squares fits based on iterative projection on convex sets of constraints, see e.g.\ \cite{cluster:Hubert+Arabie+Meulman:2006} and the accompanying MATLAB code available at \url{http://cda.psych.uiuc.edu/srpm_mfiles} for using these methods (instead of SUMT approaches) to fit ultrametrics and additive trees to proximity data.
\item Add an ``$L_1$ View''.
Emphasis in \pkg{clue}, in particular for obtaining consensus clusterings, is on using Euclidean dissimilarities (based on suitable least squares distances); arguably, more ``robust'' consensus solutions should result from using Manhattan dissimilarities (based on absolute distances). Adding such functionality necessitates developing the corresponding structure theory for soft Manhattan median partitions. Minimizing average Manhattan dissimilarity between co-memberships and ultrametrics results in constrained $L_1$ approximation problems for the weighted medians of the co-memberships and ultrametrics, respectively, and could be approached by employing SUMTs analogous to the ones used for the $L_2$ approximations. \item Provide heuristics for obtaining \emph{hard} consensus partitions. \item Add facilities for tuning hyper-parameters (most prominently, the number of classes employed) and ``cluster validation'' of partitioning algorithms, as recently proposed by \cite{cluster:Roth+Lange+Braun:2002}, \cite{cluster:Lange+Roth+Braun:2004}, \cite{cluster:Dudoit+Fridlyand:2002}, and \cite{cluster:Tibshirani+Walther:2005}. \end{itemize} We are hoping to be able to provide many of these extensions in the near future. \subsubsection*{Acknowledgments} We are grateful to Walter B\"ohm for providing efficient C code for solving assignment problems. {\small \bibliographystyle{abbrvnat} \bibliography{cluster} } \end{document}