Predictor-Dependent Clustering with CausalMixGPD

Introduction

The clustering workflow within CausalMixGPD relies on Bayesian nonparametric mixture modeling, just as the density estimation and causal workflows do. However, the goal of inference shifts from predicting future responses under the fitted model to inferring the latent partition created by the mixing process. Accordingly, the outputs include the posterior similarity matrix (PSM), cluster labels from a representative partition, cluster sizes, and cluster membership scores (Binder 1978; Dahl 2006).

This aspect of CausalMixGPD’s functionality is central because the package can discover subgroups through the latent random partition induced by the Dirichlet process mixture and fit flexible outcome models within each discovered cluster. The software article lists clustering as one of the package’s three core features, alongside one-arm modeling and causal inference.

This vignette provides an overview of the package’s clustering workflow. First, we introduce the latent-partition notation and the types of predictor dependence the package allows. Next, we run a clustering procedure on a built-in dataset, summarize the resulting random partition through label-invariant objects, and predict cluster labels for new observations. Finally, we list the customizable elements specific to clustering.

Notation and short theory background

Mixture models and latent cluster labels

Let \(Y\) denote the response and let \(X\) denote the predictor vector. In a mixture model, the conditional density can be written abstractly as

\[ f(y \mid x) = \sum_{j=1}^{\infty} w_j(x) k(y \mid x, \theta_j), \]

where both the weights and the component-specific parameters may or may not depend on predictors. Introduce a latent allocation variable \(z_i\) for observation \(i\), with \(z_i = j\) meaning that observation \(i\) is assigned to component \(j\). The conditional response model can then be written as

\[ y_i \mid z_i, x_i, \{\theta_j\} \sim k(\cdot \mid x_i, \theta_{z_i}). \]

The collection \(\Pi = \{z_1, \ldots, z_n\}\) determines a random partition of the observations. In Bayesian nonparametric clustering, the posterior distribution of this partition is the central inferential object.
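For intuition, a single draw of labels and the partition it induces can be inspected directly in base R. The labels below are invented for illustration and are not output from CausalMixGPD:

```r
# One hypothetical draw of allocation labels for n = 6 observations.
z <- c(1, 1, 2, 2, 2, 3)

# The labels induce a partition: observations with equal labels share a block.
blocks <- split(seq_along(z), z)
lengths(blocks)  # block sizes: 2, 3, 1

# Co-clustering indicator for this single draw; averaging such matrices
# over posterior draws is what produces the PSM.
co <- outer(z, z, `==`)
co[1, 2]  # TRUE: observations 1 and 2 share a component in this draw
```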

Why labels are not directly interpretable

Mixture component labels are not identifiable across MCMC draws. One posterior draw may label a cluster as component 1 while another labels the same substantive group as component 4. Because of this label-switching phenomenon, cluster summaries should not be based on raw component labels alone.

Instead, the package reports label-invariant summaries. The most important one is the posterior similarity matrix, whose \((i,\ell)\) entry is

\[ S_{i\ell} = \Pr(z_i = z_\ell \mid \text{data}). \]

This is estimated by averaging co-clustering indicators over retained posterior draws. A representative hard partition is then obtained using Dahl’s least-squares criterion (Dahl 2006).
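As a concrete sketch outside the package, both the PSM estimate and Dahl’s least-squares partition can be computed from a matrix of retained label draws in a few lines of base R. The label matrix `Z` here is simulated, not CausalMixGPD output:

```r
set.seed(1)
n <- 5; ndraw <- 100
# Simulated label draws: rows = retained MCMC draws, columns = observations.
Z <- matrix(sample(1:2, ndraw * n, replace = TRUE), nrow = ndraw)
Z[, 1:2] <- 1L  # force observations 1 and 2 to co-cluster in every draw

# PSM: average the co-clustering indicator matrices over draws.
psm <- Reduce(`+`, lapply(seq_len(ndraw), function(t)
  outer(Z[t, ], Z[t, ], `==`))) / ndraw

# Dahl (2006): the representative draw minimizes the squared (Frobenius)
# distance between its co-clustering matrix and the PSM.
loss <- sapply(seq_len(ndraw), function(t)
  sum((outer(Z[t, ], Z[t, ], `==`) - psm)^2))
z_rep <- Z[which.min(loss), ]
psm[1, 2]  # 1: observations 1 and 2 co-cluster in every retained draw
```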

Predictor dependence in clustering

The package supports three distinct clustering modes, each corresponding to a different way in which predictors can affect the mixture model.

Weight dependence

In the weight-dependent mode,

\[ f(y \mid x) = \sum_{j=1}^{\infty} w_j(x) k(y \mid \theta_j). \]

Here the cluster-specific kernel parameters are shared, but the prevalence of the clusters changes with predictors. This is useful when the goal is to let covariates modify which latent groups are more likely without changing the within-group model form.

Parameter dependence

In the parameter-dependent mode,

\[ f(y \mid x) = \sum_{j=1}^{\infty} w_j k(y \mid x, \theta_j). \]

The weights are global, but each cluster has its own predictor-response relationship. This is useful when clusters represent different regression regimes.

Weight and parameter dependence

In the most flexible mode,

\[ f(y \mid x) = \sum_{j=1}^{\infty} w_j(x) k(y \mid x, \theta_j). \]

Here both cluster prevalence and within-cluster behavior can vary with predictors.
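To make the three modes concrete, the following toy density implements the fully dependent case with two normal components, a logistic predictor-dependent weight, and linear component means. All parameter values are invented for illustration and are not tied to the package’s actual kernels or priors:

```r
# "Both" mode with J = 2 components: predictor-dependent weight w1(x)
# and predictor-dependent kernel means mu_j(x). Values are made up.
f_both <- function(y, x) {
  w1  <- plogis(0.5 + 1.2 * x)   # weight of component 1 at x
  mu1 <- -1 + 0.8 * x            # component 1 mean at x
  mu2 <-  2 - 0.3 * x            # component 2 mean at x
  w1 * dnorm(y, mu1, 1) + (1 - w1) * dnorm(y, mu2, 1)
}

# Dropping the x-dependence from w1 recovers the parameter-dependent mode;
# dropping it from the means recovers the weight-dependent mode.
# For any fixed x, f_both(., x) is a proper density:
integrate(f_both, -Inf, Inf, x = 0.7)$value  # ~1
```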

Main clustering summaries returned by the package

The clustering wrappers produce fitted objects whose predictions are cluster-focused rather than density-focused: the posterior similarity matrix, representative cluster labels, cluster sizes, and membership (certainty) scores.

These summaries allow the user to distinguish between strong clustering structure and uncertain partitions. In many applications, the PSM is more informative than a single hard partition because it shows which observations cluster together consistently and which do not.

Function map for the clustering workflow

The main exported functions appearing in this vignette are the fitting wrappers dpmix.cluster() and dpmgpd.cluster(), together with the predict(), summary(), and plot() methods for the fitted clustering objects.

Package setup

library(CausalMixGPD)
library(ggplot2)
mcmc_vig <- list(
  niter = 1200,
  nburnin = 300,
  thin = 2,
  nchains = 2,
  seed = 2026
)
# Used when regenerating clustering vignette artifacts (tools/.Rscripts/build_vignette_fits.R).
mcmc_clust <- list(
  niter = 1200,
  nburnin = 300,
  thin = 2,
  nchains = 2,
  seed = 2029
)

Data

We use the bundled dataset nc_realX100_p3_k2, which contains a real-valued response and three predictors.

data("nc_realX100_p3_k2", package = "CausalMixGPD")
dat_cl <- data.frame(
  y = nc_realX100_p3_k2$y,
  nc_realX100_p3_k2$X
)

For illustration, we form a simple train/test split so that we can show both training-cluster summaries and out-of-sample label prediction. The chunk below shows the sampling code used for the shipped artifacts; the rendered row counts come from inst/extdata/clustering_outputs.rds in a hidden chunk.

set.seed(123)
idx_train <- sample(seq_len(nrow(dat_cl)), size = 80)
train_dat <- dat_cl[idx_train, ]
test_dat <- dat_cl[-idx_train, ]
nrow(train_dat)
nrow(test_dat)
#> [1] 80
#> [1] 20

A quick visualization of the response helps motivate the use of a flexible mixture model.

ggplot(train_dat, aes(x = y)) +
  geom_histogram(aes(y = after_stat(density)), bins = 25,
                 fill = "grey85", colour = "grey35") +
  geom_density(linewidth = 0.9) +
  labs(x = "y", y = "Density", title = "Training-sample response distribution") +
  theme_minimal()

Fitting a clustering model

The bulk-only clustering wrapper is dpmix.cluster(). It uses the same general mixture machinery as the one-arm interface, but returns a clustering-oriented fitted object.

fit_cluster <- dpmix.cluster(
  formula = y ~ x1 + x2 + x3,
  data = train_dat,
  kernel = "laplace",
  type = "both",
  components = 8,
  mcmc = mcmc_clust
)

The choice type = "both" requests dependence in both the mixing weights and the component-specific parameters. In practice, this is the most flexible clustering specification because it allows both cluster prevalence and cluster-specific regression behavior to vary with covariates.

Posterior similarity matrix

The first label-invariant summary is the PSM.

z_train_psm <- predict(fit_cluster, type = "psm")
z_train_psm
#> Cluster PSM
#> n         : 80 
#> components: 8 
#> draw_index: 124
plot(z_train_psm, type = "summary")

The block structure in the PSM is the main diagnostic for cluster separation. Darker within-block regions indicate pairs of observations that co-cluster consistently across posterior draws, while diffuse or weak block boundaries indicate uncertainty about the partition.
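The reordering trick behind such heatmaps is easy to reproduce on a toy PSM; with a real fit one would instead pass the matrix returned by predict(fit_cluster, type = "psm"). The values below are invented:

```r
# Toy PSM with three blocks of strong co-clustering plus background noise.
z_ref <- rep(1:3, times = c(4, 3, 3))
psm_toy <- outer(z_ref, z_ref, `==`) * 0.85 + 0.05
diag(psm_toy) <- 1  # every observation always co-clusters with itself

# Reordering rows and columns by a reference partition makes blocks visible.
ord <- order(z_ref)
image(psm_toy[ord, ord], axes = FALSE,
      main = "Toy PSM reordered by reference partition")
```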

Representative training partition

The next step is to extract representative cluster labels for the training sample.

z_train_lab <- predict(fit_cluster, type = "label", return_scores = TRUE)
z_train_lab
#> Cluster labels (train)
#> n         : 80 
#> components: 8 
#> sizes     : 1:17, 3:15, 2:14, 5:11, 7:11, 6:7, 4:3, 8:2
plot(z_train_lab, type = "summary")

The representative partition provides a single interpretable clustering summary on the response scale. It is convenient for reporting, but it should be read together with the PSM because the latter conveys the posterior uncertainty in the clustering structure.

Predicting labels for new observations

A distinctive feature of the package is that it also supports cluster prediction for new observations.

z_test <- predict(fit_cluster, newdata = test_dat, type = "label", return_scores = TRUE)
z_test
#> Cluster labels (newdata)
#> n         : 20 
#> components: 8 
#> sizes     : 3:8, 2:7, 1:5
summary(
  predict(fit_cluster, newdata = test_dat, type = "label", return_scores = TRUE)
)$cluster_profiles
#>   cluster n    y_mean     y_sd    x1_mean     x1_sd      x2_mean     x2_sd
#> 1      C3 8 0.3647947 1.788025  0.7937694 0.3171469  0.138712164 0.4022500
#> 2      C2 7 0.5136280 1.420225 -0.5704476 0.4665936 -0.103106058 0.5569255
#> 3      C1 5 1.7266609 1.766122 -0.4420492 0.8613412  0.006034279 0.6154273
#>      x3_mean     x3_sd certainty_mean certainty_sd
#> 1  0.6107905 1.2048713      0.2412648   0.02468364
#> 2 -1.3820866 0.5893378      0.2418536   0.02644034
#> 3  0.6799718 0.7494218      0.3146114   0.09348034
plot(z_test, type = "sizes")

This prediction step maps new observations into the space of the representative training clusters. That is useful in applications where the discovered subgroup structure needs to be carried forward to new units without refitting the entire model.
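The package’s internal assignment rule is model-based, but the flavor of the mapping can be illustrated with a deliberately simplified stand-in: nearest-centroid assignment in predictor space. This is not the package’s algorithm, just a conceptual analogue on invented data:

```r
set.seed(3)
# Invented training predictors with representative labels, plus new points.
X_train <- matrix(rnorm(20 * 2), ncol = 2)
z_train <- rep(1:2, each = 10)
X_train[z_train == 2, ] <- X_train[z_train == 2, ] + 3  # shift cluster 2
X_new <- rbind(c(0, 0), c(3, 3))

# Centroid of each representative training cluster in predictor space.
centroids <- rbind(colMeans(X_train[z_train == 1, ]),
                   colMeans(X_train[z_train == 2, ]))

# Assign each new observation to the nearest centroid.
z_new <- apply(X_new, 1, function(x)
  which.min(colSums((t(centroids) - x)^2)))
z_new  # the point near (0,0) maps to cluster 1, the one near (3,3) to cluster 2
```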

Analysis summary

The clustering workflow in CausalMixGPD is most informative when the user interprets it as a posterior partition analysis rather than as a hard unsupervised classification algorithm. The PSM shows which observations reliably belong together, the representative labels provide a convenient summary partition, and the test-set predictions extend that partition to new observations.

The three predictor-dependence modes are scientifically meaningful rather than cosmetic. Weight dependence is useful when covariates mainly affect subgroup prevalence. Parameter dependence is useful when clusters correspond to different regression surfaces. The combined mode is appropriate when both mechanisms are plausible. In all three cases, the package reports label-invariant summaries so that posterior uncertainty about the partition is not hidden behind arbitrary component labels.

Customization options for clustering models

The software appendix gives a compact map of the clustering-specific controls; the most important ones are summarized here.

Clustering mode

type = "weights" places predictor dependence in the mixing weights.

type = "param" places predictor dependence in the component-specific kernel parameters.

type = "both" allows both forms of dependence simultaneously.

Wrapper choice

dpmix.cluster() fits the bulk-only clustering model.

dpmgpd.cluster() adds an active GPD tail and should be considered when tail behavior is part of the clustering problem.

Kernel, backend, and truncation

kernel selects the bulk family, subject to the support of the response.

components controls the truncation size. In clustering problems, under-truncation can hide meaningful subgroup structure, so this argument deserves explicit attention.
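One hedged way to check whether the truncation binds, given access to retained label draws, is to count occupied components per draw. The matrix Z below is simulated; extracting such draws from a fitted object is an assumption here, not a documented package call:

```r
set.seed(2)
# Simulated label draws: 50 retained draws, 30 observations, truncation 4.
Z <- matrix(sample(1:4, 50 * 30, replace = TRUE), nrow = 50)

# Number of occupied (non-empty) components in each retained draw.
occupied <- apply(Z, 1, function(z) length(unique(z)))
table(occupied)
max(occupied)  # if this regularly hits `components`, increase the truncation
```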

The package handles some backend details internally based on the clustering mode, but users still choose the scientific dependence structure through type.

Monitoring and output controls

monitor, monitor_latent, and monitor_v control how much posterior state is retained.

predict(type = "psm") and predict(type = "label") determine whether the user wants a matrix-valued co-clustering summary or representative labels.

return_scores = TRUE can be used when membership scores or certainty summaries are needed.

summary() reports cluster sizes and within-cluster profiles, while plot() supports PSM heatmaps and label-summary graphics such as cluster sizes and certainty.

Session information

sessionInfo()
#> R version 4.5.3 (2026-03-11 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_4.0.2      CausalMixGPD_0.7.0 nimble_1.4.0      
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6        jsonlite_2.0.0      dplyr_1.2.0        
#>  [4] compiler_4.5.3      tidyselect_1.2.1    parallel_4.5.3     
#>  [7] jquerylib_0.1.4     scales_1.4.0        yaml_2.3.12        
#> [10] fastmap_1.2.0       lattice_0.22-9      coda_0.19-4.1      
#> [13] R6_2.6.1            labeling_0.4.3      generics_0.1.4     
#> [16] igraph_2.2.1        knitr_1.51          tibble_3.3.0       
#> [19] bslib_0.10.0        pillar_1.11.1       RColorBrewer_1.1-3 
#> [22] rlang_1.1.7         cachem_1.1.0        xfun_0.56          
#> [25] sass_0.4.10         S7_0.2.1            viridisLite_0.4.3  
#> [28] cli_3.6.5           withr_3.0.2         magrittr_2.0.4     
#> [31] digest_0.6.39       grid_4.5.3          rstudioapi_0.18.0  
#> [34] lifecycle_1.0.5     vctrs_0.7.1         evaluate_1.0.5     
#> [37] pracma_2.4.6        glue_1.8.0          farver_2.1.2       
#> [40] numDeriv_2016.8-1.1 rmarkdown_2.30      tools_4.5.3        
#> [43] pkgconfig_2.0.3     htmltools_0.5.9
Binder, David A. 1978. “Bayesian Cluster Analysis.” Biometrika 65 (1): 31–38. https://doi.org/10.1093/biomet/65.1.31.
Dahl, David B. 2006. “Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model.” In Bayesian Inference for Gene Expression and Proteomics. Cambridge University Press.