--- title: "Preparing data and choosing sampler settings" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Preparing data and choosing sampler settings} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Before calling `pgmm_rjmcmc()`, the analyst must specify the data orientation, scaling, missing-value handling, latent-factor dimension, cluster range, and starting covariance model. ## Matrix orientation `bpgmm` expects a numeric matrix with variables in rows and observations in columns. Many R data sets use the opposite convention, so transpose after selecting numeric variables. The package convention is \[ X = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{p1} & \cdots & x_{pn} \end{bmatrix}, \] where row \(j\) is variable \(j\) and column \(i\) is observation \(x_i\). ```{r} library(bpgmm) iris_numeric <- as.matrix(iris[, 1:4]) iris_labels <- as.integer(iris$Species) dim(iris_numeric) X <- t(iris_numeric) dim(X) ``` Rows now correspond to variables and columns correspond to observations. ```{r} rownames(X) X[, 1:3] ``` ## Check finite numeric inputs The sampler requires finite numeric values. Handle missing values before fitting. Common choices include complete-case filtering, domain-specific imputation, or fitting the model to a subset of variables with reliable measurements. ```{r} all(is.finite(X)) storage.mode(X) ``` ## Scale variables Mixture models are sensitive to measurement scale. If variables are measured in different units, standardizing each variable is usually a sensible default. For each variable \(j\), the usual transformation is \[ x_{ji}^{\mathrm{scaled}} = \frac{x_{ji} - \bar{x}_{j\cdot}}{s_j}, \qquad s_j^2 = \frac{1}{n - 1}\sum_{i=1}^n (x_{ji} - \bar{x}_{j\cdot})^2 . \] ```{r} X_scaled <- t(scale(t(X))) round(rowMeans(X_scaled), 6) round(apply(X_scaled, 1, sd), 6) ``` The package does not scale internally because some scientific applications need the original measurement scale. Scaling should be an explicit analysis choice. ## Choose `q_new` `q_new` is the latent-factor dimension assigned to newly proposed clusters. It controls the dimension of the factor-analyzer part of the covariance model. In the paper's notation, \[ \Lambda_k \in \mathbb{R}^{p \times q_k}, \qquad y_{ki} \in \mathbb{R}^{q_k}. \] The package uses `q_new` as the \(q_k\) value for a newly created component. Useful starting points: - `q_new = 1` for very small examples or when covariance structure should be simple. - `q_new = 2` or `3` for moderate-dimensional data. - Larger values only when the number of variables and observations support the extra covariance flexibility. The value should be smaller than the number of observed variables. ```{r} p <- nrow(X_scaled) q_new <- min(2, p - 1) q_new ``` ## Choose `m_range` `m_range` is the allowed cluster-number range. A wide range gives RJMCMC more freedom, but also increases the model space. Start with a scientifically reasonable range, then assess sensitivity. ```{r} m_init <- 3 m_range <- c(1, 5) ``` For data with a known reference label, such as `iris`, the reference partition gives a simple check on the range. In unsupervised applications, use domain knowledge and exploratory plots. ```{r, fig.width = 5.5, fig.height = 4.2, fig.alt = "Scatter plot of scaled iris variables colored by species."} species_cols <- c("#0072B2", "#D55E00", "#009E73") plot( X_scaled[1, ], X_scaled[2, ], col = species_cols[iris_labels], pch = 19, xlab = rownames(X_scaled)[1], ylab = rownames(X_scaled)[2], main = "Scaled iris data", asp = 1 ) legend( "topleft", legend = levels(iris$Species), col = species_cols, pch = 19, bty = "n" ) ``` ## Choose the starting covariance model The three-letter model labels describe whether loading matrices and noise covariances are shared across clusters. `UUU` is flexible; `CCC` is more constrained. A flexible starting model is often reasonable when using `v_step = 1`, because the sampler can move across covariance structures. The fitted covariance is always \[ \Sigma_k = \Lambda_k\Lambda_k^\top + \Psi_k, \] but the label controls whether \(\Lambda_k\) and \(\Psi_k\) are shared and whether \(\Psi_k\) is isotropic or diagonal. ```{r} model_to_constraint("UUU") constraint_to_model(c(1, 1, 1)) ``` ## A prepared call The call below is not evaluated in the vignette because applied analyses should use longer chains and repeated runs. It records how the prepared objects enter the package interface. ```{r, eval = FALSE} fit <- pgmm_rjmcmc( X = X_scaled, m_init = m_init, m_range = m_range, q_new = q_new, burn = 1000, niter = 5000, constraint = "UUU", m_step = 1, v_step = 1, split_combine = 1, verbose = FALSE ) summarize_pgmm_rjmcmc(fit, true_cluster = iris_labels) ``` ## Practical checklist Before fitting: - verify that `X` is numeric with variables in rows; - remove or impute missing and non-finite values; - decide whether variables should be scaled; - choose a scientifically plausible `m_range`; - choose `q_new` smaller than the number of variables; - use multiple chains for applied analyses.