This document shows the basics of applying our Bayesian model-based
clustering/classification with joint batch correction in
R. It shows how to
generate some toy data, apply the model, assess convergence and process outputs.
We simulate some data using the
library(ggplot2) library(batchmix) # Data dimensions N <- 600 P <- 4 K <- 5 B <- 7 # Generating model parameters mean_dist <- 2.25 batch_dist <- 0.3 group_means <- seq(1, K) * mean_dist batch_shift <- rnorm(B, mean = batch_dist, sd = batch_dist) std_dev <- rep(2, K) batch_var <- rep(1.2, B) group_weights <- rep(1 / K, K) batch_weights <- rep(1 / B, B) dfs <- c(4, 7, 15, 60, 120) my_data <- generateBatchData( N, P, group_means, std_dev, batch_shift, batch_var, group_weights, batch_weights, type = "MVT", group_dfs = dfs )
This gives us a named list with two related datasets, the
which includes batch effects and the
corrected_data which is batch-free. It
group_IDs, a vector indicating class membership for each item,
batch_IDs, which indicates batch of origin for each item, and
which indicates which labels are observed and fixed in the model. We pull these
out of the names list in the format that the modelling functions desire them.
X <- my_data$observed_data true_labels <- my_data$group_IDs fixed <- my_data$fixed batch_vec <- my_data$batch_IDs alpha <- 1 initial_labels <- generateInitialLabels(alpha, K, fixed, true_labels)
Given some data, we are interested in modelling it. We assume here that the set of observed labels includes at least one example of each class in the data.
# Sampling parameters R <- 1000 thin <- 50 n_chains <- 4 # Density choice type <- "MVT" # MCMC samples and BIC vector mcmc_output <- runMCMCChains( X, n_chains, R, thin, batch_vec, type, initial_labels = initial_labels, fixed = fixed )
We want to assess two things. First, how frequently the proposed parameters in the Metropolis-Hastings step are accepted:
Secondly, we want to asses how well our chains have converged. To do this we plot the
complete_likelihood of each chain. This is the quantity most relevant to a clustering/classification, being dependent on the labels. The
observed_likelihood is independent of labels and more relevant for density estimation.
We see that our chains disagree. We have to run them for more iterations. We use the
continueChains function for this.
R_new <- 9000 # Given an initial value for the parameters new_output <- continueChains( mcmc_output, X, fixed, batch_vec, R_new, keep_old_samples = TRUE )
To see if the chains better agree we re-plot the likelihood.
We also re-check the acceptance rates.
This looks like several of the chains agree by the 5,000th iteration.
We process the chains, acquiring point estimates of different quantities.
# Burn in burn <- 5000 # Process the MCMC samples processed_samples <- processMCMCChains(new_output, burn)
For multidimensional data we use a PCA plot.
chain_used <- processed_samples[] pc <- prcomp(X, scale = T) pc_batch_corrected <- prcomp(chain_used$inferred_dataset) plot_df <- data.frame( PC1 = pc$x[, 1], PC2 = pc$x[, 2], PC1_bf = pc_batch_corrected$x[, 1], PC2_bf = pc_batch_corrected$x[, 2], pred_labels = factor(chain_used$pred), true_labels = factor(true_labels), prob = chain_used$prob, batch = factor(batch_vec) ) plot_df |> ggplot(aes( x = PC1, y = PC2, colour = true_labels, alpha = prob )) + geom_point()
plot_df |> ggplot(aes( x = PC1_bf, y = PC2_bf, colour = pred_labels, alpha = prob )) + geom_point()
test_inds <- which(fixed == 0) sum(true_labels[test_inds] == chain_used$pred[test_inds]) / length(test_inds)
##  0.806383