--- title: "rCISSVAE Vignette" date: "`r format(Sys.Date(), '%Y-%m-%d')`" output: rmarkdown::html_vignette: toc: true toc_depth: 2 fig_width: 7 fig_height: 5 dpi: 600 vignette: > %\VignetteIndexEntry{rCISSVAE Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, warning = FALSE, error = FALSE, fig.width = 7, fig.height = 5, dpi = 600, collapse = TRUE, comment = "#>" ) library(kableExtra) ``` ## Installing package Install devtools or remotes if not already installed: ```{r eval=FALSE} install.packages("remotes") # or install.packages("devtools") ``` The rCISSVAE package can be installed with: ```{r eval=FALSE} remotes::install_github("CISS-VAE/rCISS-VAE") # or devtools::install_github("CISS-VAE/rCISS-VAE") ``` ## Ensuring correct virtual environment for reticulate This package uses `reticulate` to interface with the python version of the package `ciss_vae`. Therefore, it is necessary to make sure that you have a venv or conda environment set up that has the `ciss_vae` package installed. If you are comfortable creating an environment and installing the package, great! Then all you need to do is tell reticulate where to point. **For Venv** ```{r eval=FALSE} reticulate::use_virtualenv("./.venv", required = TRUE) ``` **For conda** ```{r eval=FALSE} reticulate::use_condaenv("myenv", required = TRUE) ``` ### Virtual environment helper function If you do not want to manually create the virtual environment, you can use the helper function `create_cissvae_env()` to create a virtual environment (venv) in your current working directory. ```{r eval=FALSE} create_cissvae_env( envname = "./cissvae_environment", ## name of environment path = NULL, ## add path to wherever you want virtual environment to be install_python = FALSE, ## set to TRUE if you want create_cisssvae_env to install python for you python_version = "3.10" ## set to whatever version you want >=3.10. Python 3.10 or 3.11 recommended ) ``` > **Note:** If you run into issues with create_cissvae_env(), you can create the virtual environment manually by following this [tutorial](virtual_environment_tutorial.html) Once the environment is created, activate it using: ```{r eval=FALSE} reticulate::use_virtualenv("./cissvae_environment", required = TRUE) # If you used a non-default environment name then, # reticulate::use_virtualenv("./your_environment_name", required = TRUE) ``` ### (optional) Installing other python packages If you want to install other python packages (eg seaborn) to your environment, you can use [`reticulate::virtualenv_install()`](https://rstudio.github.io/reticulate/reference/virtualenv-tools.html). ### (optional) Use `check_devices()` to see available gpu devices Use `check_devices()` to see what cpu/gpu devices are available for model training. Optionally pass the path to your virtual environment using the parameter `env_path` to make sure that the correct environment is activated. ```{r check_devices, eval = FALSE} rCISSVAE::check_devices() ``` ``` Available Devices: • cpu (Main system processor) [1] "cpu" ``` # Quickstart Once reticulate is pointing to the virtual environment containing the `ciss_vae` python package, you can either use the `run_cissvae` function or the `autotune_cissvae` function. For details on using the CISSVAE model with binary/categorical values see [binary variables tutorial](https://ciss-vae.github.io/rCISS-VAE/articles/binary_variables_tutorial.html). If you know what hyperparameters you want to use for the model, use the `run_cissvae` function. ## Run CISSVAE with Training History and Progress Tracking Your data should be in a `DataFrame` format with optional index column. If you already have clusters you want to use, they should be in a separate vector from the dataframe. If you do not have clusters to begin with, set 'clusters' in `run_cissvae()` to NULL. ### Example Dataset With the rCISSVAE package comes a sample dataset with predetermined clusters. We will use this for the sake of the tutorial. The dataset, `df_missing` contains an index column as well as the following: ```{r show_dataset, echo=FALSE} suppressPackageStartupMessages({ library(tidyverse) library(reticulate) library(rCISSVAE) library(gtsummary) library(kableExtra) }) data(df_missing) data(clusters) tbl_summary(df_missing, include = -"index") %>% kable() %>% kable_styling(font_size = 12) ``` Age, Salary and ZipCode columns represent demographic data with no missingness and columns Y1t-Y5t represent biomarker data obtained at different timepoints t. ### Explanation of `run_cissvae()` Parameters The `run_cissvae()` function is a comprehensive wrapper for all basic steps in running the CISS-VAE model, including dataset preparation, optional clustering, and running the imputation model.
Click for a detailed explanation of all parameters **Dataset Parameters** - data: A `DataFrame` containing the dataset to be imputed. Contains optional index column. - index_col: Name of index column to be preserved when imputing. Index column will not have any values held out for validation - val_proportion: Fraction of non-missing entries to hold out for validation during training. To use different proportions for each cluster, pass a vector. - replacement_value: Fill value for masked entries during training. Default is 0.0. - columns_ignore: Character or integer vector containing columns to exclude when selecting validation data. These columns will be used during training. - print_dataset: Set `TRUE` to print dataset summary information during processing. **Clustering Parameters (optional)** - clusters: Vector of one cluster label per row of 'data' dataframe. If NULL, will automatically determine clusters using Leiden Clustering or [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). - n_clusters: Number of clusters for KMeans clustering when 'clusters' is NULL. If n_clusters is NULL, will use Leiden Clustering for clustering. - leiden_resolution: Resolution parameter for Leiden clustering. Defaults to 0.5. - k_neighbors: Number of nearest neighbors for the Leiden KNN graph construction. Defaults to 15. - leiden_objective: Objective function for Leiden clustering. One of {"CPM", "RB", "Modularity"}. Defaults to "CPM". - missingness_proportion_matrix: Optional pre-computed missingness proportion matrix for feature-based clustering. If provided, clustering will be based on these proportions instead of direct 0/1 missingness pattern. - scale_features: Set `TRUE` to scale features when using missingness proportion matrix clustering. **Model Parameters** - hidden_dims: A vector containing the sizes of hidden layers in encoder/decoder. The length of this vector determines number of hidden layers. - latent_dim: The dimension of the latent space representation. - layer_order_enc: A vector stating the pattern of 'shared' and 'unshared' layers for the encoder. The length must match `length(hidden_dims)`. Default c('unshared', 'unshared', 'unshared'). - layer_order_dec: A vector stating the pattern of 'shared' and 'unshared' layers for the decoder. The length must match `length(hidden_dims)`. Default c('shared', 'shared', 'shared'). - latent_shared: Whether latent space weights are shared across clusters. If FALSE, will have separate latent weights for each cluster. - ouput_shared: If FALSE, will have separate output layer for each cluster. - batch_size: Integer. Mini-batch size for training. Larger values may improve training stability but require more memory. - return_model: If TRUE, returns the model object. Set TRUE to use `plot_vae_architecture()` after running. - epochs: Number of epochs for initial training phase - initial_lr: Initial learning rate for optimizer. - decay_factor: Exponential decay factor for learning rate. - beta: Weight for KL divergence term in VAE loss function. - device: Device specification for computation ("cpu" or "cuda"). If NULL, automatically selects best available device. - max-loops: Max number of impute-refit loops to perform. - patience: Training stops if validation loss doesn't improve for this many consecutive impute-refit loops. - epochs-per-loop: Number of epochs per refit loop. If null, uses same value as `epochs`. Default NULL. - decay_factor_refit: Decay factor for refit loops. If NULL, uses same value as decay_factor. Default NULL. - beta_refit: KL weight for refit loops. If NULL, uses same value as beta. Default NULL. **Optional Parameters** - verbose: Set `TRUE` to print MSE for each loop as it runs. - return_silhouettes: If clusters not given, will return silhouette scores for automatic clustering. - return_history: If `TRUE`, returns training history as data.frame. Good for checking for overfitting. - return_dataset: If `TRUE`, returns ClusterDataset object.
### Running the imputation To run the imputation model, first load your data. You can use the `cluster_summary()` function to visualize the missingness by cluster. The `cluster_summary()` function builds off of {[gtsummary](https://www.danieldsjoberg.com/gtsummary/index.html)}. ```{r eval=FALSE, message=FALSE, warning=FALSE} library(tidyverse) library(reticulate) library(rCISSVAE) library(gtsummary) ## Set correct virtualenv reticulate::use_virtualenv("./cissvae_environment", required = TRUE) ## Load the data data(df_missing) data(clusters) ## actual cluster labels in clusters$clusters (other column is index) cluster_summary( data = df_missing, clusters = clusters$clusters, include =setdiff(names(df_missing), "index"), statistic = list( all_continuous() ~ "{mean} ({sd})", all_categorical() ~ "{n} / {N}\n ({p}%)"), missing = "always") ``` ```{r showruncissvae2, echo=FALSE} suppressPackageStartupMessages({ library(tidyverse) library(rCISSVAE) library(gtsummary) }) data(df_missing) data(clusters) cluster_summary( data = df_missing, clusters = clusters$clusters, include =setdiff(names(df_missing), "index"), statistic = list( all_continuous() ~ "{mean} ({sd})", all_categorical() ~ "{n} / {N}\n ({p}%)"), missing = "always") ``` Then, plug your data and clusters into the `run_cissvae()` function. ```{r eval=FALSE, message=FALSE, warning=FALSE} ## Run the imputation model. dat = run_cissvae( data = df_missing, index_col = "index", val_proportion = 0.1, ## pass a vector for different proportions by cluster columns_ignore = c("Age", "Salary", "ZipCode10001", "ZipCode20002", "ZipCode30003"), ## If there are columns in addition to the index you want to ignore when selecting validation set, list them here. In this case, we ignore the 'demographic' columns because we do not want to remove data from them for validation purposes. clusters = clusters$clusters, ## we have precomputed cluster labels so we pass them here epochs = 500, return_silhouettes = FALSE, return_history = TRUE, # Get detailed training history verbose = FALSE, return_model = TRUE, ## Allows for plotting model schematic device = "cpu", # Explicit device selection layer_order_enc = c("unshared", "shared", "unshared"), layer_order_dec = c("shared", "unshared", "shared") ) ## Retrieve results imputed_df <- dat$imputed silhouette <- dat$silhouettes training_history <- dat$history # Detailed training progress ## Plot training progress if (!is.null(training_history)) { plot(training_history$epoch, training_history$loss, type = "l", main = "Training Loss Over Time", xlab = "Epoch", ylab = "Loss") } plot_vae_architecture(model = dat$model, save_path = "test_plot_arch.png") print(head(dat$imputed_dataset)) ``` ``` Cluster dataset: ClusterDataset(n_samples=8000, n_features=30, n_clusters=4) • Original missing: 61800 / 200000 (30.90%) • Validation held-out: 13783 (9.97% of non-missing) • .data shape: (8000, 30) • .masks shape: (8000, 30) • .val_data shape: (8000, 30) ``` ![](test_plot_arch.png) ```{r showres1, echo=FALSE} dat <- readRDS(system.file("extdata", "demo_run.rds", package = "rCISSVAE")) print(head(dat$imputed_dataset)) ``` ## Clustering Features by Missingness Patterns Before running CISS-VAE, you can cluster features based on their missingness patterns. This helps identify features that tend to be missing together systematically, which can improve imputation quality. ```{r clusfeatmissing, eval = FALSE} library(rCISSVAE) data(df_missing) cluster_result <- cluster_on_missing( data = df_missing, cols_ignore = c("index", "Age", "Salary", "ZipCode10001", "ZipCode20002", "ZipCode30003"), n_clusters = 4, # Use KMeans with 4 clusters seed = 42 ) cluster_summary(df_missing, factor(cluster_result$clusters), include = setdiff(names(df_missing), "index"), statistic = list( gtsummary::all_continuous() ~ "{mean} ({sd})", gtsummary::all_categorical() ~ "{n} / {N}\n ({p}%)"), missing = "always") cat(paste("Clustering quality (silhouette):", round(cluster_result$silhouette, 3))) result <- run_cissvae( data = df_missing, index_col = "index", clusters = cluster_result$clusters, return_history = TRUE, verbose = FALSE, device = "cpu" ) ``` ```{r showres2, echo=FALSE} library(gtsummary) cluster_summary(df_missing, factor(clusters$clusters), include = setdiff(names(df_missing), "index"), statistic = list( all_continuous() ~ "{mean} ({sd})", all_categorical() ~ "{n} / {N}\n ({p}%)"), missing = "always") cat("Clustering quality (silhouette): 0.135") ``` ### Using Pre-computed Missingness Proportion Matrix To create clusters based on proportion of missingness across all timepoints for a given feature, you can provide a pre-computed missingness proportion matrix (by using `create_missingness_prop_matrix()` or manually) directly to `run_cissvae()`: ```{r useprecompmissingprop, eval=FALSE, echo = TRUE} ## Standardize df_missing column names to feature_timepoint format colnames(df_missing) = c('index', 'Age', 'Salary', 'ZipCode10001', 'ZipCode20002', 'ZipCode30003', 'Y1_1', 'Y1_2', 'Y1_3', 'Y1_4', 'Y1_5', 'Y2_1', 'Y2_2', 'Y2_3', 'Y2_4', 'Y2_5', 'Y3_1', 'Y3_2', 'Y3_3', 'Y3_4', 'Y3_5', 'Y4_1', 'Y4_2', 'Y4_3', 'Y4_4', 'Y4_5', 'Y5_1', 'Y5_2', 'Y5_3', 'Y5_4', 'Y5_5') # Create and examine missingness proportion matrix prop_matrix <- create_missingness_prop_matrix(df_missing, index_col = "index", cols_ignore = c('Age', 'Salary', 'ZipCode10001', 'ZipCode20002', 'ZipCode30003'), repeat_feature_names = c("Y1", "Y2", "Y3", "Y4", "Y5")) cat("Missingness proportion matrix dimensions:\n") cat(dim(prop_matrix), "\n") cat("Sample of proportion matrix:\n") print(head(prop_matrix[, 1:5])) # Use proportion matrix with scaling for better clustering advanced_result <- run_cissvae( data = df_missing, index_col = "index", clusters = NULL, # Let function cluster using prop_matrix columns_ignore = c('Age', 'Salary', 'ZipCode10001', 'ZipCode20002', 'ZipCode30003'), missingness_proportion_matrix = prop_matrix, scale_features = TRUE, # Standardize features before clustering n_clusters = 4, leiden_resolution = 0.1, epochs = 5, return_history = TRUE, return_silhouettes = TRUE, device = "cpu", verbose = FALSE, return_clusters = TRUE ) print("Clustering quality:") print(paste("Silhouette score:", round(advanced_result$silhouette_width, 3))) ## Plotting imputation loss by epoch ggplot2::ggplot(data = advanced_result$training_history, aes(x = epoch, y = imputation_error)) + geom_point() + labs(y = "Imputation Loss", x = "Epoch") + theme_classic() ``` ```{r showres3, eval=TRUE, echo=FALSE} colnames(df_missing) = c('index', 'Age', 'Salary', 'ZipCode10001', 'ZipCode20002', 'ZipCode30003', 'Y1_1', 'Y1_2', 'Y1_3', 'Y1_4', 'Y1_5', 'Y2_1', 'Y2_2', 'Y2_3', 'Y2_4', 'Y2_5', 'Y3_1', 'Y3_2', 'Y3_3', 'Y3_4', 'Y3_5', 'Y4_1', 'Y4_2', 'Y4_3', 'Y4_4', 'Y4_5', 'Y5_1', 'Y5_2', 'Y5_3', 'Y5_4', 'Y5_5') # Create and examine missingness proportion matrix prop_matrix <- create_missingness_prop_matrix(df_missing, index_col = "index", cols_ignore = c('Age', 'Salary', 'ZipCode10001', 'ZipCode20002', 'ZipCode30003'), repeat_feature_names = c("Y1", "Y2", "Y3", "Y4", "Y5")) cat("Missingness proportion matrix dimensions:\n") cat(dim(prop_matrix), "\n") cat("Sample of proportion matrix:\n") print(head(prop_matrix[, 1:5])) ``` ```{r showres4, echo=FALSE} advanced_result = readRDS(system.file("extdata", "missing_prop.rds", package = "rCISSVAE")) # ar$silhouette_width # [1] 0.6557173 cat("Clustering quality:\n") cat(paste("Silhouette score:", round(0.6557173, 3)), "\n") ggplot2::ggplot(data = advanced_result, aes(x = epoch, y = imputation_error)) + geom_point() + labs(y = "Imputation Loss", x = "Epoch") + theme_classic() ``` ## Advanced Hyperparameter Optimization with Autotune ### Understanding Parameter Types For hyperparameter optimization in `autotune_cissvae()`, parameters can be specified as: - **Fixed value**: `beta = 0.01` → parameter remains constant across trials - **Categorical choice**: `c(64, 128, 256)` → Optuna selects from the provided options - **Float range**: `reticulate::tuple(1e-4, 1e-3)` → Optuna suggests floats in the specified range #### Layer Placement Strategies The layer arrangement strategies control how shared and unshared layers are positioned: - **`"at_end"`**: Places shared layers at the end of the encoder or start of the decoder - **`"at_start"`**: Places shared layers at the start of the encoder or end of the decoder - **`"alternating"`**: Distributes shared layers evenly throughout the architecture - **`"random"`**: Uses random placement of shared layers (with reproducible seed) ```{r eval=FALSE} library(tidyverse) library(reticulate) library(rCISSVAE) reticulate::use_virtualenv("./cissvae_environment", required = TRUE) data(df_missing) data(clusters) aut <- autotune_cissvae( data = df_missing, index_col = "index", clusters = clusters$clusters, save_model_path = NULL, save_search_space_path = NULL, n_trials = 3, ## Using low number of trials for demo study_name = "comprehensive_vae_autotune", device_preference = "cpu", show_progress = FALSE, # Set true for Rich progress bars with training visualization optuna_dashboard_db = "sqlite:///optuna_study.db", # Save results to database load_if_exists = FALSE, ## Set true to load and continue study if it exists seed = 42, verbose = FALSE, # Search strategy options constant_layer_size = FALSE, # Allow different sizes per layer evaluate_all_orders = FALSE, # Sample layer arrangements efficiently max_exhaustive_orders = 100, # Limit for exhaustive search ## Hyperparameter search space num_hidden_layers = c(2, 5), # Try 2-5 hidden layers hidden_dims = c(64, 512), # Layer sizes from 64 to 512 latent_dim = c(10, 100), # Latent dimension range latent_shared = c(TRUE, FALSE), output_shared = c(TRUE, FALSE), lr = 0.01, # Learning rate range decay_factor = 0.99, beta = 0.01, # KL weight range num_epochs = 500, # Fixed epochs for demo batch_size = c(1000, 4000), # Batch size options num_shared_encode = c(0, 1, 2, 3), num_shared_decode = c(0, 1, 2, 3), # Layer placement strategies - try different arrangements encoder_shared_placement = c("at_end", "at_start", "alternating", "random"), decoder_shared_placement = c("at_start", "at_end", "alternating", "random"), refit_patience = 2, # Early stopping patience refit_loops = 100, # Fixed refit loops epochs_per_loop = 100, # Epochs per refit loop reset_lr_refit = c(TRUE, FALSE) ) # Analyze results imputed <- aut$imputed best_model <- aut$model study <- aut$study results <- aut$results # View best hyperparameters print("Trial results:") results %>% kable() %>% kable_styling(font_size=12) # Plot model architecture plot_vae_architecture(best_model, title = "Optimized CISSVAE Architecture") ``` ```{r showres5, echo=FALSE} results = readRDS(system.file("extdata", "autotune.rds", package = "rCISSVAE")) # View best hyperparameters print("Trial results:") results %>% kableExtra::kable() %>% kableExtra::kable_styling(font_size=12) ``` ![Autotuned VAE Architecture](autotune_vae_arch.png) For more information on using optuna dashboard, see the [Optuna Dashboard tutorial](https://ciss-vae.github.io/rCISS-VAE/articles/optunadb.html)