--- title: "slideimp" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{slideimp} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(slideimp) ``` ## Choosing between K-NN and PCA imputation Let's use the included `khanmiss1` dataset. ```{r} data(khanmiss1) obj <- t(khanmiss1) set.seed(1234) ``` Instead of randomly placing `NA` values, we fix the locations of `NA` values in advance for cross-validation. We will select 30 random genes with 5 random missing values each and repeat 10 times to compare the imputation accuracy between `knn_imp()` and `pca_imp()`. The `na_location` object is a list of 10 elements (one per repetition), where each element contains the 2D coordinates (row and column) of values that will be set to missing and then re-imputed by the K-NN/PCA imputation functions to calculate imputation metrics. ```{r} n_genes <- 30 n_samples <- 5 n_reps <- 10 na_location <- replicate( n = n_reps, { chosen_genes <- sample.int(ncol(obj), size = n_genes) values <- lapply( chosen_genes, \(x) { chosen_samples <- sample.int(nrow(obj), size = n_samples) matrix(c(chosen_samples, rep(x, n_samples)), ncol = 2, dimnames = list(NULL, c("row", "col"))) } ) do.call(rbind, values) }, simplify = FALSE ) na_location[[1]][1:10, ] ``` We proceed to measure the accuracy of the imputation using "good" default parameters for K-NN (`k = 10`) and PCA (`ncp = 5`). For PCA imputation, we leverage the `{mirai}` package for parallelization while the `cores` argument control parallelization for K-NN. ```{r} tune_knn <- tune_imp(obj, parameters = data.frame(k = 10), rep = na_location, cores = 4) tune_pca <- tune_imp(obj, parameters = data.frame(ncp = 5), rep = na_location) ``` In this case, PCA imputation performs better. Since the data only needs to be imputed once and the run time is so low, using PCA imputation for this dataset is preferable. ```{r} rmse_knn <- mean(compute_metrics(tune_knn, metrics = "rmse")$.estimate) rmse_pca <- mean(compute_metrics(tune_pca, metrics = "rmse")$.estimate) c("knn" = rmse_knn, "pca" = rmse_pca) ``` ## Subset and grouped K-NN and PCA imputation Some DNA methylation analyses only focus on epigenetic clocks, which require only a subset of CpGs to be imputed. The `knn_imp()` function takes advantage of this to significantly reduce run time. To impute only specific CpGs for K-NN imputation and make use of grouped imputation we can use the `group_features()` function. Here, we simulate data where only a subset of CpGs are needed for imputation. PCA imputation does not benefit from subset imputation, but does benefit from grouped imputation. ```{r} # 500 features, 50 samples, 2 chromosomes sim_obj <- sim_mat(n = 500, m = 50, nchr = 2) # sim_mat returns matrix with features in rows, so we use t() to put features in columns obj <- t(sim_obj$input) # `group_feature` contains metadata for the simulated data head(sim_obj$group_feature) n_needed <- 100 # Randomly select CpGs that are needed for clock calculation subset_cpgs <- sample(colnames(obj), size = n_needed) ``` We define the `features` list-column as a list of character vectors containing features to be imputed. Elements of the `features` list-column are subsets of the `subset_cpgs` character vector. These CpGs belong to either `chr1` or `chr2`, so we split them into two groups, resulting in two rows of the `group_df` tibble. The `aux` list-column now contains the remaining CpGs from either `chr1` or `chr2` that are not imputed themselves but are used to increase the accuracy of imputing the CpGs in the `features` list-column. ```{r} # Construct the group data.frame with one row per group (i.e., chromosome) group_df <- group_features(obj, sim_obj$group_feature, subset = subset_cpgs) group_df ``` For demonstration purposes, we use the `parameters` column to add group-specific parameters. This can be useful if hyperparameters are tuned separately for each group, though this is typically not necessary. When `k` or `ncp` is too large for some groups, we can use the `parameters` column to specify group-specific `k` or `ncp` values. `k` and `ncp` can also be specified in `group_features()`. ```{r} pca_group <- group_df # For PCA imputation, we use ncp = 5 for the first group and ncp = 10 for the second group pca_group$parameters <- list(data.frame(ncp = 5), data.frame(ncp = 10)) # For K-NN imputation, we use k = 5 for both groups knn_group <- group_df knn_group$parameters <- list(data.frame(k = 5), data.frame(k = 5)) pca_group knn_group ``` K-NN subset imputation is very fast. ```{r} system.time( knn_results <- group_imp(obj, group = knn_group, cores = 4) ) system.time( pca_results <- group_imp(obj, group = pca_group) ) ``` ## Sliding Window K-NN and PCA imputation We simulate the output of the `{methylKit}` package. **Note:** WGBS/EM-seq data should be grouped by chromosome before performing sliding window imputation. ```{r} sample_names <- paste0("S", 1:10) positions <- 500 methyl <- tibble::tibble( chr = "chr1", start = seq_len(positions), end = start, strand = "+" ) set.seed(1234) for (i in seq_along(sample_names)) { methyl[[paste0("numCs", i)]] <- sample.int(100, size = positions, replace = TRUE) methyl[[paste0("numTs", i)]] <- sample.int(100, size = positions, replace = TRUE) methyl[[paste0("coverage", i)]] <- methyl[[paste0("numCs", i)]] + methyl[[paste0("numTs", i)]] } methyl[1:5, 1:10] ``` First, we convert the data into a beta matrix with features in the columns. **Important:** We sort the `methyl` object by position and name the CpGs by their genomic position. This is important because `slide_imp` imputes the beta matrix column-wise in windows so the order of the column has to be meaningful. ```{r} methyl <- methyl[order(methyl$start), ] # <---- Important, sort! numCs_matrix <- as.matrix(methyl[, paste0("numCs", seq_along(sample_names))]) cov_matrix <- as.matrix(methyl[, paste0("coverage", seq_along(sample_names))]) beta_matrix <- numCs_matrix / cov_matrix colnames(beta_matrix) <- sample_names rownames(beta_matrix) <- methyl$start beta_matrix <- t(beta_matrix) # Set 10% of the data to missing set.seed(1234) beta_matrix[sample.int(length(beta_matrix), floor(length(beta_matrix) * 0.1))] <- NA beta_matrix[1:5, 1:5] ``` This example dataset is small, but in actual analyses with millions of CpGs, we can tune the K-NN or PCA imputation on just the first tens of thousands of CpGs. Here, we perform K-NN imputation by specifying `k`, a window size (`n_feat`) of 100, and a 10-CpG overlap (`n_overlap`) between windows. ```{r} # For example, we are tuning `k` value here. slide_knn_params <- tibble::tibble(k = c(10, 20), n_feat = 100, n_overlap = 10) # Increase rep from 2 in actual analyses tune_slide_knn <- tune_imp( obj = beta_matrix, parameters = slide_knn_params, cores = 4, rep = 2 ) compute_metrics(tune_slide_knn) ``` Finally, we can use `slide_imp()` to impute the `beta_matrix`. ```{r} imputed_beta <- slide_imp(obj = beta_matrix, n_feat = 100, n_overlap = 10, k = 10, cores = 4) imputed_beta ```