---
title: "slideimp"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{slideimp}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(slideimp)
```
## Choosing between K-NN and PCA imputation
Let's use the included `khanmiss1` dataset.
```{r}
data(khanmiss1)
obj <- t(khanmiss1)
set.seed(1234)
```
Instead of randomly placing `NA` values, we fix the locations of `NA` values in advance for cross-validation. We will select 30 random genes with 5 random missing values each and repeat 10 times to compare the imputation accuracy between `knn_imp()` and `pca_imp()`. The `na_location` object is a list of 10 elements (one per repetition), where each element contains the 2D coordinates (row and column) of values that will be set to missing and then re-imputed by the K-NN/PCA imputation functions to calculate imputation metrics.
```{r}
n_genes <- 30
n_samples <- 5
n_reps <- 10
na_location <- replicate(
n = n_reps,
{
chosen_genes <- sample.int(ncol(obj), size = n_genes)
values <- lapply(
chosen_genes, \(x) {
chosen_samples <- sample.int(nrow(obj), size = n_samples)
matrix(c(chosen_samples, rep(x, n_samples)), ncol = 2, dimnames = list(NULL, c("row", "col")))
}
)
do.call(rbind, values)
},
simplify = FALSE
)
na_location[[1]][1:10, ]
```
We proceed to measure the accuracy of the imputation using "good" default parameters for K-NN (`k = 10`) and PCA (`ncp = 5`). For PCA imputation, we leverage the `{mirai}` package for parallelization while the `cores` argument control parallelization for K-NN.
```{r}
tune_knn <- tune_imp(obj, parameters = data.frame(k = 10), rep = na_location, cores = 4)
tune_pca <- tune_imp(obj, parameters = data.frame(ncp = 5), rep = na_location)
```
In this case, PCA imputation performs better. Since the data only needs to be imputed once and the run time is so low, using PCA imputation for this dataset is preferable.
```{r}
rmse_knn <- mean(compute_metrics(tune_knn, metrics = "rmse")$.estimate)
rmse_pca <- mean(compute_metrics(tune_pca, metrics = "rmse")$.estimate)
c("knn" = rmse_knn, "pca" = rmse_pca)
```
## Subset and grouped K-NN and PCA imputation
Some DNA methylation analyses only focus on epigenetic clocks, which require only a subset of CpGs to be imputed. The `knn_imp()` function takes advantage of this to significantly reduce run time. To impute only specific CpGs for K-NN imputation and make use of grouped imputation we can use the `group_features()` function.
Here, we simulate data where only a subset of CpGs are needed for imputation. PCA imputation does not benefit from subset imputation, but does benefit from grouped imputation.
```{r}
# 500 features, 50 samples, 2 chromosomes
sim_obj <- sim_mat(n = 500, m = 50, nchr = 2)
# sim_mat returns matrix with features in rows, so we use t() to put features in columns
obj <- t(sim_obj$input)
# `group_feature` contains metadata for the simulated data
head(sim_obj$group_feature)
n_needed <- 100
# Randomly select CpGs that are needed for clock calculation
subset_cpgs <- sample(colnames(obj), size = n_needed)
```
We define the `features` list-column as a list of character vectors containing features to be imputed. Elements of the `features` list-column are subsets of the `subset_cpgs` character vector. These CpGs belong to either `chr1` or `chr2`, so we split them into two groups, resulting in two rows of the `group_df` tibble. The `aux` list-column now contains the remaining CpGs from either `chr1` or `chr2` that are not imputed themselves but are used to increase the accuracy of imputing the CpGs in the `features` list-column.
```{r}
# Construct the group data.frame with one row per group (i.e., chromosome)
group_df <- group_features(obj, sim_obj$group_feature, subset = subset_cpgs)
group_df
```
For demonstration purposes, we use the `parameters` column to add group-specific parameters. This can be useful if hyperparameters are tuned separately for each group, though this is typically not necessary. When `k` or `ncp` is too large for some groups, we can use the `parameters` column to specify group-specific `k` or `ncp` values. `k` and `ncp` can also be specified in `group_features()`.
```{r}
pca_group <- group_df
# For PCA imputation, we use ncp = 5 for the first group and ncp = 10 for the second group
pca_group$parameters <- list(data.frame(ncp = 5), data.frame(ncp = 10))
# For K-NN imputation, we use k = 5 for both groups
knn_group <- group_df
knn_group$parameters <- list(data.frame(k = 5), data.frame(k = 5))
pca_group
knn_group
```
K-NN subset imputation is very fast.
```{r}
system.time(
knn_results <- group_imp(obj, group = knn_group, cores = 4)
)
system.time(
pca_results <- group_imp(obj, group = pca_group)
)
```
## Sliding Window K-NN and PCA imputation
We simulate the output of the `{methylKit}` package. **Note:** WGBS/EM-seq data should be grouped by chromosome before performing sliding window imputation.
```{r}
sample_names <- paste0("S", 1:10)
positions <- 500
methyl <- tibble::tibble(
chr = "chr1",
start = seq_len(positions),
end = start,
strand = "+"
)
set.seed(1234)
for (i in seq_along(sample_names)) {
methyl[[paste0("numCs", i)]] <- sample.int(100, size = positions, replace = TRUE)
methyl[[paste0("numTs", i)]] <- sample.int(100, size = positions, replace = TRUE)
methyl[[paste0("coverage", i)]] <- methyl[[paste0("numCs", i)]] + methyl[[paste0("numTs", i)]]
}
methyl[1:5, 1:10]
```
First, we convert the data into a beta matrix with features in the columns. **Important:** We sort the `methyl` object by position and name the CpGs by their genomic position. This is important because `slide_imp` imputes the beta matrix column-wise in windows so the order of the column has to be meaningful.
```{r}
methyl <- methyl[order(methyl$start), ] # <---- Important, sort!
numCs_matrix <- as.matrix(methyl[, paste0("numCs", seq_along(sample_names))])
cov_matrix <- as.matrix(methyl[, paste0("coverage", seq_along(sample_names))])
beta_matrix <- numCs_matrix / cov_matrix
colnames(beta_matrix) <- sample_names
rownames(beta_matrix) <- methyl$start
beta_matrix <- t(beta_matrix)
# Set 10% of the data to missing
set.seed(1234)
beta_matrix[sample.int(length(beta_matrix), floor(length(beta_matrix) * 0.1))] <- NA
beta_matrix[1:5, 1:5]
```
This example dataset is small, but in actual analyses with millions of CpGs, we can tune the K-NN or PCA imputation on just the first tens of thousands of CpGs. Here, we perform K-NN imputation by specifying `k`, a window size (`n_feat`) of 100, and a 10-CpG overlap (`n_overlap`) between windows.
```{r}
# For example, we are tuning `k` value here.
slide_knn_params <- tibble::tibble(k = c(10, 20), n_feat = 100, n_overlap = 10)
# Increase rep from 2 in actual analyses
tune_slide_knn <- tune_imp(
obj = beta_matrix,
parameters = slide_knn_params,
cores = 4,
rep = 2
)
compute_metrics(tune_slide_knn)
```
Finally, we can use `slide_imp()` to impute the `beta_matrix`.
```{r}
imputed_beta <- slide_imp(obj = beta_matrix, n_feat = 100, n_overlap = 10, k = 10, cores = 4)
imputed_beta
```