---
title: "Registering data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Registering data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This article shows how to register data using different types of input, as illustrated below.

```{r reg-data, echo=FALSE, fig.align='center', out.width='100%'}
knitr::include_graphics("figures/registration_opt_diagram.png")
```

## Using data frame input

### Loading sample data

`greatR` provides an example data frame containing two different species, *A. thaliana* and *B. rapa*, with two and three replicates, respectively. This data frame can be read as follows:

```{r load-greatR, message=FALSE}
# Load the package
library(greatR)
library(data.table)
```

```{r load-brapa-data, message=FALSE, warning=FALSE}
# Load a data frame from the sample data
b_rapa_data <- system.file("extdata/brapa_arabidopsis_data.csv", package = "greatR") |>
  data.table::fread()
```

Note that the data has all five columns required by the package:

```{r brapa-data-kable-clean, eval=FALSE}
b_rapa_data |>
  knitr::kable()
```

```{r brapa-data-kable, echo=FALSE}
b_rapa_data[, .SD[1:6], by = accession][, .(gene_id, accession, timepoint, expression_value, replicate)] |>
  knitr::kable()
```

### Registering the data

To align gene expression time courses between *Arabidopsis* Col-0 and *B. rapa* Ro18, we can use the function `register()`. With the default `use_optimisation = TRUE`, `greatR` finds the best stretch and shift parameters through optimisation. For more details on the other function arguments, see the documentation for `register()`.

```{r register-data-raw, message=FALSE, warning=FALSE, eval=FALSE}
registration_results <- register(
  b_rapa_data,
  reference = "Ro18",
  query = "Col0",
  scaling_method = "z-score"
)
#> ── Validating input data ────────────────────────────────────────────────────────
#> ℹ Will process 10 genes.
#> ℹ Using estimated standard deviation, as no `exp_sd` was provided.
#> ℹ Using `scaling_method` = "z-score".
#>
#> ── Starting registration with optimisation ──────────────────────────────────────
#> ℹ Using L-BFGS-B optimisation method.
#> ℹ Using computed stretches and shifts search space limits.
#> ℹ Using `overlapping_percent` = 50% as a registration criterion.
#> ✔ Optimising registration parameters for genes (10/10) [2s]
```

When dealing with thousands of genes, users may speed up registration by setting the `num_cores` argument, which runs the registration in parallel.

```{r register-data-raw-detect-cores, message=FALSE, warning=FALSE, eval=FALSE}
parallel::detectCores()
#> [1] 8
```

```{r register-data-raw-parallel, message=FALSE, warning=FALSE, eval=FALSE}
registration_results <- register(
  b_rapa_data,
  reference = "Ro18",
  query = "Col0",
  scaling_method = "z-score",
  num_cores = 8
)
```

### Registration results

The function `register()` returns a list with S3 class `res_greatR` containing three objects:

- `data` is a data frame containing the expression data and an additional `timepoint_reg` column, which holds the registered time points obtained by applying the registration parameters to the query data.
- `model_comparison` is a data frame containing (a) the optimal stretch and shift for each `gene_id` and (b) the difference between the Bayesian Information Criterion for the separate model and for the combined model (`BIC_diff`) after applying the optimal registration parameters for each gene. If `BIC_diff < 0`, the expression dynamics of the reference and query data can be registered (`registered = TRUE`). This table is the default S3 print output.
- `fun_args` is a list of arguments used when calling the function (`reference`, `query`, `scaling_method`, ...).

To check whether a gene is registered, we can get the summary results by accessing the `model_comparison` table from the registration result.
```{r register-data, message=FALSE, warning=FALSE, include=FALSE}
# Load precomputed registration results from the sample data
registration_results <- system.file("extdata/brapa_arabidopsis_registration.rds", package = "greatR") |>
  readRDS()
```

```{r get-model-summary-data, warning=FALSE}
registration_results$model_comparison |>
  knitr::kable()
```

From the sample data above, we can see that `registered = TRUE` for nine out of ten genes, meaning that the reference and query data for those nine genes can be aligned, or registered. These data frame outputs can be further summarised and visualised; see the [processing registration results](https://ruthkr.github.io/greatR/articles/process-results.html) article.

## Using other inputs

### Loading sample data

As noted in the [input data requirements](https://ruthkr.github.io/greatR/articles/data-requirement.html) article, `register()` also accepts a list of data frames or a list of reference and query vectors as `input`:

```{r load-input-vector, message=FALSE, warning=FALSE}
# Define expression value vectors
ref_expressions <- c(1.9, 3.1, 7.8, 31.6, 33.7, 31.5, 131.4, 107.5, 116.7, 112.5, 109.7, 57.4, 50.9)
query_expressions <- c(14, 12.1, 15.9, 47, 30.9, 50.5, 80.1, 67.4, 72.9, 61.7)

list_vector <- list(
  reference = ref_expressions,
  query = query_expressions
)
```

### Registering the data

As before, the input list is registered by calling `register()`:

```{r register-data-list-vectors, message=FALSE, warning=FALSE}
registration_results_vectors <- register(
  list_vector,
  reference = "Ref",
  query = "Query",
  scaling_method = "z-score"
)
#> ── Validating input data ───────────────────────────────────────────────────────
#> ℹ Will process 1 gene.
#> ℹ Using estimated standard deviation, as no `exp_sd` was provided.
#> ℹ Using `scaling_method` = "z-score".
#>
#> ── Starting registration with optimisation ─────────────────────────────────────
#> ℹ Using L-BFGS-B optimisation method.
#> ℹ Using computed stretches and shifts search space limits.
#> ℹ Using `overlapping_percent` = 50% as a registration criterion.
#> ✔ Optimising registration parameters for genes (1/1) [170ms]
```

### Registration results

The registration result object has the same structure as when a data frame is used as input. Since no ID is explicitly defined in the input vector list, a unique `gene_id` is generated automatically for the reference and query pair.

```{r get-model-summary-data-vectors, warning=FALSE}
registration_results_vectors$model_comparison |>
  knitr::kable()
```
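
The `registered` flag in `model_comparison` can also be used programmatically, for example to list the genes that passed the `BIC_diff < 0` criterion. The following is a minimal sketch on a toy table. It assumes the columns `gene_id`, `stretch`, `shift`, `BIC_diff`, and `registered` described earlier; the gene IDs and numbers are made up for illustration and are not output from `greatR`.

```{r registered-genes-sketch}
# Toy stand-in for `registration_results$model_comparison`.
# All IDs and values below are hypothetical, not real greatR output.
toy_model_comparison <- data.frame(
  gene_id = c("gene_a", "gene_b", "gene_c"),
  stretch = c(2.0, 1.5, 2.5),
  shift = c(0.5, -1.0, 3.0),
  BIC_diff = c(-12.3, 4.1, -0.7),
  stringsAsFactors = FALSE
)

# A gene counts as registered when its BIC_diff is negative
toy_model_comparison$registered <- toy_model_comparison$BIC_diff < 0

# IDs of the registered genes and the fraction of genes registered
registered_ids <- toy_model_comparison$gene_id[toy_model_comparison$registered]
registered_fraction <- mean(toy_model_comparison$registered)

registered_ids
registered_fraction
```

The same subsetting should carry over to a real result such as `registration_results$model_comparison`, provided its columns match the ones assumed here.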