--- title: "Optimization of variant calling" author: "Miguel Camacho-Sanchez" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{tidyGenR-optimization} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` ```{r options, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` The function `variant_call()` initiates a workflow of multiple functions from *DADA2.* Initially, _DADA2_ was developed for denoising metabarcoding reads from environmental samples where a high number of ASVs from different species are expected. Thus, default *DADA2* parameters aim to maximize true positives while minimizing false positives (artifacts) and false negatives (true variants classified as artifacts) in metagenomic samples (Callahan et al., 2016; Rosen et al., 2012). However, *tidyGenR* will likely be used in many cases for determining amplicon variants in diploid species, as in vertebrates. For diploids, a maximum of two variants per locus are expected. This scenario justifies exploring how "child" variants are classified by _DADA2_. `birth_pval` is the probability of variant being classified in its parent group given the error model. When `birth_pval` \< `OMEGA_A` the child variant is promoted to its own cluster. The pior knowledge of a maximum of 2 alleles per locus for diploids can be used to explore `OMEGA_A` and read frequency thresholds. These can be edited to maximize the number of "true" variants (true positives) while minimizing the risk of calling artifacts as real alleles. ```{r setup} library(tidyGenR) ``` # explore_dada() The function `explore_dada()` permits to run *DADA2* under testing parameters. It is recommended to run it with a high value of 'OMEGA_A'. Under such a high threshold most variants will have a lower `birth_pval` and will form their own cluster. This allows the user to visually determine what thresholds will work best for their given dataset. ```{r, results = 'hide'} fq <- list.files(system.file("extdata", "truncated", package = "tidyGenR"), pattern = "F_filt.fastq.gz", full.names = TRUE) ex_d <- explore_dada(fq, value_na = 10, reduced = TRUE, pool = FALSE, vline = 2, hline_fr = 0.1, omega_a = 0.9, band_size = 16 ) ``` ```{r} ex_d$p3 ``` In this case, the minimum `log(-log(birth_pval))` for the less frequent allele is around 4, meaning that a threshold of `log(-log(birth_pval)) < 4` would be necessary to recover all variants in the sequence dataset. A `log(-log(birth_pval))` of 4 implies a `birth_pval` of approximately\ $\exp(-\exp(4)) \approx 1.9 \times 10^{-24}$,\ indicating an extremely low probability that such a variant arises as an error from its parent sequence under the *DADA2* error model. # compare_calls() Variant calls with different parameters can be compared with `compare_calls()`. For this example, two runs with different thresholds of `OMEGA_A` are compared. From the `explore_dada()` results above, it can be concluded that a `log(-log(birth_pval))` of 6 will leave out some variants for `chrna9` and `nfkbia`. ```{r, results='hide'} # path to fastq truncated <- system.file("extdata", "truncated", package = "tidyGenR") # variant calling omega_a_test <- c(relaxed = 4, restrictive = 6) variants_called <- omega_a_test |> lapply(function(x) { variant_call(in_dir = truncated, omega_a_f = exp(-exp(x)), band_size = 16, maf = 0.1 ) }) # compare calls comp_call <- compare_calls(variants_called) ``` ```{r} comp_call$plot2 ``` It can be appreciated that in the restrictive call some variants are missing compared to the relaxed calling. Other outputs are produced. See `str(comp_call)` and `?compare_calls()` for more information.