---
title: Demonstrate Pv3Rs usage
output: rmarkdown::html_vignette
description: Demonstrates Pv3Rs workflow for a single study participant
vignette: >
  %\VignetteIndexEntry{Demonstrate Pv3Rs usage}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = 'center',
  fig.width = 7,
  cache = FALSE
)
```

```{r setup, echo = FALSE}
library(Pv3Rs)
```

This vignette demonstrates the basic `Pv3Rs` workflow for a single study participant:

1) Plot data
2) Compute recurrence state posterior probabilities
3) Plot per-recurrence probabilities

We also show how to explore relationship graphs and their log likelihoods. For a general understanding of posterior probabilities, see [Understand posterior probabilities](https://aimeertaylor.github.io/Pv3Rs/articles/posterior-probabilities.html).

A tutorial demonstrating the `Pv3Rs` workflow at the study level is planned; it will cover

- Generating pairwise probabilities when the total genotype count over multiple recurrences exceeds eight
- Computing false discovery rates
- Sensitivity analysis for genotyping errors
- Sensitivity analysis for sibling misspecification
- An alternative approach using genetic distance and whole-genome sequence data
- Using posterior probabilities (rather than categorical classifications) in downstream analyses
- Estimating false negative rates

## Basic workflow

We begin with a synthetic example of three episodes (episode names are optional) with three markers (marker names are obligatory) whose alleles have known frequencies, `fs`.

```{r}
y <- list("Enrolment" = list(m1 = c('b','c','d'), m2 = c('a','b'), m3 = c('b','c','d')),
          "Recurrence 1" = list(m1 = c('b','d'), m2 = c('a'), m3 = c('a','b')),
          "Recurrence 2" = list(m1 = c('d'), m2 = c('a'), m3 = c('a')))

fs <- list(m1 = c(a = 0.27, b = 0.35, c = 0.18, d = 0.20),
           m2 = c(a = 0.78, b = 0.14, c = 0.07, d = 0.01),
           m3 = c(a = 0.21, b = 0.45, c = 0.26, d = 0.08))
```

### 1) Plot data

We plot the data using `plot_data()`.

```{r, fig.align='center', fig.height=6}
plot_data(ys = list("Participant data" = y), fs = fs)
```

For each marker, different colours represent different alleles. The legend below the main grid shows per-marker allele frequencies via colour proportions, with one row per marker, ordered as in the main grid; e.g., `d` at `m2` is rare.

The most parsimonious MOI estimates compatible with the above data, `r determine_MOIs(y)`, are computed using `determine_MOIs()`. The total genotype count (sum over MOIs) is `r sum(determine_MOIs(y))`.

**Aside** In this synthetic example, markers are tetra-allelic (each has four possible alleles), imposing low upper bounds on MOI estimates, which are based on maximum per-marker allele counts. More diverse markers are recommended for MOI estimation and recurrence state inference.

### 2) Compute recurrence state posterior probabilities

The bulk of the computation time is spent computing the log-likelihoods of graphs of relationships between genotypes. The number of graphs depends on the MOIs. By default, `compute_posterior()` uses MOI estimates generated by `determine_MOIs()`.

```{r, cache=TRUE}
post <- compute_posterior(y, fs)
```

**Aside** We do not recommend running `compute_posterior()` for data whose total genotype count (sum over MOIs) exceeds eight. That said, we have not encoded a hard limit.
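The total genotype count is easy to check before calling `compute_posterior()`; a minimal pre-check using `determine_MOIs()`, as introduced above:

```{r}
# Total genotype count: the sum of the per-episode MOI estimates; we recommend
# eight or fewer before calling compute_posterior()
sum(determine_MOIs(y))
```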
In our experience, it is possible, albeit very slow, to generate posterior probabilities using data with a total genotype count of up to 10; above 10, calls to `compute_posterior()` are liable to cause memory-use problems and fail.

In the call to `compute_posterior()` above, we did not specify a prior, and so by default all three recurrence states were assumed equally likely per recurrence. Posterior probabilities of recurrence state sequences (where `C` is recrudescence, `L` is relapse, and `I` is reinfection) are stored in `post$joint`. Here, we find the most likely sequence of recurrence states is `r names(which.max(post$joint))`, with posterior probability `r round(post$joint[which.max(post$joint)],4)`:

```{r}
sort(post$joint, decreasing = TRUE)
```

Per-recurrence posterior probabilities of recrudescence `C`, relapse `L`, and reinfection `I` are stored in `post$marg`.

```{r}
post$marg
```

We refer to per-recurrence probabilities as `marg` (shorthand for marginal) because they are computed by simply summing over state sequence probabilities. For example, the probability of `L` at `Recurrence 1` above is the sum of the probabilities of `LC`, `LL` and `LI`:

```{r}
post$joint["LC"] + post$joint["LL"] + post$joint["LI"]
```

### 3) Plot per-recurrence probabilities

Per-recurrence posterior probabilities are plotted on the simplex using `plot_simplex()`.

```{r}
oldpar <- par(no.readonly = TRUE)
par(mar = c(0, 0, 0, 0))
plot_simplex(p.coords = rbind(post$marg, Prior = rep(1/3, 3)), pch = 20)
par(oldpar)
```

The point in the yellow region is most likely a recrudescence, with posterior probability greater than 0.5 (it falls in the bright yellow region); the point in the red region is most likely a reinfection, with posterior probability greater than 0.5 (it falls in the bright red region).

## Exploration of relationship graphs

To explore relationship graphs (RGs) and their log-likelihoods, set `return.RG` and `return.logp` to `TRUE`.

```{r}
post <- compute_posterior(y, fs, return.RG = TRUE, return.logp = TRUE)
```

We recover the same posterior as before.

```{r}
sort(post$joint, decreasing = TRUE)
```

But the compute time was longer. When computing the posterior, summations over per-marker allelic assignments that are equivalent up to within-episode genotype permutations are redundant; as such, by default, `compute_posterior()` does not sum over them, conserving both memory and compute time. Exploiting this permutation symmetry requires a scheme to choose a single representative among permutations that are otherwise equivalent. To compute meaningful graph likelihood values (values that do not depend on the representative-choosing scheme), all permutations are summed over when `return.logp = TRUE`, increasing compute time, especially when MOIs are large. Also, when user-specified MOIs exceed those of `determine_MOIs()`, all permutations are summed over because the representative-choosing scheme becomes too complicated.

The log-likelihood of each relationship graph is returned.
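The graphs themselves are stored in `post$RGs` (used in the code below). As a quick check of the size of the graph space, we can count them:

```{r}
# Number of relationship graphs enumerated for these MOI estimates
length(post$RGs)
```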
We plot the relationship graph(s) with the largest likelihood. In this example, there are two graphs with maximum likelihood; they are isomorphic up to within-episode genotype permutations.

```{r}
# Extract all log likelihoods
llikes <- sapply(post$RGs, function(RG) RG$logp)

# Get the maximum log likelihood
mllikes <- max(llikes)

# Extract the relationship graphs (RGs) with the maximum log likelihood
RGs <- post$RGs[which(abs(llikes - mllikes) < .Machine$double.eps^0.5)]

# Plot the RGs with the maximum log likelihood
oldpar <- par(no.readonly = TRUE) # Store user's options
par(mar = rep(0.1, 4), mfrow = c(1, 2))
for (i in seq_along(RGs)) {
  plot_RG(RG_to_igraph(RGs[[i]], determine_MOIs(y)), vertex.size = 20)
  box()
}

# Add a legend
legend("bottomright", pch = 21,
       pt.bg = RColorBrewer::brewer.pal(n = 8, "Set2")[1:length(y)],
       bty = "n", legend = names(y), title = "Episode")
par(oldpar) # Restore user's options
```

Using log likelihoods, we can also find the equivalence class for which the data are most probable when all relationship graphs in that class are summed over.

```{r}
# In the following code, we place two graphs in the same equivalence class if
# they share the same likelihood. This is not ideal (two graphs that are not
# isomorphic up to permutation could share the same likelihood), but it works
# here: the plot shows only isomorphic graphs within the equivalence class.
sorted_llikes <- sort(llikes, decreasing = TRUE) # Sort log likelihoods
adj_equal <- abs(diff(sorted_llikes, lag = 1)) < .Machine$double.eps^0.5 # Find matches
decr_idxs <- which(!adj_equal) # Change points: 2, 8, 14, 20, 32, ...
class_sizes <- c(decr_idxs[1], diff(decr_idxs)) # Number of graphs per class

# Log likelihood of a representative from each 'equivalence class' (EC)
llikes_unique <- sorted_llikes[decr_idxs]

# EC likelihood: per-graph likelihood times the number of graphs in the class
class_ps <- exp(llikes_unique) * class_sizes

max_class_p <- which(class_ps == max(class_ps)) # ML EC index
max_idx <- decr_idxs[max_class_p] # Index of the last graph in the ML EC
max_size <- class_sizes[max_class_p] # Number of graphs in the ML EC

# Plot all graphs within the ML EC
oldpar <- par(no.readonly = TRUE) # Store user's options
par(mar = rep(0.1, 4), mfrow = c(3, 4))
RG_order <- order(llikes, decreasing = TRUE) # Order RGs by log likelihood
for (i in (max_idx - max_size + 1):max_idx) { # The ML EC consists of the RGs with log-likelihood ranks 21-32
  RG <- post$RGs[[RG_order[i]]]
  RG_igraph <- RG_to_igraph(RG, determine_MOIs(y))
  plot_RG(RG_igraph, vertex.size = 25, vertex.label = NA)
  box()
}
par(oldpar) # Restore user's options
```

**Important considerations**

- The maximum-likelihood graph(s) might not be in the equivalence class for which the data are most probable when all relationship graphs in that class are summed over (true of the above example; a quick numerical check follows this list).
- The maximum-likelihood graph(s) might be incompatible with the maximum-posterior state sequence (not true of the above example; recall the most-likely sequence was `r names(which.max(post$joint))`).
- The graphs in the maximum-likelihood equivalence class might be incompatible with the maximum-posterior state sequence (true of the above example; the class with the largest likelihood contains graphs with sibling edges that are incompatible with `r names(which.max(post$joint))`).
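To illustrate the first consideration numerically, we can compare the summed likelihood of the class containing the maximum-likelihood graph(s) with that of the most probable equivalence class, using the objects computed above. Because classes are ordered by decreasing per-graph log likelihood, the first class is the one containing the maximum-likelihood graph(s):

```{r}
# Summed likelihood of the class containing the maximum-likelihood graph(s)
class_ps[1]

# Summed likelihood of the most probable equivalence class; in this example it
# comes from a different class (see the first consideration above)
max(class_ps)
```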