--- title: "Inferring a Level-1 network with the NANUQ+ routines" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Inferring a Level-1 network with the NANUQplus routine} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} bibliography: vigREFS.bib --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>") knitr::opts_chunk$set(fig.align = "center", fig.show = "hold", out.width = "55%", fig.width = 7, fig.height = 6) ``` ```{r setup} library(MSCquartets) ``` ## Why use NANUQ$^+$ routines? NANUQ$^+$ is a collection of routines for inference of level-1 network features under the network multispecies coalescent (NMSC) model. Notably, they do not assume that the entire network is level-1. Taking as input a collection of unrooted topological gene trees, output is an unrooted (semidirected) topological level-1 network that captures species relationships, displaying cycles and cut edges (tree-like network features). Complex blobs (those that are not cycles) may be left as multifurcations of degree greater than 3 (polytomies) in this structure. The initial step infers the unknown network's tree of blobs, using the function `TINNIK` (described in another vignette). This tree partially represents species relationships, showing only cut edges and contracting all reticulation cycles to multifurcations. Multifurcations suspected to arise from cycles can then be resolved into optimal cycle structures using `resolveCycle`. This resolution relies on a least-squares approach, comparing an empirical NANUQ distance among taxon groups around a multifurcation to expected distances for possible cycle configurations. For cycles under a default size of 10, a full search is fast, while larger cycles are processed using a quartet-based heuristic. The user may combine cycle resolutions for some or all of the individual multifurcations with `combineCycle`. Blobs that are not suspected to arise from cycles can be left unresolved. When all multifurcations are believed to be cycle-derived, `resolveLevel1` both resolves all multifurcations and combines the resolutions into fully-resolved level-1 networks. We recommend using the `NANUQ` function before resolving any multifurcations, to assess which ones may have resulted from network cycles. For a level-1 network, `NANUQ` generates a splits graph with distinctive structural features called 'darts'. If the level-1 assumption is violated, this structural pattern is disrupted, providing insight into whether resolving blobs as cycles is warranted. For blobs that are cycles, all procedures here provide statistically consistent estimates of network structure under the NMSC model. ## Preparing the input data The NANUQ$^+$ routines take as input a collection of gene trees for a set of taxa. From multilocus sequence data, standard phylogenetic tools must first be used to infer unrooted topological trees for each gene. Although these trees are not raw data, they are treated as such. If the gene trees are rooted or contain edge lengths, that information is simply ignored in these routines. We work with an example data set of 16,338 gene trees of 16 Leopardus species, supplied by @Lescroart2023 and analyzed with NANUQ$^+$ by @ABRW24. Gene trees in Newick format can be read from a text file and their displayed quartet trees counted with commands such as: ```{r, eval=FALSE} # read text file of gene trees and count quartets on them gts<-read.tree(file = 'genetreefile') tableLeopardusLescroart=quartetTable(gts) ``` We will use a table of this data supplied within *MSCquartets*, pre-computed by the above commands. Both `TINNIK` and `NANUQ` can take as input either the gene tree file or the quartet table. ```{r} # load data file containing quartet counts for Leopardus data set supplied with MSCquartets package # These counts are will be accessed as `tableLeopardusLescroart`. data(tableLeopardusLescroart) ``` #### Notes: All functions used here allow gene trees with missing taxa, provided each subset of four taxa appears on at least one tree (ideally, on many). Statistical tests are conducted for each of these subsets, with the sample size being the number of trees displaying the subset. The good statistical behavior of the analysis below is established under the assumption that the input gene trees are a true sample from the NMSC model. In practice, inferred gene trees, containing some error, are used instead. Poorly inferred gene trees or those lacking sufficient resolution can compromise results. ## Step 1: Inferring the tree of blobs The function `TINNIK` can be used to infer the tree of blobs for an unknown network. This first applies two hypothesis tests for each choice of 4 taxa, testing whether the gene tree counts allow for the rejection of a resolved quartet tree or a star tree for those species. (See `TINNIK`'s vignette or [@ABMR24] for more information.) ```{r} # perform TINNIK analysis for gene trees, using defaults output<-TINNIK(tableLeopardusLescroart) # save table of quartet information with p-values pT<-output$pTable ``` The default `TINNIK` test levels are $\alpha=.05$ for the null hypothesis "the quartet has a tree-like relationship", and $\beta=.95$ for the null hypothesis "the quartet has a star-like relationship". An initial run of `TINNIK` with default test levels is rarely a sufficient analysis. One should always vary the $\alpha$ and $\beta$ to judge robustness of the inferred tree of blobs to their choice. As detailed by @ABRW24, for this data we choose $\alpha=5e-29$, imposing a stringent test for judging non-tree-likeness. Note in the first plot below this gives a clear separation between red (quartets interpreted as 4-blobs) and blue (quartets interpreted as tree-like) symbols, allowing for noise in the gene trees without rejecting tree-like-ness too strongly. For this dataset and test levels we obtained a tree of blobs with two multifurcations, both of degree five. ```{r} # perform improved TINNIK analysis to infer the tree of blobs output<-TINNIK(pT, alpha=5e-29,beta = 0.95) ``` Output produced by `TINNIK` includes the table of quartet counts and associated test p-values and the tree of blobs. We rename these to use as input in further functions. ```{r} # run TINNIK to infer the tree of blobs output<-TINNIK(pT, alpha=5e-29,beta = 0.95) ``` ```{r} # rename output pT<-output$pTable #quartet count data with p-values for tests ToB<-output$ToB #the TINNIK tree of blobs ``` ## Using `NANUQ` to explore whether multifurcations should be resolved into cycles The `NANUQ` function can help detect whether a blob might be a cycle. On it's own, `NANUQ` is a statistically consistent level-1 network inference algorithm under the NMSC model. It takes the same input as TINNIK (a set of gene trees or a quartet table, along with two test levels) and produces, among other outputs, a distance measure that can be used as input to the `neighborNet` algorithm (as implemented in the `phangorn` package) to generate a splits network with specific structural properties. When the model is violated, this structure is disrupted, providing insight into potential violations of the level-1 assumption. Further details on NANUQ and its applications are given by @ABR19 and @RBMA2020. We recommend using the same test levels but encourage testing alternative levels as well. ```{r} # perform NANUQ analysis for table of quartet information and p-values D<-NANUQ(pT, alpha = 5e-29,beta = 0.95) # Run the NANUQ routine NN<-neighborNet(D$dist) # Run the NeighborNet algorithm on the NANUQ distance plot(NN) # plot the splits-graph with neighborNet ``` As in [@ABRW24], we identified one multifurcation that aligns well with the model hypothesis, displaying a characteristic dart-like structure, and another that does not. We cannot determine whether the latter multifurcation results from a model violation (presumably level-1-ness) or simply noise. Nevertheless, this indicates caution is needed in interpreting findings associated with this multifurcation. #### Note: We used the NeighborNet implementation `neighborNet` in the `phangorn` package here, to do the full analysis within R. However, we recommend using SplitsTree [@SplitsTree] for the splits graph, as it has been more thoroughly tested. For more details, see [@RBMA2020]. ## Step 2: Resolving multifurcations When resolving the multifurcations, the first step is to numerically label all internal nodes in the tree of blobs, so that we and the code can refer to individual multifurcations. ```{r} #Label internal nodes of the tree of blobs, and plot ToB<-labelIntNodes(ToB) ``` Here the multifurcations are nodes 18 and 20. We will resolve each one independently using the function `resolveCycle`. While the same test levels as in TINNIK can be used, we suggest experimenting with different levels. ```{r} # resolve node 18 resC18<-resolveCycle(ToB,18,pT,alpha=5e-29,beta=0.95) ``` The output `resC18` is a list that includes a network displaying the best resolution of node 18, with the other multifurcation remaining unresolved. Other information in this list is needed for combining resolutions of different multifurcations, or producing the histogram of residuals for all possible resolutions. ```{r} # resolve node 2O resC20<-resolveCycle(ToB,20,pT,alpha=5e-29,beta=0.95) ``` Note multifurcation 20 produced five optimal resolutions (that is, those 5 resolutions have residual sums of squares that are within a small tolerance). This suggests that this multifurcation may not arise from a cycle, or that the data has too much noise to be sure. Alternatively, if one suspects that all multifurcations arise from a cycle, one could attempt to resolve all multifurcations to obtain a level-1 network. ```{r} #Fully resolve the tree of blobs to a level-1 network resN<-resolveLevel1(ToB=output$ToB, pTable=output$pTable, alpha=5e-29, beta=0.95, distance="NANUQ") ``` The output `resN` includes, as its first element, a list of the 5 level-1 networks that accommodate all the optimal resolutions that can co-occur. #### Note: In reporting any NANUQ$^+$ analysis, it is essential to report the test levels used, and, as mentioned throughout, we strongly recommend exploring with values beyond the defaults. ## Estimating numerical parameters and further comparing multiple optimal resolutions NANUQ$^+$ yielded 5 fully-resolved topological networks for this data analysis. One can use the Julia package *PhyloNetworks* [@PhyloNetworks] to optimize numerical parameters (edge lengths and hybridization parameters) on these under a quartet pseudo-likelihood criterion, and also use the pseudo-likelihood score to further compare these networks. Note, however, that comparing networks by psuedo-likelihood is a compromise due to the difficulty of computing full likelihoods. The commands below (based on 2024 *PhyloNetworks*) provide an example of this. Note that this is Julia code and not R code, and it requires additional installation of software. ````{verbatim} using PhyloNetworks #Load package gts = readMultiTopology("genetreefile") #load gene trees to Julia q, t = countquartetsintrees(gts; showprogressbar = true) #compute the quartet counts global df = writeTableCF(q, t) # to get a DataFrame that can be saved to a file later global empCFs = readTableCF(df) ntwk = readMultiTopology("Network.nwk", false) # read the topology of the newtwok to be optimized net = topologyMaxQPseudolik!(ntwk,empCFs) # optimize network parameters to obtain the pseudo-likelihood score ```` #### Note: Inferring numerical parameters is only suggested for networks that are fully resolved, such as those from the function `resolveLevel1`. Attempted optimization on networks that are partially resolved, where some multifurcations represent unknown blob structures, may lead to erroneous inference. ## References