--- title: "Comparing sets of trees from different analyses" author: "[Martin R. Smith](https://smithlabdurham.github.io/)" output: rmarkdown::html_vignette: fig_width: 6 fig_height: 4 bibliography: ../inst/REFERENCES.bib csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-old-doi-prefix.csl vignette: > %\VignetteIndexEntry{Comparing sets of trees from different analyses} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- A common application of tree space analysis is to compare the outputs of different analyses – for instance, trees obtained from different gene sequences, or results obtained using different models or methods (e.g. Bayesian, maximum likelihood, or parsimony). ## Shiny app This can be accomplished quickly using the `MapTrees()` graphical user interface: - Load trees from file: Select first tree file - Select an appropriate sample size - Select Replace existing - Load each additional set of trees from file using Add batch to existing ![Load tree batches](Batch1.svg){ width="245px" } - On the Display tab, select Point symbols: One per batch, or Colour points by: Batch ![Tree batch styles](Batch2.svg){ width=100% } ## Scripting at the R command line More control over the mapping can be obtained at the command line: ```{r generate-trees} # Load trees library("TreeTools", quietly = TRUE) batch1 <- as.phylo(1:60, 8) # Generate 60 similar trees batch2 <- as.phylo(seq(200, 800, length.out = 30), 8) # A separate batch of 30 trees styles <- c(1, 2) # Select plotting colours / symbols treeStyle <- rep(styles, c(length(batch1), length(batch2))) # Calculate distances library("TreeDist") distances <- ClusteringInfoDistance(c(batch1, batch2)) # Construct over-simple 2D PCoA mapping mapping <- cmdscale(distances, k = 2) ``` ```{r plot-mapping, fig.align = "center"} # Plot mapping par(mar = rep(0, 4)) plot(mapping, asp = 1, # Preserve aspect ratio - do not distort distances ann = FALSE, axes = FALSE, # Don't label axes: dimensions are meaningless col = treeStyle, # Colour pch = treeStyle # Plotting symbol ) legend("left", c("Batch 1", "Batch 2"), col = styles, pch = styles) ``` For more robust analyses than the (potentially misleading!) 2D plot above, consult the companion [vignette](treespace.html). Note also that mapped areas and their regions of overlap may not correspond to reality; see the warnings and recommendations in @SmithSpace. ## Comparing trees' dispersal / hypervolume Interpreting and comparing the areas of tree space from a projection can be misleading -- the expanded apparent area of Greenland under the Mercator projection being a familiar example. ![Mapping can introduce distortion](https://i0.wp.com/geoffboeing.com/wp-content/uploads/2015/08/greenland-vs-africa-mercator-projection.jpg) As such, it is always best to work with original distances when interpreting whether sets of trees occupy larger or smaller regions of tree space. ### Distances from median One approach is to plot distances from a median tree: ```{r compare-dist-from-median, fig.align = "center"} # Calculate median trees median1 <- median(batch1) median2 <- median(batch2) # Compute distance from each tree to the median of its batch dist1 <- ClusteringInfoDist(batch1, median1) dist2 <- ClusteringInfoDist(batch2, median2) # Set resolution of histogram nBreaks <- 10 breaks <- seq(0, max(dist1, dist2), length.out = nBreaks) # Plot first distance set hist(dist1, col = "#00000022", breaks = breaks, main = "Distance from median of batch", xlab = "Clustering information distance", ylim = c(0, 25) # Omit this line to infer Y axis limit from first batch. ) # Add second distance set hist(dist2, col = "#ff000022", breaks = breaks, add = TRUE) # Add legend legend("topleft", c("Batch 1", "Batch 2"), fill = c("#00000022", "#ff000022")) ``` In the plotted example, distances to the median tree are greater for batch 2 than batch 1, indicating a more dispersed set of trees that occupies a greater hypervolume. Note that the increased frequency at higher distances is expected: the outer shell of a sphere contains more volume than a layer of equivalent thickness closer to the centre, and this phenomenon becomes more pronounced as the dimensionality of tree space increases. ### Consensus resolution A complementary approach is to identify the resolution of the consensus of each batch of trees. This approach shares many of the [problems with the Robinson--Foulds distance]( https://ms609.github.io/TreeDist/articles/Robinson-Foulds.html): in particular, resolution can be decimated by a single "rogue" taxon whose position is poorly defined [@SmithRogue]. Detecting and removing rogue taxa can provide a more meaningful point of comparison. ```{r rogue-detection} # Create tree set with a rogue taxon batch3 <- AddTipEverywhere(as.phylo(7, 7), "t8") # Set up plotting area par(mfrow = c(2, 2), mar = rep(0.4, 4)) # Plot naive strict consensus plot(consensus(batch1, p = 1)) plot(consensus(batch3, p = 1)) if (requireNamespace("Rogue", quietly = TRUE)) { cons1 <- ConsensusWithout(batch1, p = 1, Rogue::QuickRogue(batch1, p = 1)[-1, "taxon"]) cons3 <- ConsensusWithout(batch3, p = 1, Rogue::QuickRogue(batch3, p = 1)[-1, "taxon"]) # The information content of each tree gives a measure of its resolution, # accounting for omitted rogue leaves SplitwiseInfo(cons1) # 8.5 bits SplitwiseInfo(cons3) # 15.1 bits: higher resolution indicates that these # trees are more similar, notwithstanding rogue taxa. # Plot the trees plot(cons1) plot(cons3) } else { message("The package 'Rogue' is required to run this example.") } ``` Whereas a direct interpretation of this analysis is not straightforward, it can provide a complementary way of understanding the distribution of trees across tree space. ## References