--- title: "About the quartet distance" author: "Martin R. Smith" date: "`r Sys.Date()`" output: bookdown::pdf_document2: includes: in_header: ../inst/preamble.tex rmarkdown::html_document: bibliography: ../inst/REFERENCES.bib csl: ../inst/apa-old-doi-prefix.csl vignette: > %\VignetteIndexEntry{About the quartet distance} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r Load package, include=FALSE} require('Quartet') require('ape') # for plot.phylo ``` # Partition distances The Robinson-Foulds (RF or 'partition') metric [@Robinson1981; @Steel1993] measures the symmetric difference between two trees by adding the number of splits (i.e. groupings) that are present in tree A (but not tree B) to the number of splits present in tree B (but not tree A). It is most useful when the trees to be compared are very similar; it has a low range of integer values, and a low maximum value, limiting its ability to distinguish between trees [@Steel1993]; and it treats all splits as equivalent, even though some are more informative than others. Various other artefacts and biases limit its performance against a suite of real-world benchmarks [@Smith2020;@Smith2022]. These shortcomings can largely be mitigated through generalizations of the Robinson-Foulds distance (see R package '[TreeDist](https://ms609.github.io/TreeDist/)'), but the complementary perspective on tree similarity offered by a quartet-centred approach can be illuminating. # Quartet distances Instead of partitions, symmetric differences can be measured by counting the number of four-taxon statements (quartets) that differ between two trees [@Estabrook1985;@Day1986]. For any four tips A, B, C and D, a split on a bifurcating tree will separate tip A and either B, C or D from the other two tips. That is to say, removing all other tips from the tree will leave one of these three trees: ```{R Three-four-taxon-trees, echo=FALSE, cache=TRUE, fig.width=6, fig.asp=1/3, out.width='66%', fig.align='center'} par(mfrow = c(1, 3), mar = c(0.5, 1, 0.5, 1), cex = 1) plot(ape::read.tree(text = '((A, B), (C, D));'), tip.color = Ternary::cbPalette8[c(1, 4, 7, 5)], font = 2) plot(ape::read.tree(text = '((A, C), (B, D));'), tip.color = Ternary::cbPalette8[c(1, 7, 4, 5)], font = 2) plot(ape::read.tree(text = '((A, D), (C, B));'), tip.color = Ternary::cbPalette8[c(1, 5, 7, 4)], font = 2) ``` Thus two of the random trees below share the quartet `(A, B), (C, D)`, whereas the third does not; these four tips are divided into `(A, D), (B, C)`. ```{R Plot-a-quartet, echo=FALSE, cache=TRUE, fig.asp=1.3/3, fig.width=6, out.width='80%', fig.align='center'} par(mfrow = c(1, 3)) suppressWarnings(RNGversion("3.5.0")) # Stopgap until R 3.6.0 is widely available set.seed(7) trees7 <- lapply(logical(3), function (X) { tr <- ape::rtree(7, br = NULL) tr$edge.length <- rep(1, 12) tr$tip.label <- LETTERS[1:7] tr }) Quartet::PlotQuartet(trees7, LETTERS[1:4], cex = 1.4, font = 2) ``` There are $\binom{n}{4}$ groups of four taxa in a tree with $n$ tips; for each of these groups, one of the three trees above will be consistent with a given tree. As such, two identical trees will have a quartet distance of 0, and a random pair of trees will have an expected $\binom{n}{4} / 3$ quartets in common. Because quartets are not independent of one another, no pair of trees with six or more tips can have all $\binom{n}{4}$ quartets in common [@Steel1993]. Properties of the quartet distance are explored fully in Steel & Penny [-@Steel1993]. As quartet distances of 1 can only be accomplished for small trees (five or fewer leaves; see below), it is perhaps more appropriate to consider whether or not trees are more dissimilar than a pair of random trees, whose distance will be, on average, $\frac{2}{3}$. (Data from real trees, and comparisons with expected values of other metrics, are available (here)[https://ms609.github.io/TreeDistData/articles/09-expected-similarity.html].) # Normalization Whereas counting quartets is simple, accounting for resolution is not. Two trees will have few quartet statements in common if they are well resolved and differ in many details; or if they are poorly resolved but in perfect agreement. As such, it is important to normalize quartet distances in a meaningful fashion. A number of normalizations have been proposed [@Estabrook1985;@Day1986]; arguably the most appropriate is the Symmetric Quartet Divergence [@Smith2019], which represents the total number of quartets unique to each tree normalized against the total number of quartets that could have been resolved. The `SimilarityMetrics()` documentation page gives further details. # Asymmetric differences Metric distances are necessarily symmetric -- that is, the distance from tree A to tree B equals the distance from B to A. This behaviour is not necessarily desirable when one tree represents a known 'reference' -- such as a tree validated by independent data, or a tree used to simulate data in order to test phylogenetic reconstruction techniques. In such cases, a tree might be evaluated according to the likelihood that a randomly chosen quartet is resolved correctly by the tree, where an uncertain resolution in either the reference or comparison tree is taken as having a 1/3 chance of being correct (Asher & Smith forthcoming). More details are given at the `SimilarityMetrics()` documentation page. ## Quartet similarity in a pair of random trees On average, $\frac{1}{3}$ of the quartets resolved in a pair of random trees will match. This is because there are three quartets involving any set of four tips, each of which is equally likely to occur on a truly random tree. The below code calculates the mean proportion of matching quartets between 10 random trees (90 pairs) with 4 to 20 leaves, and the corresponding standard deviation. ```{r mean-proportion} round(vapply(4:20, function (nTip) { trees <- lapply(rep(nTip, 10), TreeTools::RandomTree) s <- ManyToManyQuartetAgreement(trees)[, , 's'] results <- s[lower.tri(s)] / choose(nTip, 4) c(mean(results), sd(results)) }, c(mean = 0, sd = 0)), 3) ``` # Independence One possible criticism of the quartet distance is that not all individual quartet statements are independent. For example, the quartet statements `AB | CD` and `AB | CE` together imply `AB | DE`. A simple count of identical quartets therefore includes some redundant information. This prevents a straightforward information theoretic interpretation of the quartet distance. # Minimum quartet similarity As a related phenomenon, when there are six or more tips in a bifurcating tree, some quartets are necessarily shared between trees. Consider the tree: ```{r example-tree} tree_a <- ape::read.tree(text = "((1, 2), (3, (4, 5)));") ``` ```{r plot-trees, fig.height=1.6, fig.width=2, echo=FALSE} par(mar = rep(0.3, 4)) plot(tree_a) ``` The only trees with no quartets in common with Tree A are symmetric with ```{r none-in-common} tree_b <- ape::read.tree(text = "((1, 5), (3, (2, 4)));") ``` ```{r plot-tree-b, fig.height=1.6, fig.width=2, echo=FALSE} par(mar = rep(0.3, 4)) plot(tree_b) ``` Now create Tree C by adding a 6th tip as a sister to tip `3` on Tree A. ```{r tree-c} tree_c <- ape::read.tree(text="((1, 2), ((3, 6), (4, 5)));") ``` ```{r Add-tip-6-to-Tree-C, fig.height=1.6, fig.width=2, echo=FALSE} par(mar = rep(0.3, 4)) plot(tree_c, tip.color = c(1,1,1,2,1,1)) ``` There's nowhere to add tip `6` to Tree B without creating a quartet that exists in Tree C. ```{r Adding-tip-6-to-Tree-B-duplicates-a-quartet, fig.height=5, fig.width=2.5, echo=FALSE} PlotApeTree <- function (text, quartet) { orig <- TreeTools::RenumberTips(tree_c, as.character(1:6)) tree <- ape::read.tree(text = text) PlotQuartet(list(orig, TreeTools::RenumberTips(tree, as.character(1:6))), quartet, overwritePar = FALSE, cex = 0.9) } par(mfrow = c(7, 2), mar = rep(0.4, 4), cex = 0.9) PlotApeTree("(((1, 6), 5), (3, (2, 4)));", c(1, 6, 4, 5)) PlotApeTree("((1, 5), (3, ((2, 6), 4)));", c(2, 6, 4, 5)) PlotApeTree("((1, 5), ((3, 6), (2, 4)));", c(3, 6, 4, 5)) PlotApeTree("((1, 5), (3, (2, (4, 6))));", c(4, 6, 1, 2)) PlotApeTree("((1, (5, 6)), (3, (2, 4)));", c(5, 6, 1, 2)) PlotApeTree("(((1, 5), 6), (3, (2, 4)));", c(1, 5, 3, 6)) PlotApeTree("((1, 5), (3, ((2, 4), 6)));", c(4, 2, 3, 6)) ``` As such, the minimum possible quartet similarity is non-zero, and becomes increasingly difficult to compute as the number of leaves rises. This fact increases the value of comparing low quartet similarity scores to the expected similarity of a pair of random trees (i.e. $\frac{1}{3}$), rather than to zero. # References