Often in biology applications we are interested not only in inferring the optimal tree under some evolutionary model, but also in comparing two different trees for the same set of taxa (or even two different trees for two overlapping but not necessarily identical sets of taxa), it is natural to ask a question about the the dissimilarity measures between trees. In general, the question of defining a metric over the space of phylogenetic trees is quite rich and interesting, and we will explore this in more detail over several posts. However, it is useful to start with the simplest case in which we compare two unrooted trees without branch lengths over a fixed set of taxa. While multiple methods have been proposed to address this question, the metric proposed by Robinson and Foulds, 1981 remains a popular choice in practice (the paper has more than 2.6k citations according to Google Scholar), and more importantly for us serves as a great starting point for the exploration of the metrics on tree spaces.
Before we dive into the details of Robinson-Foulds metric it might be useful to briefly review some of the general metric space concepts and definitions.
Review: Metric spaces
A metric space is an ordered pair where is a set and is a function satisfying
for any . Recall from the previous post that in the case when the third condition (called triangle inequality) is replaced by a stricter condition we obtain an ultrametric space.
Note that given any set we can create a simple metric on it by setting if and otherwise. This metric is called the discrete metric (and it is trivial to check that discrete metric is also an ultrametric). Discrete metric provides us with an important realization that any set can be turned into a metric space. However, it is clear that discrete metric is not useful in a practical comparison scenario.
We also note that a metric on a set gives rise to topology, so any metric space is also a topological space. The base for topology on a metric space is defined by the open balls .
In case when the underlying set is finite, we call the resulting metric space a finite metric space. Note that if we consider trees without branch lengths then the resulting set is finite, and hence the corresponding metric space is a finite metric space. This is not the case when we consider the branch lengths, but we will leave the discussion of the geometry of phylogenetic trees with branch lengths to later posts.
Given a set of taxa let us define the set of all unrooted (not necessarily bifurcating) trees with the property that all nodes of degree 1 or 2 are labeled by pairwise disjoint non-empty subsets of as . Below are a few examples of trees in for the set of taxa . Unlabeled nodes are considered to have as their label.
Conversely neither of the trees shown below are in , since they violate the labeling properties.
Note, that any unrooted phylogenetic tree on will be an element of , but not all trees in are fully resolved unrooted phylogenies.
Now, we will define two operations on named contraction (denoted by ) and decontraction (denoted ).
Contraction will consist of an edge contraction together with the union operation on the corresponding node label sets. More precisely given a tree with the labeling defined by and an edge we will define the contraction by considering and adding a vertex with and edges . Thus, we have . The figure below illustrates an example of subsequent contractions on tree in for .
Decontraction will consist of an “inverse” operation in which we will split a node by creating an edge. Note that the caveat here is that such an operation depends not only on the choice of the node to split, but also on the choice of a label set and incident edges set split. Hence, the use of notation is misleading since applying a decontraction after a contraction does not necessarily result in the original tree. Hence, to define a decontraction precisely we will define it as a function . Let where and . Let where and . Now, we can construct the tree and add two vertices to it alongside the edge and redistribute the edges previously incident on . Formally, with the labeling function defined as and for the new nodes. The figure below illustrates two of the possible ways a decontraction can be applied to tree in .
Now, we can finally define the Robinson-Foulds metric on the space of trees . We let denote the minimum number of applications of the and operations required to transform into (whenever we say that two trees are the same we mean that there exists a label preserving isomorphism).
First, let’s make sure that the distance function is well-defined. We note that any operation can be reversed via an appropriately chosen operation. Furthermore, it is clear that for any tree within the operations we will obtain a tree consisting of a single node with the label (representing all taxa) and no edges. Thus, for any two trees it follows that we can transform both to the tree with one node, and then apply appropriate inverse transformations, implying that there is a finite (and in fact bounded above by ) sequence of operations that transforms into (and vice versa). We will not prove that is indeed a metric, but the proofs of each of the properties are easy.
Robinson-Foulds metric: properties and caveats
Robinson-Foulds (RF) metric comes with several nice properties which made it ubiquitous in the world of phylogenetic analyses: (1) it can be efficiently computed (linear time in number of taxa via Day, 1985 or approximate solution in sub-linear time), and (2) it has intuitive interpretation as the number of distinct splits between the two trees (for more details see Felsenstein, 2004). However, RF metric also suffers from several issues. First, the placement of a single taxon can easily produce maximum RF distance between two otherwise identical trees. The figure below illustrates this example for two unrooted trees with 5 taxa.
Furthermore, the problem of large RF distances is a general issue with this metric. For example, we have generated 945 rooted trees on 6 taxa, and computed pairwise distance between distinct pairs of trees. The resulting histogram of normalized RF distances is shown below.
As we can see more than a third of all distances achieved the maximum possible value. Put differently, if we view the discrete metric as the least informative (everything is distance 1 away from any other point) then RF metric is only slightly better than the discrete one. This obviously presents us with a problem when we want our comparisons to provide a relative ranking of similarity with respect to a reference tree or when we want to use the distances as a tool for comparison between multiple sets of data.
There are a number of refinements of the RF metric which attempt to overcome some of the issues we pointed out above. Of particular interest is recent work of Smith, 2020 in which the author proposes an information theoretic generalization of the RF metric. Key idea in this paper is to use the information content of each split in order to weight its impact on the measured distance. Furthermore, the paper provides a set of comparisons with other generalizations of the RF metric, and as such can serve as a good overview of the topic for readers curious to learn more about the current state of the field.
Data and code availability
All code used in this post is available as a Jupyter notebook in the attached archive.
Robinson-Foulds metric serves as an important example at the start of the journey into tree geometry and comparative analyses. However, since it is purely focused on the topology of the tree it lacks expressiveness required to compare phylogenetic trees with branch lengths. Thus, we need different methods in order to compare those. Furthermore, the arising geometries (spoiler alert: there is more than one) are quite interesting in themselves. Thus, we will likely return to the topic of tree distances later on in the series.
Day, William HE. “Optimal algorithms for comparing trees with labeled leaves.” Journal of classification 2, no. 1 (1985): 7-28. https://doi.org/10.1007/BF01908061
Felsenstein, Joseph. Inferring phylogenies. Vol. 2. Sunderland, MA: Sinauer associates, 2004.
Robinson, David F., and Leslie R. Foulds. “Comparison of phylogenetic trees.” Mathematical biosciences 53, no. 1-2 (1981): 131-147. https://doi.org/10.1016/0025-5564(81)90043-2
Smith, Martin R. “Information theoretic generalized Robinson–Foulds metrics for comparing phylogenetic trees.” Bioinformatics 36, no. 20 (2020): 5007-5013. https://doi.org/10.1093/bioinformatics/btaa614