# Phylogenetic Vignettes: Robinson-Foulds metric

Often in biology applications we are interested not only in inferring the optimal tree under some evolutionary model, but also in comparing two different trees for the same set of taxa (or even two different trees for two overlapping but not necessarily identical sets of taxa), it is natural to ask a question about the the dissimilarity measures between trees. In general, the question of defining a metric over the space of phylogenetic trees is quite rich and interesting, and we will explore this in more detail over several posts. However, it is useful to start with the simplest case in which we compare two unrooted trees without branch lengths over a fixed set of taxa. While multiple methods have been proposed to address this question, the metric proposed by Robinson and Foulds, 1981 remains a popular choice in practice (the paper has more than 2.6k citations according to Google Scholar), and more importantly for us serves as a great starting point for the exploration of the metrics on tree spaces.

Before we dive into the details of Robinson-Foulds metric it might be useful to briefly review some of the general metric space concepts and definitions.

## Review: Metric spaces

A metric space is an ordered pair $(X, d)$ where $X$ is a set and $d$ is a function $d: X\times X\to \mathbb{R}$ satisfying

• $d(x, y) = 0 \iff x=y$
• $d(x, y) = d(y, x)$
• $d(x, y) \leq d(x, z) + d(y, z)$

for any $x, y, z\in X$. Recall from the previous post that in the case when the third condition (called triangle inequality) is replaced by a stricter condition $d(x, y)\leq\mathrm{max}\{d(x, z), d(y, z)\}$ we obtain an ultrametric space.

Note that given any set $X$ we can create a simple metric on it by setting $d(x, y) = 1$ if $x\neq y$ and $0$ otherwise. This metric is called the discrete metric (and it is trivial to check that discrete metric is also an ultrametric). Discrete metric provides us with an important realization that any set can be turned into a metric space. However, it is clear that discrete metric is not useful in a practical comparison scenario.

We also note that a metric on a set gives rise to topology, so any metric space is also a topological space. The base for topology on a metric space $X$ is defined by the open balls $B(x, r)=\{y\in X|d(y, x).

In case when the underlying set $X$ is finite, we call the resulting metric space a finite metric space. Note that if we consider trees without branch lengths then the resulting set is finite, and hence the corresponding metric space is a finite metric space. This is not the case when we consider the branch lengths, but we will leave the discussion of the geometry of phylogenetic trees with branch lengths to later posts.

## Robinson-Foulds metric

Given a set of taxa $\mathcal{X}$ let us define the set of all unrooted (not necessarily bifurcating) trees with the property that all nodes of degree 1 or 2 are labeled by pairwise disjoint non-empty subsets of $\mathcal{X}$ as $\mathcal{T}$. Below are a few examples of trees in $\mathcal{T}$ for the set of taxa $\mathcal{X}=\{A, B, C, D, E\}$. Unlabeled nodes are considered to have $\emptyset$ as their label.

Conversely neither of the trees shown below are in $\mathcal{T}$, since they violate the labeling properties.

Note, that any unrooted phylogenetic tree on $\mathcal{X}$ will be an element of $\mathcal{T}$, but not all trees in $\mathcal{T}$ are fully resolved unrooted phylogenies.

Now, we will define two operations on $\mathcal{T}$ named contraction (denoted by $\alpha$) and decontraction (denoted $\alpha^{-1}$).

Contraction will consist of an edge contraction together with the union operation on the corresponding node label sets. More precisely given a tree $T\in\mathcal{T}$ with the labeling defined by $L:V(T)\to\mathcal{P}(\mathcal{X})$ and an edge $\{u, v\}=e\in T$ we will define the contraction $\alpha(T, e)=T^\prime$ by considering $T^{\prime\prime}=(V\setminus\{u, v\}, E(T\setminus\{f\in E(T)|f\text{ is incident to }u{ or }v\}$ and adding a vertex $w$ with $L(w) = L(u)\cup L(v)$ and edges $E_w = \{\{t, w\}|\forall w\text{ s.t. }\{w, u\}\in E(T)\lor\{w, v\}\in E(T)\}$. Thus, we have $\alpha(T, e) = T^{\prime} = (V(T^{\prime\prime})\cup{w}, E(T^{\prime\prime})\cup{E_w})$. The figure below illustrates an example of subsequent contractions on tree in $\mathcal{T}$ for $\mathcal{X}=\{A, B, C, D, E\}$.

Decontraction will consist of an “inverse” operation in which we will split a node by creating an edge. Note that the caveat here is that such an operation depends not only on the choice of the node to split, but also on the choice of a label set and incident edges set split. Hence, the use of notation $\alpha^{-1}$ is misleading since applying a decontraction after a contraction does not necessarily result in the original tree. Hence, to define a decontraction precisely we will define it as a function $\alpha^{-1}(T, v, s, \epsilon) = T^{\prime}$. Let $s = (S_1, S_2)$ where $S_1\cup S_2=L(v)$ and $S_1\cap S_2=\emptyset$. Let $\epsilon = (E_1, E_2)$ where $E_1\cup E_2=\{e\in E(T)|v\in e\}$ and $E_1\cap E_2=\emptyset$. Now, we can construct the tree $T^{\prime\prime}=(V(T)\setminus\{v\}, E(T)\setminus(E_1\cup E_2))$ and add two vertices $u, w$ to it alongside the edge $\{u, w\}$ and redistribute the edges previously incident on $v$. Formally, $\alpha^{-1}(T, v, s, \epsilon) = T^{\prime} = (V(T^{\prime\prime})\cup\{u, w\}, E(T^{\prime\prime})\cup\{\{u, t\}|\exists e\in E_1, t\in e\land t\neq v\}\cup\{\{w, t\}|\exists e\in E_2, t\in e\land t\neq v\})$ with the labeling function defined as $L(u) = S_1$ and $L(w)=S_2$ for the new nodes. The figure below illustrates two of the possible ways a decontraction can be applied to tree in $\mathcal{T}$.

Now, we can finally define the Robinson-Foulds metric on the space of trees $\mathcal{T}$. We let $d(T_1, T_2)$ denote the minimum number of applications of the $\alpha$ and $\alpha^{-1}$ operations required to transform $T_1$ into $T_2$ (whenever we say that two trees are the same we mean that there exists a label preserving isomorphism).

First, let’s make sure that the distance function is well-defined. We note that any $\alpha$ operation can be reversed via an appropriately chosen $\alpha^{-1}$ operation. Furthermore, it is clear that for any tree $T = (V, E)$ within the $|E|$ $\alpha$ operations we will obtain a tree consisting of a single node with the label $\mathcal{X}$ (representing all taxa) and no edges. Thus, for any two trees $T_1, T_2$ it follows that we can transform both to the tree with one node, and then apply appropriate inverse transformations, implying that there is a finite (and in fact bounded above by $2|\mathcal{X}|-2$) sequence of $\alpha, \alpha^{-1}$ operations that transforms $T_1$ into $T_2$ (and vice versa). We will not prove that $d(T_1, T_2)$ is indeed a metric, but the proofs of each of the properties are easy.

## Robinson-Foulds metric: properties and caveats

Robinson-Foulds (RF) metric comes with several nice properties which made it ubiquitous in the world of phylogenetic analyses: (1) it can be efficiently computed (linear time in number of taxa via Day, 1985 or approximate solution in sub-linear time), and (2) it has intuitive interpretation as the number of distinct splits between the two trees (for more details see Felsenstein, 2004). However, RF metric also suffers from several issues. First, the placement of a single taxon can easily produce maximum RF distance between two otherwise identical trees. The figure below illustrates this example for two unrooted trees with 5 taxa.

Furthermore, the problem of large RF distances is a general issue with this metric. For example, we have generated 945 rooted trees on 6 taxa, and computed pairwise distance between distinct pairs of trees. The resulting histogram of normalized RF distances is shown below.

As we can see more than a third of all distances achieved the maximum possible value. Put differently, if we view the discrete metric as the least informative (everything is distance 1 away from any other point) then RF metric is only slightly better than the discrete one. This obviously presents us with a problem when we want our comparisons to provide a relative ranking of similarity with respect to a reference tree or when we want to use the distances as a tool for comparison between multiple sets of data.

There are a number of refinements of the RF metric which attempt to overcome some of the issues we pointed out above. Of particular interest is recent work of Smith, 2020 in which the author proposes an information theoretic generalization of the RF metric. Key idea in this paper is to use the information content of each split in order to weight its impact on the measured distance. Furthermore, the paper provides a set of comparisons with other generalizations of the RF metric, and as such can serve as a good overview of the topic for readers curious to learn more about the current state of the field.

## Data and code availability

All code used in this post is available as a Jupyter notebook in the attached archive.

## What’s next?

Robinson-Foulds metric serves as an important example at the start of the journey into tree geometry and comparative analyses. However, since it is purely focused on the topology of the tree it lacks expressiveness required to compare phylogenetic trees with branch lengths. Thus, we need different methods in order to compare those. Furthermore, the arising geometries (spoiler alert: there is more than one) are quite interesting in themselves. Thus, we will likely return to the topic of tree distances later on in the series.

## References

Day, William HE. “Optimal algorithms for comparing trees with labeled leaves.” Journal of classification 2, no. 1 (1985): 7-28. https://doi.org/10.1007/BF01908061

Felsenstein, Joseph. Inferring phylogenies. Vol. 2. Sunderland, MA: Sinauer associates, 2004.

Robinson, David F., and Leslie R. Foulds. “Comparison of phylogenetic trees.” Mathematical biosciences 53, no. 1-2 (1981): 131-147. https://doi.org/10.1016/0025-5564(81)90043-2

Smith, Martin R. “Information theoretic generalized Robinson–Foulds metrics for comparing phylogenetic trees.” Bioinformatics 36, no. 20 (2020): 5007-5013. https://doi.org/10.1093/bioinformatics/btaa614