# Phylogenetic Vignettes: π-space

In the previous post we have introduced the notion of a CAT(0) space, and briefly talked about some of the properties of CAT(0) spaces, and in particular cubical complexes. In this post, we will take this knowledge a step further into the applied direction of phylogeny. Following Gavryushkin and Drummond, 2016 we will define a metric on the space of ultrametric phylogenetic trees, show that the resulting space is CAT(0), and then discuss some of the implications of these results. Similarly to the previous post, we will have quite a bit of mathematical notation to deal with, so I will attempt to use visual aids whenever possible to improve exposition.

## Motivation

One way of defining a metric on a space of phylogenetic trees is to embed the tree space $\mathcal{T}$ into some metric space $\mathcal{M}$. Thus, by associating every tree in the tree space with some point in the chosen metric space, we induce a pseudometric on tree space. However, such embeddings/parameterizations are not guaranteed to be “nice” with respect to the questions we aim to answer. Namely, not all embeddings are injective (multiple distinct trees can end up being mapped to the same point in $M$, hence in general we induce a pseudometric rather than a metric via embedding) although in the case of Gavryushkin and Drummond, 2016 the embedding has to be injective by definition. Additionally, not all embeddings are surjective (meaning that a path in $M$ might contain non-tree associated points) which creates a plethora of problems ranging from potential multiple distance minimizing solutions to the lack of tree space midpoints.

Thus, it is natural to formulate a list of desiderata for the embedding $p:\mathcal{T}\to\mathcal{M}$ that would enable fruitful analyses. From now onwards (unless otherwise specified) we will assume that the embedding $p$ is injective.

In order to make continuous distributions behave correctly under the pullback into the tree space, we have to require the image of the embedding to be path-connected. Note, that having a path connected image is also intuitively “nice” as the continuous tree transformations would result in a path being traced out in the model metric space. Hence, we have:

(D1) The set $\mathrm{Image}(p)$ is path-connected in $\mathcal{M}$.

However, being path-connected does not guarantee that the shortest path between the two points will lie within the image. Hence, we also want:

(D2) The set $\mathrm{Image}(p)$ is convex in $\mathcal{M}$.

We note that the injectivity criterion forces a lower bound on the dimension of $\mathcal{M}$. However, if we want to define probability measure on $\mathcal{M}$ that can be meaningfully pulled back onto $\mathcal{T}$ we need to ensure that:

(D3) $\mathrm{Image}(p)$ has the same dimension as $\mathcal{M}$.

Next, for statistical analyses to stay sound we need to ensure uniqueness of the shortest paths. This is the point in our desiderata list where you might have a brief flashback to the CAT(0) space discussion from previous week.

(D4) The space $\mathcal{M}$ is uniquely geodesic.

Finally, since we are ultimately tackling these questions from the computer science viewpoint, we have the last desiderata that ties this into the computational realm. Namely, we want the following:

(D5) Geodesics in $\mathcal{M}$ are computable.

In reality we probably want an even stronger version of (D5), which is:

(D5′) Geodesics in $\mathcal{M}$ are efficiently (i.e. polynomial time) computable.

It’s worth remarking that in a general metric space we do not necessarily have (D5) with a simple example being the halting problem reduction to shortest paths in an infinite graph.

It is also important to realize that these desiderata do not guarantee existence of such embeddings, and are rather criteria for discerning more or less useful parameterizations of the tree space.

## A quick aside

In general, we can attempt to parametrize the whole tree space (that’s precisely what Billera, Holmes and Vogtmann, 2001 do in their paper, which lies at the foundation of the connection between the geometry of non-positive curvature spaces and phylogenetic trees; we will refer to this space as the BHV space going forward) or we might focus on a specific subclass of trees such as ultrametric phylogenetic trees (recall that a tree is ultrametric is the distance from root to every leaf is the same). The caveat is that a parametrization that enjoys our desiderata (D1)-(D5′) for the whole tree space can fail to do so when restricted to the space of ultrametric trees. This is precisely the case for the BHV space parametrization. Hence, it can be of interest (motivated by the evolutionary models context) to explore parametrizations designed specifically for the space of the ultrametric trees. This is precisely what is done in Gavryushkin and Drummond, 2016, and what we are aiming to do in this post.

## π-space

We now will construct one version of a space of ultrametric trees called the π-space. To proceed consider an ultrametric tree T on n taxa, with internal nodes ordered by their time from the extant taxa. We will then parametrize the tree by mapping it to its ranked topology and a n-1-dimensional vector $\overline{\tau}=(\tau_1,...,\tau_{n-1})$, where $\tau_i$ is the time difference between the i-th and i+1-th nodes. The picture below taken from Gavryushkin and Drummond, 2016 represents this parametrization for a tree on 5 taxa.

Hence, formally we have the mapping given by $p(T)=(\mathrm{rt}(T), \overline{\tau}(T)$ that embeds the space of ultrametric phylogenetic trees into a disjoint union of $m=\frac{n!(n-1)!}{2^{n-1}}$ (Semple and Steel, 2003) non-negative real n-1-dimensional orthants (i.e. sets of the form $\mathbb{R}^{n-1}_0=\{(x_1,...,x_{n-1}\in\mathbb{R}^n|x_i\geq 0\}$). We will impose an upper bound on all orthants to turn our construction into a cubical complex for the ease of the exposition. However, the construction of π-space with orthants will still yield a CAT(0) space with the key properties and desiderata (D1)-(D5′) preserved.

Note that the ranked topology in this case is a stricter condition on the shape of the tree than just the topology constraint. In particular, the two trees pictured below have distinct ranked topologies, and hence will belong to two distinct orthants in the π-space.

Finally, to properly turn this into a cubical complex we will need to identify the isometries that glue the faces of our cubes together. However, this is rather obvious by construction, since whenever any π coordinate becomes 0 we observe a collapse of two nodes in the ranked hierarchy to the same level. Hence, for example the two trees shown above will belong to the cubes that share the $\tau_2=0$ face. Since the resulting space is a cubical complex, we can naturally consider the Euclidean metric within each cube with paths for joining points in different cubes being the sums of the respective within cube paths.

The clip below shows how a tree can vary in π-space while being restricted to a single orthant determined by its ranked topology. Note that just as we discussed above in order to stay within the orthant in π-space we must keep the ranked topology constant.

For comparison, we show how a similar ultrametric tree can vary within a single fixed orthant of a BHV-space. In this case, the topology of the tree is fixed, but the ranked topology can vary.

## Properties

After defining the π-space it is natural to ask whether it manages to achieve our desiderata. In order to check these properties we need to first figure out how many common faces two cubes in our complex can share (recall from the previous post that a theorem of M. Gromov gives characterization of CAT(0) cubical complexes in terms of face-sharing properties).

In general, similarly to the previous post by face of a cube we mean any sub-cube, thus it is useful to have a separate term for faces of codimension 1 (i.e. faces that are 1 less dimensional than the cubes). Hence, we will call faces of codimension 1 facets. In particular, any facet can be shared by 1, 2 or 3 cubes in total. If we let $t_1=0$ , then the resulting facet is only contained in the cube of corresponding rooted topology. If setting $t_i=0$ does not result in a multifurcating topology then the corresponding facet will be shared by exactly two cubes (see example in Figure 2 and/or Figure 2 of Gavryushkin and Drummond, 2016). Finally, if we get a multifurcation then the total number of cubes sharing the facet is 3.

Now, we are ready to take a stab at the critical result about the π-space, namely that it is a CAT(0) space, and hence uniquely geodesic. We will recall the theorem of M. Gromov here to reiterate the key result we need to prove.

Theorem 1. A cubical complex $K$ with the intrinsic Euclidean metric is CAT(0) if and only if $K$ is connected, simply connected, and for all natural $k$: if three $(k+2)$-cubes of $K$ share a common $k$-cube and pairwise share common distinct $(k+1)$-cubes, then they are contained within a $(k+3)$-cube in $K$.

We start by noting that the metric we introduced on our cubical complex is precisely the intrinsic Euclidean metric, and it is easy to see from our construction that the resulting complex is connected and simply connected.

To finalize the proof we need to introduce the concept of the link of a vertex v in a complex. We say that the graph G with nodes given by facets containing v and edges given by cubes that contain the two incident nodes as facets is the link of the vertex v.

First, we need to show that the cubes of the dimension (k+2) in the theorem cannot be the highest dimensional cubes in the complex. If that was the case, then the link of origin would have to contain a 3-cycle (given by the pairwise distinct (k+1)-cubes) which contradicts the result that the nearest-neighbor interchange graph does not contain 3-cycles. Thus, it follows that each of the (k+2)-cubes has at least one coordinate equal to 0. For the sake of clarity of the presentation we will assume that this is unique coordinate for each of the cubes, although the similar argument can be made in the general case. Let i, j, and r denote the respective coordinates in three cubes. We have three cases to analyze:

(i=j=r) This is impossible due to no 3-cycle property of the link of the origin.

(i=j) In this case the first two cubes must share a (k+1)-cube, and since all shared (k+1)-cubes must be pairwise distinct it cannot be the cube obtained via setting $\tau_r=0$. Hence, there exists $\tau_s$ s.t. it is greater than zero in both of the first two cubes, and is zero in their shared (k+1)-cube. Now, if the first and third cube share a (k+1)-cube then it follows that their i and r coordinates both have to be zero in the shared cube, implying that the s coordinate has to be resolved in the same way between the first and third cube. However, the same exact argument can be repeated for the second and third cube implying that the first and second cubes had to be indetical.

(all distinct) In this case the left out coordinate (r for the first and second cube, i for the second and third, j for the first and third) has to be resolved the same way in the two cubes under consideration. Now, we can construct a (k+3)-cube that contains all three cubes by taking the first cube and resolving its zero coordinate (i) the same way as it is resolved in the remaining two cubes.

This analysis concludes the proof of the property required to claim that the cubical complex we defined is indeed CAT(0). It immediately follows that the geodesics in π-space are unique. At this point we are essentially done with (D1)-(D4) and the only remaining piece is the (efficient) computability of the said geodesics.

## Efficiently computing CAT(0) geodesics

In their manuscript Gavryushkin and Drummond, 2016 indicate that the algorithm proposed by Owen and Provan, 2010 will work in π-space, and provide a link to a Java implementation hosted here:

Due to the constraints on the length and scope of the post we will not dive into how does the algorithm of Owen and Provan, 2010 work in the π-space. Instead, we will discuss the algorithm in its own dedicated post.

## Data and code availability

Code used to generate the π-space and BHV-space tree examples, as well as HTML outputs from Bokeh are provided in the following archive.

## What’s next

We spent a noticeable amount of time discussing mathematical properties of different metrics that can be imposed on tree spaces. As our next step, we will switch the lens to a computational perspective and explore the algorithm proposed by Owen and Provan, 2010. We might also touch upon more recent work in the area of the efficient computation of geodesics in CAT(0) cubical complexes. Once we are satisfied with our algorithmic solutions, we will continue the journey by exploring the BHV-space and t-space of phylogenetic trees.

## References

Billera, Louis J., Susan P. Holmes, and Karen Vogtmann. “Geometry of the space of phylogenetic trees.” Advances in Applied Mathematics 27, no. 4 (2001): 733-767. https://doi.org/10.1006/aama.2001.0759

Bridson, Martin R., and AndrΓ© Haefliger. Metric spaces of non-positive curvature. Vol. 319. Springer Science & Business Media, 2013. https://doi.org/10.1007/978-3-662-12494-9

Gavryushkin, Alex, and Alexei J. Drummond. “The space of ultrametric phylogenetic trees.” Journal of theoretical biology 403 (2016): 197-208. arXiv:1410.3544

Owen, Megan, and J. Scott Provan. “A fast algorithm for computing geodesic distances in tree space.” IEEE/ACM Transactions on Computational Biology and Bioinformatics 8, no. 1 (2010): 2-13. https://doi.org/10.1109/TCBB.2010.3

Semple, Charles, and Mike Steel. Phylogenetics. Vol. 24. Oxford University Press on Demand, 2003.