Important announcement
I am currently deciding on the direction this post series will take after next week. Please check out the What’s next section and let me know if you have any preferences among described routes.
Recap
In the previous post (following Gavryushkin and Drummond, 2016) we have defined a geometry on the space of phylogenetic ultrametric trees and called the resulting metric space π-space. We also briefly mentioned the geometry on the space of trees defined by Billera, Holmes and Vogtmann, 2001, and referred to the resulting construction as BHV space.
In this post we will provide a brief overview of the algorithm presented in Owen and Provan, 2010, which allows us to compute the geodesics in π-space and BHV space in polynomial time (in the number of the tree leaves/taxa). For the ease of exposition we will delegate detailed overview of the BHV space to a later post. Hence, in what follows we will assume without the proof that BHV space is a CAT(0) complex and trees in BHV space are parametrized by the internal branch lengths with different orthants corresponding to different tree topologies.
Goal
As we mentioned before one of the key properties we want from a metric space structure on the tree space is the efficient computation of geodesics. Clearly given any finite metric space (complexes constructed in BHV or π-spaces are not finite, but the topologies are and the problem of distance within a fixed orthant determined by topology is easy to solve) we can simply consider all possible paths and then pick the one minimizing the length function. However, such approach yields an algorithm that will fail to finish computations even for modestly sized spaces.
Thus, our goal is to have an efficient algorithm that can find shortest path in polynomial time with respect to the number of leaves our trees can have.
CAT(0) properties
Since both BHV and π-spaces are CAT(0) we can leverage nice properties of the non-positively curved geometries to get an idea for an efficient algorithm. The key idea behind the algorithm comes from the following lemma (Lemma 2.1 in Owen and Provan, 2010, and Proposition 1.4, p. 160 in ).
Lemma 1. In CAT(0) space, every local geodesic is a geodesic.
Proof. Consider a local geodesic between points
and
. Choose a set of points
such that
. Note that it follows from the definition of a local geodesic that
is a geodesic. Now, assume by induction that
is a geodesic for all $j\leq i$. Hence, we have
by the induction hypothesis and
as noted above. Consider a corresponding model triangle
in Euclidean space (if you forgot what a model triangle is check: Bridson and Haefliger, 2013 p. 158 or previous post on CAT spaces). Let
be a point on
such that
. Since we are working in a CAT(0) space it follows that
. On the other hand, by induction hypothesis we have
. Combining this result with triangle inequality gives us
, and hence
and therefore
. From here we can conclude that
is a geodesic and hence $\Gamma_\varepsilon$ is a geodesic.
Thus, the main idea used in the paper is to start with any path between two trees (consider the path from a tree to the star tree and then back to target tree, this path called cone path is a common idea in the analyses of tree spaces; recall: Robinson-Foulds metric) and iteratively check whether it’s a local geodesic refining the path in cases when that’s not the case.
Path space geodesics
Following Owen and Provan, 2010 we will assume that the trees between which we are computing geodesics have distinct edges. We will also call two sets of edges
and
compatible if all pairs of splits associated with these edges are compatible, or equivalently if the union
defines a tree. In other words, compatible sets of edges produce “intermediate” trees between
and
. Now, for a pair of partitions
,
of
and
, respectively, we can define an associated path space given that the following property holds: for every
,
and
are compatible. To construct the path space consider the collection of orthants
given by
, where
is the mapping of a tree to the corresponding orthant. We will call the corresponding partition pair
the support, and the shortest path between
and
in
the path space geodesic. These definitions make the geodesic in the tree space a more concrete and constructable object, which of course is important for any algorithmic approach. Furthermore, the following theorem from Billera, Holmes and Vogtmann, 2001 (see Proposition 4.1) guarantees that the path space geodesics are recovering the true geodesics.
Theorem 1. For disjoint trees the geodesic is a path space geodesic for some choice of
.
We will also need the following two results, the first of which is proven in Owen, 2011 and the second in Owen and Provan, 2010 (Theorem 2.5).
Theorem 2. Let be a geodesic between
and
. Then there exist partitions
latex \mathcal{B}=\{B_1,…,B_k\} of
and
, respectively, such that
(a) form a support of path space in which
is a path space geodesic, and (b) the following condition is satisfied:
.
A path space satisfying the above condition for its support is referred to as a proper path space, and the corresponding path space geodesic is called a proper path.
Theorem 3. A proper -path
with support
is a geodesic iff for each pair of edge sets
in the support there is no nontrivial partition
,
such that
is compatible with
and
.
Theorem 3 gives us the final ingredient needed to build out an algorithm for finding BHV space geodesics. Namely, partitions of edge sets give us a way to construct proper paths and the condition on local improvement of such proper paths guarantees that we achieve a geodesic.
Algorithm overview
As we mentioned earlier the natural choice for the starting path is the cone path between the two trees and
, more formally it’s a proper path space defined by the support
and the corresponding path is given by uniform contraction of all edges which results in star tree and then a follow up uniform decontraction. Now, the algorithm will proceed iteratively at each step checking whether the condition in Theorem 3 is satisfied and if not proposing a new support and a new proper path.
In order to facilitate an efficient solution for the iterative step, the problem of checking the condition of Theorem 3 and finding a new path can be recasted as an independent set problem over a bipartite graph. Namely, we will construct the incompatibility graph latex A\subseteq E(T), B\subseteq E(T^\prime)$ as a bipartite graph with vertices given by
and edges given by the pairs
where the splits associated with
and
are incompatible. It is easy to see that for any
compatibility of
and
is equivalent to
forming an independent set in $G(E(T), E(T^\prime))$.
Thus, the condition from Theorem 3, can now be recasted as follows: Given two edge sets $A\subseteq E(T), B\subseteq E(T^\prime)$ does there exist a pair of partitions ,
such that the vertex set
is independent in
and
.
In particular, a proper path with support is geodesic iff no edge set pair
satisfies the above condition.
Now, since the complement of an independent set is a vertex cover we can reformulate the above problem as a minimum weight (by transforming the inequality condition) vertex cover problem, which can be solved on bipartite graphs in . Further analysis of the algorithm steps yields that the total runtime to find the geodesic is
. For the exact details of the proof of correctness of the algorithm and the runtime I suggest reading the original manuscript by Owen and Provan, 2010.
Data/code availability
This post has no associated custom code or data. Those looking for an implementation of the algorithm described should check out Megan Owen’s webpage or GitHub repo.
What’s next
This post concludes the beginning of our exploration of the geometry of the space of phylogenetic trees. There are several ways we can proceed from here.
First, we can finally start tackling the foundational paper of Billera, Holmes and Vogtmann, 2001. This path will likely require us to spend a few posts dissecting original BHV space and some associated results from metric geometry and CAT(0) spaces. We then can return to the paper of Gavryushkin and Drummond, 2016 and explore the t-space construction in a way similar to our exploration of the π-space. Finally, we can then continue our journey by catching up to more recent work by Megan Owen et al. on defining and computing statistics in BHV space (Brown and Owen, 2020), comparing phylogenetic trees with different leaf sets (Grindstaff and Owen, 2018), and extending the framework to phylogenetic cactuses (a type of phylogenetic network, Huber et al., 2021).
Second, we can instead pivot completely and ask a general question of what other geometries one can consider on the space of phylogenetic trees. In particular, we can take a look at algebraic geometry inspired view of the phylogenetic tree space as a tropical geometry (Monod et al., 2018), and an information geometry angle which gives rise to wald space (pun credit to the authors) of the phylogenetic trees (Garba et al., 2021). I will most likely skip the category theoretic angle for the topic (Baez and Otter, 2015), since frankly speaking I have no desire to introduce the notion of an operad.
Third, we can temporarily halt the process of investigating deeper into the various geometries on the tree space and ask more broadly what are geometries of interest in the case of phylogenetic networks. This route will require us to go back to some definitions of the phylogenetic network and related concepts, and will likely demand at least some amount of rehashing of the biological context.
Since, I am still debating which route is of the most interest to me at the moment, next week’s post will likely be sparser than usual, as I am figuring out which rabbit hole to pick as the next destination. However, if you actually would prefer me to pick one of these three directions do not hesitate to comment or reach out to me via other means of contact.
References
Baez, John C., and Nina Otter. “Operads and phylogenetic trees.” arXiv preprint arXiv:1512.03337 (2015). arXiv:1512.03337
Billera, Louis J., Susan P. Holmes, and Karen Vogtmann. “Geometry of the space of phylogenetic trees.” Advances in Applied Mathematics 27, no. 4 (2001): 733-767. https://doi.org/10.1006/aama.2001.0759
Bridson, Martin R., and AndrΓ© Haefliger. Metric spaces of non-positive curvature. Vol. 319. Springer Science & Business Media, 2013. https://doi.org/10.1007/978-3-662-12494-9
Brown, Daniel G., and Megan Owen. “Mean and variance of phylogenetic trees.” Systematic biology 69, no. 1 (2020): 139-154. https://doi.org/10.1093/sysbio/syz041, arXiv:1708.00294
Garba, Maryam Kashia, Tom MW Nye, Jonas Lueg, and Stephan F. Huckemann. “Information geometry for phylogenetic trees.” Journal of Mathematical Biology 82, no. 3 (2021): 1-39. https://doi.org/10.1007/s00285-021-01553-x, arXiv:2003.13004
Gavryushkin, Alex, and Alexei J. Drummond. “The space of ultrametric phylogenetic trees.” Journal of theoretical biology 403 (2016): 197-208. arXiv:1410.3544
Grindstaff, Gillian, and Megan Owen. “Geometric comparison of phylogenetic trees with different leaf sets.” arXiv preprint arXiv:1807.04235 (2018). arXiv:1807.04235
Huber, Katharina T., Vincent Moulton, Megan Owen, Andreas Spillner, and Katherine St John. “The space of equidistant phylogenetic cactuses.” arXiv preprint arXiv:2111.06115 (2021). arXiv:2111.06115
Monod, Anthea, Bo Lin, Ruriko Yoshida, and Qiwen Kang. “Tropical geometry of phylogenetic tree space: a statistical perspective.” arXiv preprint arXiv:1805.12400 (2018). arXiv:1805.12400
Owen, Megan. “Computing geodesic distances in tree space.” SIAM Journal on Discrete Mathematics 25, no. 4 (2011): 1506-1529. arXiv:0903.0696
Owen, Megan, and J. Scott Provan. “A fast algorithm for computing geodesic distances in tree space.” IEEE/ACM Transactions on Computational Biology and Bioinformatics 8, no. 1 (2010): 2-13. https://doi.org/10.1109/TCBB.2010.3
Pingback: Phylogenetic Vignettes: Tree Rearrangements – Nicolae's Homepage