Phylogenetic Vignettes: Owen-Provan algorithm

Important announcement

I am currently deciding on the direction this post series will take after next week. Please check out the What’s next section and let me know if you have any preferences among described routes.

Recap

In the previous post (following Gavryushkin and Drummond, 2016) we have defined a geometry on the space of phylogenetic ultrametric trees and called the resulting metric space 𝛕-space. We also briefly mentioned the geometry on the space of trees defined by Billera, Holmes and Vogtmann, 2001, and referred to the resulting construction as BHV space.

In this post we will provide a brief overview of the algorithm presented in Owen and Provan, 2010, which allows us to compute the geodesics in 𝛕-space and BHV space in polynomial time (in the number of the tree leaves/taxa). For the ease of exposition we will delegate detailed overview of the BHV space to a later post. Hence, in what follows we will assume without the proof that BHV space is a CAT(0) complex and trees in BHV space are parametrized by the internal branch lengths with different orthants corresponding to different tree topologies.

Goal

As we mentioned before one of the key properties we want from a metric space structure on the tree space is the efficient computation of geodesics. Clearly given any finite metric space (complexes constructed in BHV or 𝛕-spaces are not finite, but the topologies are and the problem of distance within a fixed orthant determined by topology is easy to solve) we can simply consider all possible paths and then pick the one minimizing the length function. However, such approach yields an algorithm that will fail to finish computations even for modestly sized spaces.

Thus, our goal is to have an efficient algorithm that can find shortest path in polynomial time with respect to the number of leaves our trees can have.

CAT(0) properties

Since both BHV and 𝛕-spaces are CAT(0) we can leverage nice properties of the non-positively curved geometries to get an idea for an efficient algorithm. The key idea behind the algorithm comes from the following lemma (Lemma 2.1 in Owen and Provan, 2010, and Proposition 1.4, p. 160 in ).

Lemma 1. In CAT(0) space, every local geodesic is a geodesic.

Proof. Consider a local geodesic $\Gamma_\varepsilon$ between points $P$ and $Q$. Choose a set of points $P=P_0,P_1,...,P_k=Q$ such that $L(\Gamma_\varepsilon(P_i, P_{i+1}))\leq \varepsilon/2$. Note that it follows from the definition of a local geodesic that $\Gamma_\varepsilon(P_{i-1}, P_{i+1})$ is a geodesic. Now, assume by induction that $\Gamma_\varepsilon(P_0, P_j)$ is a geodesic for all $j\leq i$. Hence, we have $L(\Gamma_\varepsilon(P_0, P_{i-1}))=d(P_0, P_{i-1})$ by the induction hypothesis and $L(\Gamma_varepsilon(P_{i-1},P_{i+1}))=d(P_{i-1}, P_{i+1})$ as noted above. Consider a corresponding model triangle $p_0p_{i-1}p_{i+1}$ in Euclidean space (if you forgot what a model triangle is check: Bridson and Haefliger, 2013 p. 158 or previous post on CAT spaces). Let $p_i$ be a point on $p_{i-1}p_{i+1}$ such that $|p_{i-1}p_i|=d(P_{i-1}, P_i)$. Since we are working in a CAT(0) space it follows that $d(P_0, P_i)\leq |p_0p_i|$. On the other hand, by induction hypothesis we have $d(P_0, P_i) = d(P_0, P_{i-1}) + d(P_{i-1}, P_i) = |p0p_{i-1}|+|p_{i-1}p_i|$. Combining this result with triangle inequality gives us $|p0p_{i-1}|+|p_{i-1}p_i|=|p_0p_i|$, and hence $p_{i-1}\in p_0p_i$ and therefore $p_{i-1}\in p_0p_{i+1}$. From here we can conclude that $\Gamma_\varepsilon(P_0, P_{i+1}$ is a geodesic and hence $\Gamma_\varepsilon$ is a geodesic.

Thus, the main idea used in the paper is to start with any path between two trees (consider the path from a tree to the star tree and then back to target tree, this path called cone path is a common idea in the analyses of tree spaces; recall: Robinson-Foulds metric) and iteratively check whether it’s a local geodesic refining the path in cases when that’s not the case.

Path space geodesics

Following Owen and Provan, 2010 we will assume that the trees $T, T^\prime$ between which we are computing geodesics have distinct edges. We will also call two sets of edges $A\subseteq E(T)$ and $B\subseteq E(T^\prime)$ compatible if all pairs of splits associated with these edges are compatible, or equivalently if the union $A\cup B$ defines a tree. In other words, compatible sets of edges produce “intermediate” trees between $T$ and $T^\prime$. Now, for a pair of partitions $\mathcal{A}=\{A_1,...,A_k\}$, $\mathcal{B}=\{B_1,...,B_k\}$ of $E(T)$ and $E(T^\prime)$, respectively, we can define an associated path space given that the following property holds: for every $i>j$, $A_i$ and $B_j$ are compatible. To construct the path space consider the collection of orthants $\mathcal{P}(T, T^\prime)=\cup_{i=1}^k O_i$ given by $O_i=O(B_1\cup ...\cup B_i\cup A_{i+1}\cup ...\cup A_k)$, where $O(T)$ is the mapping of a tree to the corresponding orthant. We will call the corresponding partition pair $(\mathcal{A}, \mathcal{B})$ the support, and the shortest path between $T$ and $T^\prime$ in $\mathcal{P}$ the path space geodesic. These definitions make the geodesic in the tree space a more concrete and constructable object, which of course is important for any algorithmic approach. Furthermore, the following theorem from Billera, Holmes and Vogtmann, 2001 (see Proposition 4.1) guarantees that the path space geodesics are recovering the true geodesics.

Theorem 1. For disjoint trees $T, T^\prime$ the geodesic is a path space geodesic for some choice of $\mathcal{P}(T, T^\prime)$.

We will also need the following two results, the first of which is proven in Owen, 2011 and the second in Owen and Provan, 2010 (Theorem 2.5).

Theorem 2. Let $\Gamma$ be a geodesic between $T$ and $T^\prime$. Then there exist partitions $\mathcal{A}=\{A_1,...,A_k\} and$latex \mathcal{B}=\{B_1,…,B_k\} of $E(T)$ and $E(T^\prime)$, respectively, such that $(\mathcal{A}, \mathcal{B})$ (a) form a support of path space in which $\Gamma$ is a path space geodesic, and (b) the following condition is satisfied:

$\frac{\parallel A_1\parallel}{\parallel B_1\parallel}\leq \frac{\parallel A_2\parallel}{\parallel B_2\parallel}\leq\cdots\frac{\parallel A_k\parallel}{\parallel B_k\parallel}$.

A path space satisfying the above condition for its support is referred to as a proper path space, and the corresponding path space geodesic is called a proper path.

Theorem 3. A proper $(T, T^\prime)$-path $\Gamma$ with support $(\mathcal{A}, \mathcal{B})$ is a geodesic iff for each pair of edge sets $(A_i, B_i)$ in the support there is no nontrivial partition $C_1\cup C_2=A_i$, $D_1\cup D_2=B_i$ such that $C_2$ is compatible with $D_1$ and $\frac{\parallel C_1\parallel}{\parallel D_1\parallel}\leq \frac{\parallel C_2\parallel}{\parallel D_2\parallel}$.

Theorem 3 gives us the final ingredient needed to build out an algorithm for finding BHV space geodesics. Namely, partitions of edge sets give us a way to construct proper paths and the condition on local improvement of such proper paths guarantees that we achieve a geodesic.

Algorithm overview

As we mentioned earlier the natural choice for the starting path is the cone path between the two trees $T$ and $T^\prime$, more formally it’s a proper path space defined by the support $(\mathcal{A}=\{E(T)\}, \mathcal{B}=\{E(T^\prime)\})$ and the corresponding path is given by uniform contraction of all edges which results in star tree and then a follow up uniform decontraction. Now, the algorithm will proceed iteratively at each step checking whether the condition in Theorem 3 is satisfied and if not proposing a new support and a new proper path.

In order to facilitate an efficient solution for the iterative step, the problem of checking the condition of Theorem 3 and finding a new path can be recasted as an independent set problem over a bipartite graph. Namely, we will construct the incompatibility graph $G(A, B) between two edge sets$latex A\subseteq E(T), B\subseteq E(T^\prime)$as a bipartite graph with vertices given by $A\cup B$ and edges given by the pairs $e\in A, f\in B$ where the splits associated with $e$ and $f$ are incompatible. It is easy to see that for any $A\subseteq E(T), B\subseteq E(T^\prime)$ compatibility of $A$ and $B$ is equivalent to $A\cup B$ forming an independent set in$G(E(T), E(T^\prime))$. Thus, the condition from Theorem 3, can now be recasted as follows: Given two edge sets$A\subseteq E(T), B\subseteq E(T^\prime)\$ does there exist a pair of partitions $A=C_1\cup C_2$, $B=D_1\cup D_2$ such that the vertex set $C_2\cup D_1$ is independent in $G(A, B)$ and

$\frac{\parallel C_1\parallel}{\parallel D_1\parallel}\leq \frac{\parallel C_2\parallel}{\parallel D_2\parallel}$.

In particular, a proper path with support $(\mathcal{A}, \mathcal{B})$ is geodesic iff no edge set pair $(A_i\in\mathcal{A},B_i\in\mathcal{B})$ satisfies the above condition.

Now, since the complement of an independent set is a vertex cover we can reformulate the above problem as a minimum weight (by transforming the inequality condition) vertex cover problem, which can be solved on bipartite graphs in $O(n^3)$. Further analysis of the algorithm steps yields that the total runtime to find the geodesic is $O(n^4)$. For the exact details of the proof of correctness of the algorithm and the runtime I suggest reading the original manuscript by Owen and Provan, 2010.

Data/code availability

This post has no associated custom code or data. Those looking for an implementation of the algorithm described should check out Megan Owen’s webpage or GitHub repo.

What’s next

This post concludes the beginning of our exploration of the geometry of the space of phylogenetic trees. There are several ways we can proceed from here.

First, we can finally start tackling the foundational paper of Billera, Holmes and Vogtmann, 2001. This path will likely require us to spend a few posts dissecting original BHV space and some associated results from metric geometry and CAT(0) spaces. We then can return to the paper of Gavryushkin and Drummond, 2016 and explore the t-space construction in a way similar to our exploration of the 𝛕-space. Finally, we can then continue our journey by catching up to more recent work by Megan Owen et al. on defining and computing statistics in BHV space (Brown and Owen, 2020), comparing phylogenetic trees with different leaf sets (Grindstaff and Owen, 2018), and extending the framework to phylogenetic cactuses (a type of phylogenetic network, Huber et al., 2021).

Second, we can instead pivot completely and ask a general question of what other geometries one can consider on the space of phylogenetic trees. In particular, we can take a look at algebraic geometry inspired view of the phylogenetic tree space as a tropical geometry (Monod et al., 2018), and an information geometry angle which gives rise to wald space (pun credit to the authors) of the phylogenetic trees (Garba et al., 2021). I will most likely skip the category theoretic angle for the topic (Baez and Otter, 2015), since frankly speaking I have no desire to introduce the notion of an operad.

Third, we can temporarily halt the process of investigating deeper into the various geometries on the tree space and ask more broadly what are geometries of interest in the case of phylogenetic networks. This route will require us to go back to some definitions of the phylogenetic network and related concepts, and will likely demand at least some amount of rehashing of the biological context.

Since, I am still debating which route is of the most interest to me at the moment, next week’s post will likely be sparser than usual, as I am figuring out which rabbit hole to pick as the next destination. However, if you actually would prefer me to pick one of these three directions do not hesitate to comment or reach out to me via other means of contact.

References

Baez, John C., and Nina Otter. “Operads and phylogenetic trees.” arXiv preprint arXiv:1512.03337 (2015). arXiv:1512.03337

Billera, Louis J., Susan P. Holmes, and Karen Vogtmann. “Geometry of the space of phylogenetic trees.” Advances in Applied Mathematics 27, no. 4 (2001): 733-767. https://doi.org/10.1006/aama.2001.0759

Bridson, Martin R., and André Haefliger. Metric spaces of non-positive curvature. Vol. 319. Springer Science & Business Media, 2013. https://doi.org/10.1007/978-3-662-12494-9

Brown, Daniel G., and Megan Owen. “Mean and variance of phylogenetic trees.” Systematic biology 69, no. 1 (2020): 139-154. https://doi.org/10.1093/sysbio/syz041, arXiv:1708.00294

Garba, Maryam Kashia, Tom MW Nye, Jonas Lueg, and Stephan F. Huckemann. “Information geometry for phylogenetic trees.” Journal of Mathematical Biology 82, no. 3 (2021): 1-39. https://doi.org/10.1007/s00285-021-01553-x, arXiv:2003.13004

Gavryushkin, Alex, and Alexei J. Drummond. “The space of ultrametric phylogenetic trees.” Journal of theoretical biology 403 (2016): 197-208. arXiv:1410.3544

Grindstaff, Gillian, and Megan Owen. “Geometric comparison of phylogenetic trees with different leaf sets.” arXiv preprint arXiv:1807.04235 (2018). arXiv:1807.04235

Huber, Katharina T., Vincent Moulton, Megan Owen, Andreas Spillner, and Katherine St John. “The space of equidistant phylogenetic cactuses.” arXiv preprint arXiv:2111.06115 (2021). arXiv:2111.06115

Monod, Anthea, Bo Lin, Ruriko Yoshida, and Qiwen Kang. “Tropical geometry of phylogenetic tree space: a statistical perspective.” arXiv preprint arXiv:1805.12400 (2018). arXiv:1805.12400

Owen, Megan. “Computing geodesic distances in tree space.” SIAM Journal on Discrete Mathematics 25, no. 4 (2011): 1506-1529. arXiv:0903.0696

Owen, Megan, and J. Scott Provan. “A fast algorithm for computing geodesic distances in tree space.” IEEE/ACM Transactions on Computational Biology and Bioinformatics 8, no. 1 (2010): 2-13. https://doi.org/10.1109/TCBB.2010.3