In an effort to learn more about phylogenetics and practice my writing skills, I am starting a mini-series of phylogenetic vignettes in which I will explore different concepts in phylogenetic analysis and their implementations. This series is intended as a partial account of my journey in learning some of the topics in phylogenetics. The resulting posts are not intended to be complete explorations of the corresponding questions, and will be presented at the author's current level of understanding, which may be incomplete. The intended audience for these vignettes is undergraduate and graduate students with a solid understanding of basic concepts in discrete mathematics, as well as some knowledge of analysis and probability. I will assume no deep knowledge of evolutionary biology, in part because my own background lacks a strong biology foundation.

## Preliminaries

We will follow the notation adopted in Huson, Rupp, and Scornavacca, 2010.

| Notation | Description |
|---|---|
| $\mathcal{X}$ | a finite set of taxa (for non-bio folks: a set of labels) |
| $T = (V, E, \lambda)$ | an unrooted phylogenetic tree, i.e. an undirected connected graph $(V, E)$ with no cycles and no nodes of degree 2, together with a leaf labeling $\lambda$ that assigns exactly one taxon to each leaf and none to any internal node |
| $\omega$ | a weight assignment for a given phylogenetic tree $T$, i.e. a map $\omega: E \to \mathbb{R}$ (in practice we are interested only in the case $\omega: E \to \mathbb{R}_{\geq 0}$) |
| $D$ | a pairwise distance matrix for the set of genomic sequences corresponding to the set of taxa $\mathcal{X}$ |

We will also use the term *molecular clock* to refer to situations in which the rate of mutation is fixed per unit of time. The *molecular clock hypothesis* is a useful assumption when inferring trees for sequences that were collected at the same point in time: in that scenario we can assume that in the rooted phylogenetic tree the distance from the root to every leaf is the same.

## Distance based tree inference

In general, the **phylogenetic tree inference problem** consists of inferring an optimal rooted or unrooted phylogenetic tree $T$ given a set of sequences representing the taxa $\mathcal{X}$. Here by optimal we mean optimal with respect to some evolutionary model over the full parameter space defined by the topologies of $T$ and the weights $\omega$.

The exact criterion with respect to which the optimization is done varies between formulations of the problem. However, one thing that is common to all formulations is the prohibitively large size of the search space. A simple calculation shows that the number of bifurcating trees on $n$ labeled taxa is $(2n-3)!!$ in the rooted case (every non-leaf node has exactly two children) and $(2n-5)!!$ in the unrooted case (every non-leaf node has degree 3).
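To get a feel for how quickly the search space grows, the sketch below evaluates the standard double-factorial counts of leaf-labeled bifurcating trees: $(2n-5)!!$ unrooted and $(2n-3)!!$ rooted trees on $n$ taxa. The function names are illustrative and not taken from any library.

```python
def double_factorial(m: int) -> int:
    """Return m * (m - 2) * (m - 4) * ... down to 1 or 2; equals 1 for m <= 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def num_unrooted_bifurcating_trees(n: int) -> int:
    """Number of unrooted bifurcating trees on n >= 3 labeled taxa: (2n - 5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n - 5)


def num_rooted_bifurcating_trees(n: int) -> int:
    """Number of rooted bifurcating trees on n >= 2 labeled taxa: (2n - 3)!!."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n - 3)


for n in (5, 10, 20, 50):
    print(n, num_unrooted_bifurcating_trees(n), num_rooted_bifurcating_trees(n))
```

Already for 5 taxa there are 15 unrooted and 105 rooted trees, and for 10 taxa the counts are in the millions, so exhaustive search is only feasible for very small inputs.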

In the case of distance-based phylogenetic tree inference, our input is the pairwise distance matrix $D$. We will explore different options for measuring the distance between input sequences in a later section.

## Least squares

We can start with an attempt at defining an optimization goal for our tree inference problem given the input distance matrix $D$. Intuitively, the goal here is to find a tree that best approximates (explains) the observed distances. More formally, given a phylogenetic tree $T$ with edge weights $\omega$, we can define the tree-induced metric on the set of taxa $\mathcal{X}$ as $d_T(x, y) = \sum_{e \in P(x, y)} \omega(e)$, where $P(x, y)$ denotes the path between the leaves labeled $x$ and $y$ (in a tree any two nodes are connected by a single path). Thus, we want to minimize the error $\|D - d_T\|$ under some norm. One possible choice is the Frobenius norm, which corresponds to unweighted least squares tree inference. As noted in Felsenstein, 2004, the least squares formulation can be weighted, i.e. the function to minimize becomes $\sum_{x \neq y} w_{xy} \left(D(x, y) - d_T(x, y)\right)^2$, with the choice $w_{xy} = 1$ proposed by Cavalli-Sforza and Edwards, 1967, $w_{xy} = 1/D(x, y)^2$ by Fitch and Margoliash, 1967, and $w_{xy} = 1/D(x, y)$ by Beyer et al., 1974.

While for a fixed tree topology the problem boils down to least squares optimization, the general problem of tree inference under this model requires searching over the large space of trees. Thus, while the formulation behind this method is sound and appealing, in practice we tend to rely on heuristics when searching the tree space.
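To make the fixed-topology case concrete, here is a minimal sketch for a single hypothetical quartet topology ((A,B),(C,D)). Each row of `path_matrix` marks which edges lie on the path between one pair of leaves, so the tree-induced distances are `path_matrix @ weights` and the unweighted least squares fit is an ordinary linear least squares problem. The topology, the observed distances, and the helper name `fit_edge_weights` are assumptions made for illustration; the sketch also does not enforce non-negative edge weights.

```python
import numpy as np

# Edges of the hypothetical quartet ((A,B),(C,D)):
# e0: A-u, e1: B-u, e2: C-v, e3: D-v, e4: u-v (the internal edge).
pairs = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]
path_matrix = np.array(
    [
        [1, 1, 0, 0, 0],  # A-B uses e0, e1
        [1, 0, 1, 0, 1],  # A-C uses e0, e4, e2
        [1, 0, 0, 1, 1],  # A-D uses e0, e4, e3
        [0, 1, 1, 0, 1],  # B-C uses e1, e4, e2
        [0, 1, 0, 1, 1],  # B-D uses e1, e4, e3
        [0, 0, 1, 1, 0],  # C-D uses e2, e3
    ],
    dtype=float,
)


def fit_edge_weights(path_matrix: np.ndarray, observed: np.ndarray) -> np.ndarray:
    """Unweighted least squares edge weights for a fixed topology."""
    weights, *_ = np.linalg.lstsq(path_matrix, observed, rcond=None)
    return weights


# Observed distances listed in the same order as `pairs` (made-up values).
observed = np.array([3.0, 9.0, 10.0, 10.0, 11.0, 7.0])
weights = fit_edge_weights(path_matrix, observed)
residual = np.linalg.norm(path_matrix @ weights - observed)
print(weights, residual)
```

The made-up distances here are additive on this topology, so the fit is exact. The expensive part of the full problem is repeating such a fit over every candidate topology.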

Instead of fully exploring the least squares method, its derivatives, and the associated search heuristics, we will now pivot to a different approach: rather than explicitly optimizing under a given metric, we will focus directly on a clustering goal.

## UPGMA

The first approach we will explore is called Unweighted Pair Group Method with Arithmetic Mean (UPGMA). For those familiar with clustering methods, and hierarchical clustering in particular, this name might ring a bell. It appears that the name UPGMA mostly stuck with the phylogenetics community, and in the general clustering world this approach is referred to as agglomerative (i.e. bottom-up) clustering with average linkage (scikit-learn reference, scipy reference).

Implementing UPGMA naively is straightforward and we will illustrate it in the following code snippet with the distance matrix data coming from Felsenstein, 2004 (pp. 162-166).
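As a rough illustration of the naive procedure, here is a minimal sketch that tracks only the topology. It is not the snippet referenced above: the function name `upgma_topology` and the small four-taxon matrix are hypothetical stand-ins chosen for this example, not the Felsenstein data.

```python
from itertools import combinations


def upgma_topology(labels, dist):
    """Naive UPGMA that returns only the topology as nested tuples.

    labels: list of taxon names.
    dist: dict mapping frozenset({a, b}) to the distance between taxa a and b.
    """
    # Each active cluster maps its key to (subtree, number of leaves).
    clusters = {name: (name, 1) for name in labels}
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair among the currently active clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        # Average-linkage update: size-weighted mean of the two old distances.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = ((tree_a, tree_b), size_a + size_b)
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical ultrametric distances on four taxa.
labels = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
tree = upgma_topology(labels, dist)
print(tree)
```

Each iteration scans all pairs of active clusters, which is what makes this version cubic in the number of taxa.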

Note that the above implementation doesn’t keep track of the branch lengths when producing the output, and as a result our tree has the correct topology but lacks the edge weight information. Alternatively, we can simply use the SciPy hierarchical clustering module and achieve the same result, including the branch lengths, in fewer lines of code.

One thing that becomes apparent from the second plot is that the resulting tree is an *ultrametric tree*.

In general we call a metric space $(M, d)$ an *ultrametric space* if, in addition to the standard properties, the metric satisfies the strong triangle inequality $d(x, z) \leq \max\{d(x, y), d(y, z)\}$ for all $x, y, z \in M$. Ultrametric spaces are somewhat bizarre looking, since every triangle in such a space is isosceles. Furthermore, every point inside an open ball of positive radius is its center, and all open balls of positive radius are clopen. However, *ultrametric spaces* arise naturally in several settings, the one of interest to us being phylogenetic trees under the molecular clock hypothesis. Such trees are commonly referred to as *ultrametric trees*.
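The strong triangle inequality is easy to test directly on a finite distance matrix. The sketch below is a brute-force check over all ordered triples; the function name `is_ultrametric` and both example matrices are made up for illustration.

```python
from itertools import permutations


def is_ultrametric(matrix, tol=1e-9):
    """Check d(x, z) <= max(d(x, y), d(y, z)) for all triples of distinct points."""
    n = len(matrix)
    return all(
        matrix[i][k] <= max(matrix[i][j], matrix[j][k]) + tol
        for i, j, k in permutations(range(n), 3)
    )


# Hypothetical ultrametric: every triangle is isosceles with a short base.
ultra = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]

# A valid metric (5 <= 3 + 4) that is not an ultrametric (5 > max(3, 4)).
not_ultra = [
    [0, 3, 5],
    [3, 0, 4],
    [5, 4, 0],
]

print(is_ultrametric(ultra), is_ultrametric(not_ultra))
```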

Now, one of the interesting properties of the UPGMA algorithm is that the tree constructed by this method will always be ultrametric. However, this also implies that if the original distance matrix $D$ does not induce an ultrametric on $\mathcal{X}$, then the tree produced by the UPGMA algorithm will have a non-zero error $\|D - d_T\|$ between the input matrix $D$ and the tree-induced metric $d_T$. For example, consider the following sample distance matrix given as an example in Huson, Rupp, and Scornavacca, 2010 (p. 53):

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 0 | 2 | 7 | 5 | 4 |
| B | 2 | 0 | 7 | 5 | 4 |
| C | 7 | 7 | 0 | 4 | 5 |
| D | 5 | 5 | 4 | 0 | 3 |
| E | 4 | 4 | 5 | 3 | 0 |

The inferred UPGMA tree looks as follows:
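One way to reproduce this tree is sketched below with SciPy's average linkage, which is the general-purpose clustering counterpart of UPGMA. The sketch assumes that leaf-to-leaf distances in the UPGMA tree equal the cophenetic distances of the linkage (each merge is placed at half its linkage height), and that ties are broken the way SciPy happens to break them, which may differ from the book's figure.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import squareform

labels = ["A", "B", "C", "D", "E"]
D = np.array(
    [
        [0, 2, 7, 5, 4],
        [2, 0, 7, 5, 4],
        [7, 7, 0, 4, 5],
        [5, 5, 4, 0, 3],
        [4, 4, 5, 3, 0],
    ],
    dtype=float,
)

# linkage expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method="average")

# Cophenetic distances: the height at which two leaves are first merged.
tree_dist = squareform(cophenet(Z))

print(np.round(tree_dist, 3))
print("C-D in tree:", tree_dist[2, 3], "C-E in tree:", tree_dist[2, 4])
print("reproduces D exactly:", np.allclose(tree_dist, D))
```

Passing `Z` and `labels` to `scipy.cluster.hierarchy.dendrogram` draws the corresponding dendrogram when matplotlib is available.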

Because the tree is ultrametric, the distances between **C** and **D**, and between **C** and **E**, are equal in the tree-induced metric $d_T$, while we know from the original distance matrix that the distance from **C** to **E** (5) is larger than the distance from **C** to **D** (4). Thus, UPGMA is not broadly applicable, since it requires the original distance matrix to satisfy the ultrametric condition in order to reproduce it exactly. Conversely, Huson, Rupp, and Scornavacca, 2010 **Lemma 3.13.2** shows that if the distance matrix $D$ is ultrametric then the tree inferred by UPGMA satisfies $d_T = D$.

In terms of computational complexity, a straightforward implementation of UPGMA requires $O(n^3)$ time, which is a significant speed-up compared to a brute-force search over the tree space.

Felsenstein, 2004 notes that in some scenarios, for example for closely related species, where the *molecular clock* is a reasonable starting assumption, the method behaves well. However, in the general setting, where we cannot claim a fixed rate for the whole tree, the UPGMA approach becomes problematic.

## Neighbor-joining

Neighbor-joining is an alternative method for inferring a phylogenetic tree from the distance matrix $D$. It was originally proposed by Saitou and Nei, 1987 and has since been implemented in several libraries and software packages. We will use the implementation provided in scikit-bio (if you get a conflict with SciPy, try downgrading SciPy to version 1.5.2), and skip discussing the details of the algorithm in this post.

The motivation for considering neighbor-joining is that it doesn’t require the underlying distance matrix to be ultrametric in order to faithfully infer a phylogenetic tree. However, in order to guarantee that the algorithm will infer a tree whose induced metric $d_T$ equals $D$, we still require that the distance matrix is “tree-like”. This vague term can be made precise via **Definition 3.12.4** from Huson, Rupp, and Scornavacca, 2010 (p. 51), also referred to as the four-point condition: a distance matrix $D$ is said to satisfy the four-point condition if for every quadruple of taxa $w, x, y, z \in \mathcal{X}$ we have $D(w, x) + D(y, z) \leq \max\{D(w, y) + D(x, z),\ D(w, z) + D(x, y)\}$. When the distance matrix satisfies this condition, we can guarantee that $d_T = D$ for the inferred tree $T$. The proof of this claim can be found in the original paper.
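The four-point condition can also be checked by brute force. Requiring the inequality for every labeling of a quadruple is equivalent to saying that, among the three ways of pairing up four taxa, the two largest sums of distances coincide; the sketch below uses that form. The function name is illustrative, `huson` is the five-taxon matrix from the UPGMA section, and `perturbed` is a made-up variant of it.

```python
from itertools import combinations


def satisfies_four_point(matrix, tol=1e-9):
    """Return True if every quadruple satisfies the four-point condition."""
    n = len(matrix)
    for w, x, y, z in combinations(range(n), 4):
        sums = sorted(
            [
                matrix[w][x] + matrix[y][z],
                matrix[w][y] + matrix[x][z],
                matrix[w][z] + matrix[x][y],
            ]
        )
        # The two largest of the three pairing sums must be equal.
        if sums[2] - sums[1] > tol:
            return False
    return True


huson = [
    [0, 2, 7, 5, 4],
    [2, 0, 7, 5, 4],
    [7, 7, 0, 4, 5],
    [5, 5, 4, 0, 3],
    [4, 4, 5, 3, 0],
]

# Hypothetical perturbation: change the A-E distance from 4 to 6.
perturbed = [row[:] for row in huson]
perturbed[0][4] = perturbed[4][0] = 6

print(satisfies_four_point(huson), satisfies_four_point(perturbed))
```

The matrix from the UPGMA section passes this check even though it is not ultrametric, which is exactly the situation in which neighbor-joining has an advantage over UPGMA.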

Neighbor-joining has the same computational complexity as UPGMA, and both methods are agglomerative clustering approaches that differ in their distance and update rules. One interesting aspect of the neighbor-joining method is that it has been shown to greedily optimize the *balanced minimum evolution* criterion on the tree space (Gascuel and Steel, 2006). We will not explore the details of the *balanced minimum evolution* model at this point, but we might return to the topic in a later post.
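For readers who want to see the distance and update rules concretely, here is a minimal pure-Python sketch of the textbook neighbor-joining recursion (pair selection by the Q-criterion, followed by the usual branch-length and distance updates). It is not the scikit-bio implementation; the function names, the `internal<k>` node naming, and the edge-list output format are assumptions made for this example.

```python
from itertools import combinations


def neighbor_joining(labels, matrix):
    """Return the NJ tree as a list of (node, node, branch_length) edges."""
    d = {
        frozenset((labels[i], labels[j])): float(matrix[i][j])
        for i, j in combinations(range(len(labels)), 2)
    }
    nodes, edges, next_id = list(labels), [], 0
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums over the currently active nodes.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Q-criterion: join the pair minimising (n - 2) * d(a, b) - r_a - r_b.
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        d_ab = d[frozenset((a, b))]
        u = f"internal{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d_ab + (r[a] - r[b]) / (2 * (n - 2))
        edges += [(a, u, len_a), (b, u, d_ab - len_a)]
        nodes = [c for c in nodes if c not in (a, b)]
        # Distances from the new node to every remaining node.
        for c in nodes:
            d[frozenset((u, c))] = 0.5 * (
                d[frozenset((a, c))] + d[frozenset((b, c))] - d_ab
            )
        nodes.append(u)
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


def tree_distance(edges, start, goal):
    """Length of the unique path between two nodes of the tree."""
    adjacency = {}
    for a, b, w in edges:
        adjacency.setdefault(a, []).append((b, w))
        adjacency.setdefault(b, []).append((a, w))
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        stack.extend((nxt, node, dist + w) for nxt, w in adjacency[node] if nxt != parent)
    raise ValueError("nodes are not connected")


labels = ["A", "B", "C", "D", "E"]
huson = [
    [0, 2, 7, 5, 4],
    [2, 0, 7, 5, 4],
    [7, 7, 0, 4, 5],
    [5, 5, 4, 0, 3],
    [4, 4, 5, 3, 0],
]
edges = neighbor_joining(labels, huson)
for edge in edges:
    print(edge)
```

On the five-taxon matrix from the UPGMA section, the path lengths computed by `tree_distance` match the input distances, including the different C-D and C-E distances that UPGMA could not represent.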

## Exploring sequence distance metrics

So far we have only considered examples in which a predefined distance matrix $D$ was provided to us from the beginning. In this section we will explore a few choices for constructing $D$ given a set of genomic sequences and compare how different distance-based inference methods fare in these scenarios.

### Jukes-Cantor model and distance

The simplest substitution model used in phylogenetics is the Jukes-Cantor model. Under this model we assume that all substitutions occur at the same fixed rate and that the base frequencies are all equal to 1/4. Under these assumptions it is easy to derive the correspondence between the expected number of substitutions per site (i.e. evolutionary distance) and the fraction of sites at which the aligned sequences differ (i.e. normalized Hamming distance). Thus, given two aligned sequences $s_1$ and $s_2$, we will refer to the quantity $d_{JC}(s_1, s_2) = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where $p$ is the normalized Hamming distance, as the Jukes-Cantor distance between $s_1$ and $s_2$.
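The correspondence can be worked out in one step. Assuming the standard Jukes-Cantor parametrization, in which $d$ is the expected number of substitutions per site separating the two sequences, the probability $p$ that they differ at a given site saturates at 3/4, and inverting that relation gives the distance:

$$
p = \frac{3}{4}\left(1 - e^{-\frac{4}{3} d}\right)
\quad\Longrightarrow\quad
d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\,p\right),
\qquad 0 \le p < \frac{3}{4}.
$$

For example, $p = 0.1$ gives $d \approx 0.107$: the correction is small for closely related sequences and grows without bound as $p$ approaches 3/4.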

### Mash distance

MinHash sketching is an effective computational technique for estimating the similarity of two sets of data. Due to its computational efficiency, MinHash has found its way into many data-intensive areas of computer science, including bioinformatics. In particular, it is common to view sequences as sets (or multi-sets) of *k-mers* (substrings of length *k*, think *k-grams* in NLP), and hence we can compare the (Jaccard) similarity of two sequences via MinHash on the respective *k-mers*. However, upon closer inspection it is clear that Jaccard distance is not a good proxy for the evolutionary distance (similarly to how normalized Hamming distance is not exactly the same thing as Jukes-Cantor distance). Thus, we want to compute an appropriate adjustment that turns our estimated Jaccard index into an approximate evolutionary metric. One possible approach is given by Ondov et al., 2016, Eq. 4, which defines the Mash distance between two sequences (or sets of sequences) as $D_{\mathrm{Mash}} = -\frac{1}{k}\ln\frac{2j}{1 + j}$, where $k$ is the length of the *k-mer* and $j$ is the estimated Jaccard index.
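The adjustment itself is a one-liner once a Jaccard index is available. The sketch below, with illustrative function names, computes the exact Jaccard index of the two k-mer sets as a stand-in for the MinHash estimate and then applies the Mash formula. It is a simplification rather than the Mash tool: it does no sketching, does not canonicalize k-mers with respect to the reverse complement, and capping the distance at 1 when no k-mers are shared is an assumption.

```python
import math


def kmer_set(sequence, k):
    """All distinct substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_index(seq_a, seq_b, k):
    """Exact Jaccard index of the two k-mer sets (sequences must have length >= k)."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    return len(a & b) / len(a | b)


def mash_distance(jaccard, k):
    """Mash distance -(1/k) * ln(2j / (1 + j)) for a Jaccard index j."""
    if jaccard <= 0:
        return 1.0  # assumption: cap the distance when nothing is shared
    # Written as ln((1 + j) / (2j)) so that j = 1 yields 0.0 rather than -0.0.
    return math.log((1 + jaccard) / (2 * jaccard)) / k


j = jaccard_index("ACGT", "ACGA", k=2)  # k-mer sets {AC, CG, GT} and {AC, CG, GA}
print(j, mash_distance(j, k=2))
```

In a real pipeline the exact `jaccard_index` would be replaced by an estimate computed from MinHash sketches, which is where the computational savings come from.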

### Comparing UPGMA and Neighbor-joining for JC distance

We will begin with a set of 5 sequences each consisting of 70 nucleotides and containing 5 single nucleotide changes with respect to the first 70 nucleotides of the N gene of the reference genome for SARS-CoV-2.

Reference:

`ATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAA`

Sequences:

`ATTGCTGATACTGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGAACCTCAGATCCAA`

`ATGTTTGATAATGGACCCCAAATTCAGCGAAATGCACGCCGCATTACGTTCGGTGGACCCTCAGATTCCA`

`ATGTTTGATAATGGACCCCAAATTCAGCGAAATGCACCCTGCATTACGTTCGGTGGACCCTCAGATTCGA`

`ATGTCCGATAATGGACCCCAAATTCAGCGAAATGCACTCCGCATTACGTTTGATGGACCCTCAGATTCCA`

`ATGTCCGATAATGGACCCCAAATTCAGCGAAATGCACCCGGCATTACGTTTGATGGACCCTCAGATCCAA`

We will label the sequences in this set A, B, C, D, and E in top to bottom order. We get the following matrix of Jukes-Cantor distances for this set:

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 0 | 0.158 | 0.158 | 0.158 | 0.124 |
| B | 0.158 | 0 | 0.044 | 0.075 | 0.124 |
| C | 0.158 | 0.044 | 0 | 0.107 | 0.107 |
| D | 0.158 | 0.075 | 0.107 | 0 | 0.059 |
| E | 0.124 | 0.124 | 0.107 | 0.059 | 0 |
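The matrix above can be recomputed from the five sequences with a few lines of Python. This is a minimal sketch with illustrative helper names; it assumes the sequences are already aligned and of equal length, and it treats every mismatching position as a single substitution.

```python
import math

sequences = {
    "A": "ATTGCTGATACTGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGAACCTCAGATCCAA",
    "B": "ATGTTTGATAATGGACCCCAAATTCAGCGAAATGCACGCCGCATTACGTTCGGTGGACCCTCAGATTCCA",
    "C": "ATGTTTGATAATGGACCCCAAATTCAGCGAAATGCACCCTGCATTACGTTCGGTGGACCCTCAGATTCGA",
    "D": "ATGTCCGATAATGGACCCCAAATTCAGCGAAATGCACTCCGCATTACGTTTGATGGACCCTCAGATTCCA",
    "E": "ATGTCCGATAATGGACCCCAAATTCAGCGAAATGCACCCGGCATTACGTTTGATGGACCCTCAGATCCAA",
}


def normalized_hamming(s, t):
    """Fraction of aligned positions at which the two sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t)) / len(s)


def jukes_cantor(s, t):
    """Jukes-Cantor distance -(3/4) * ln(1 - (4/3) * p)."""
    p = normalized_hamming(s, t)
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


labels = sorted(sequences)
matrix = [
    [round(jukes_cantor(sequences[a], sequences[b]), 3) for b in labels]
    for a in labels
]
for label, row in zip(labels, matrix):
    print(label, row)
```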

It is easy to check that this matrix is not additive, and hence it is also not ultrametric. Thus, we do not expect either of the algorithms to yield an optimal tree. Nevertheless, we would like to compare the two topologies we get. First we will consider the UPGMA tree:

We can see that sequence A ends up acting as an outgroup, which is not surprising if we go back and examine the mutation content of that sequence. Sequences B and C, and D and E, form two clades in the UPGMA tree. Examining the alignment by eye, this appears to be a plausible grouping. However, we note that the distance between sequences B and D is estimated to be smaller than that between B and E, while the inferred tree does not capture this difference.

Next, we will consider the tree obtained from the neighbor-joining method.

Similarly to the UPGMA tree, sequences B and C end up in the same clade in the neighbor-joining tree. Analogously, sequence A appears to form an outgroup. The key difference in this case is the way in which sequences D and E are placed onto the tree. However, the neighbor-joining tree also fails to capture some of the differences in the distances; for example, sequence E is equidistant from B and C in the tree.

### Comparing UPGMA and Neighbor-joining for Mash distance

We will start out with the same sequence setup as in the previous run, except instead of the Jukes-Cantor distances we will use Mash distances this time. We are setting the *k-mer* size to 7 for this scenario, and we obtain the following distance matrix:

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 0 | 0.158 | 0.158 | 0.144 | 0.121 |
| B | 0.158 | 0 | 0.027 | 0.057 | 0.079 |
| C | 0.158 | 0.027 | 0 | 0.071 | 0.071 |
| D | 0.144 | 0.057 | 0.071 | 0 | 0.034 |
| E | 0.121 | 0.079 | 0.071 | 0.034 | 0 |

Unsurprisingly this matrix is not additive either, and the distance values tend to correlate with the Jukes-Cantor distances. Thus, as expected both inferred trees have the same topologies as in the previous section, albeit slightly different branch lengths.

Now, we will try to perform a similar comparison but with a different underlying dataset. We downloaded 454 complete SARS-CoV-2 genomes collected between August 1st, 2022 and August 15th, 2022 from the NCBI Virus database. We then computed the Mash distances between all these sequences using 11-, 17-, 23-, and 31-mers. Finally, for each distance matrix we inferred the phylogenetic tree using the UPGMA and neighbor-joining methods.

In the above dendrogram we truncated subtrees after a fixed depth for ease of visualization; the labels, when present, indicate a leaf node with the corresponding PANGO lineage assigned to it. One thing that we immediately notice in this case is that BA.2 and BA.5 (and BA.4, labels not shown) form a clade, and B.1.617.2 and AY.* form a clade. This matches the expected behavior, since the first clade corresponds to the BA.2-related sublineages of the Omicron variant, and the second clade corresponds to the Delta variant. What is slightly surprising is that BA.1 (also a sublineage of Omicron) ends up in a clade with the B.1.616 and B.1 lineages, rather than next to BA.2. We also inferred a neighbor-joining tree and adopted a leaf coloring scheme similar to the one used for UPGMA.

In this tree the deeper shade of orange represents BA.5 and its sublineages, while the lighter shade of orange corresponds to the BA.2 and BA.4 sublineages of Omicron. Green is used for Delta, and red for everything else. One thing that becomes more apparent with this tree is that there is a subclade consisting predominantly of BA.5 sequences that is at a considerable distance from the rest of the sequences in the tree.

Another interesting observation from this experiment is that the *k-mer* size did not drastically influence the resulting tree shape. While there are some minor changes as we increase *k*, the main clades centered around BA.5 and Delta emerge in all inferred trees. This suggests that, for a large number of closely related sequences, we can apply very approximate methods (since we stacked Mash distance with UPGMA we effectively “approximated” twice) and still capture some underlying structure in the data.

## Data and code availability

In addition to the two Gists shared above you can also find the Jupyter Notebook with the code used to generate figures for this post, as well as sequence data and precomputed Mash distances in the archive attached below.

## What’s next

We very briefly touched upon some of the distance-based phylogenetic inference methods and checked out several trees inferred from two evolutionary distance metrics. However, we haven’t rigorously defined how exactly two tree topologies and two phylogenetic trees can be compared. We will dive into that topic over the course of the next few posts.

## References

Beyer, William A., Myron L. Stein, Temple F. Smith, and Stanislaw M. Ulam. “A molecular sequence metric and evolutionary trees.” *Mathematical Biosciences* 19, no. 1-2 (1974): 9-25. https://doi.org/10.1016/0025-5564(74)90028-5

Cavalli-Sforza, L. L., and A. W. F. Edwards. “Phylogenetic analysis: models and estimation procedures.” *Evolution* 21 (1967): 550-570. https://doi.org/10.1111/j.1558-5646.1967.tb03411.x

Felsenstein, Joseph. *Inferring phylogenies*. Vol. 2. Sunderland, MA: Sinauer associates, 2004.

Fitch, Walter M., and Emanuel Margoliash. “Construction of phylogenetic trees.” *Science* 155, no. 3760 (1967): 279-284. https://doi.org/10.1126/science.155.3760.279

Gascuel, Olivier, and Mike Steel. “Neighbor-joining revealed.” *Molecular biology and evolution* 23, no. 11 (2006): 1997-2000. https://doi.org/10.1093/molbev/msl072

Haeckel, E. H. P. A. *Generelle Morphologie der Organismen: allgemeine Grundzüge der organischen Formen-Wissenschaft, mechanisch begründet durch die von C. Darwin reformirte Decendenz-Theorie.* Berlin, 1866. [Featured image]

Huson, Daniel H., Regula Rupp, and Celine Scornavacca. *Phylogenetic networks: concepts, algorithms and applications.* Cambridge University Press, 2010.

Ondov, Brian D., Todd J. Treangen, Páll Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren, and Adam M. Phillippy. “Mash: fast genome and metagenome distance estimation using MinHash.” *Genome biology* 17, no. 1 (2016): 1-14. https://doi.org/10.1186/s13059-016-0997-x

Saitou, Naruya, and Masatoshi Nei. “The neighbor-joining method: a new method for reconstructing phylogenetic trees.” *Molecular biology and evolution* 4, no. 4 (1987): 406-425. https://doi.org/10.1093/oxfordjournals.molbev.a040454