Travelogue: ASM NGS 2022, Baltimore

On October 16th around 6 AM the plane taking me to Baltimore, Maryland took off from the Hobby airport in Houston. I generally do not mind very early morning flights, and since I still retain the magical (and much envied) ability to sleep on the planes the commute was fine. I arrive in BWI around 10 AM and after picking up my luggage headed directly for the Sheraton Inner Harbor hotel. The hotel did not have an early check-in option available, so I dropped off my bags and went for a walk. I found a neat coffee shop and spent a bit of time catching up on some reading.

Inner harbor in Baltimore, Maryland, around 10:45 AM. Not the best shot, but I managed to catch a bird in this one, so I am posting it.
Inner harbor in Baltimore, Maryland, still around 10:45 AM. Still not the best shot, but also I didn’t take any particularly good ones this time. I guess being awake since 3 AM ain’t helpful for my artistic eye.
Washington monument at Mt. Vernon in Baltimore, Maryland, circa noon. Quality of the shots improves as the day goes on.

After the coffee break, I met up with Evan Mata and we caught up on school life, travel stories, and I finally got my oyster fix. Two key takeaways from our lunch were as follows: (1) we both wish we took more math courses in college (and in fact this appears to be a common trend for many of my college friends), and (2) Maryland oysters are still delicious, albeit not as cheap as I remembered.

After the lunch I made it back to the hotel where I finally could check in and catch my breath before the opening keynotes for the ASM NGS 2022. Besides the keynotes, and a brief chat over dinner the first day was not jam packed with activities, so I was able to crash early and catch up on some of the sleep that I missed due to the early flight.

Next day I attended a series of the talks in the morning, after which I headed for the University of Maryland, College Park. UMD campus is quite expansive, and in some strange way it reminded me of the University of Tennessee in Knoxville. The computer science building at UMD is quite impressive both in size and design, and many offices tend to have nice floor to ceiling windows, which is an awesome touch. I did not have too much time to explore the campus, as the visit was relatively tightly packed with meetings, and as soon as those were wrapped up I headed back to Baltimore.

E. A. Fernandez IDEA Factory on the University of Maryland College Park campus. The only shot I took at the UMD campus (sadly, since the campus is quite pretty IMHO).

Although, I unfortunately missed most of the afternoon talks on the first day, I still managed to catch the poster session. In general, I am a big fan of poster sessions. The main benefit of a poster (as opposed to a talk) is the ability to ask a multitude of questions about the nitty-gritty details of the work without feeling guilty of encroaching on other people’s time. I found a good number of quite interesting posters relating to SARS-CoV-2, bacteria and horizontal gene transfer, and even a poster on a speed up for neighbor-joining algorithm. After the poster session I briefly caught up with Mike Nute over dinner which importantly included a crab cake (another checklist food item for me on this trip).

Grabbing dinner with Mike Nute at Phillips Seafood in Baltimore, Maryland. Quite solid crab cake and overall good spread of seafood. Very enjoyable dinner option not too far from the conference hotel location.

The crab cake ended up being quite good, although I have a feeling that with a bit of effort I can probably make a comparable if not a better one at home. However, crab meat is not cheap to come by, so I am not likely to test that hypothesis any time in the near future.

Tuesday was a full conference day for me. I attended all of the talks that day, and tried to keep a list of close notes for most of them. Unfortunately the tables at the conference did not have any outlets or power strips connecting to them or running along the floor. Thus, my note taking was cut short towards the second half of the afternoon, as my laptop ran out of its battery.

The last talk session of they day was a wastewater focused mini-symposium. The talks in this session were quite interesting, and indeed reflected a lot of my experiences and concerns when dealign with wastewater data. Namely, issues of the data quality, metadata annotation, and general complexity of the wastewater samples, which in certain cases ends up being overlooked. I was very happy that I was able to give a talk in this session as well. I was presenting the joint work between Treangen and Stadler labs at Rice University in conjunction with the Houston Health Department. My talk focused on QuaID: a software package we developed for sensitive detection of recently emerged SARS-CoV-2 variants in wastewater sequencing data (preprinted at medRxiv).

My first conference talk looked something like this. ASM NGS 2022, Day 2, wastewater session.
Some scallops (which were yummy) at the Watershed in Baltimore, Maryland. Post talk dinner feels good, although sitting on a rooftop terrace might not have been the best decision given it was in the low 50 F range?

Overall the talk went very well, and I enjoyed the experience. Of course, as any public performance giving the talk came with a fair deal of stage fright and nervousness. However, extensive practice runs with my labmates, adviser, and collaborators helped a lot in mitigating anxiety, and polishing the presentation content. As per usual, I do think that several moments could have been executed better on my part, but I’ll hope to use that as the knowledge for preparing my next presentations.

The evening continued on into another poster session with a fair number of exciting wastewater and clinical SARS-CoV-2 work, as well as some benchmarking studies of interest. I felt that the second poster session was a tad bit more lively than the first one, but it also could have been an artifact of my post-talk state. After the poster session, I joined a large-ish group of attendees for dinner and chats. Scallops were not on my Baltimore food checklist, but among the seafood options offered at the Watershed they stood out as an interesting choice. Overall I enjoyed them, although next time I will probably go for the crab cakes.

The last day of the conference was relatively short with only a morning session of the talks. There was a plenty of interesting talks this day, and I was quite pleasantly surprised to hear several talks involving deep learning in a principled and well motivated way. The only drawback of this session was the amount of lightning talks given back to back. It was at times hard to properly note down all the details, and due to the format and the fact that this was the last day it was harder to follow up with the presenters to discuss some of the finer details of their work. However, I still managed to take a good amount of notes, and expand my “papers to read” list by a dozen or so manuscripts.

After the conference, I checked out of the hotel and headed to our last stop before the airport: Johns Hopkins University. First we stopped by the medical campus and grabbed a lunch, which consisted of an absolutely wonderful lamb stew with rice. Then we moved to the Homewood campus pictures of which are attached below.

Afghani food at Kabobi in Baltimore, Maryland near JHU Medical Campus. Amazing lamb stew with veggies, a very hearty and comfortable meal.

Being on JHU Homewood campus reminded me of one of the greatest pleasures in the fall season: observing the colorful trees in a light cool breeze. However, similarly to the UMD visit, I did not have much time to wander around and observe the trees, as we mainly focused on the scheduled meetings.

At the end of the trip I was absolutely exhausted, but overall happy with the experience. I managed to tick off several of the key food checklist items on this visit. Of course more importantly, I really enjoyed attending the ASM NGS conference, and finally being able to experience an in-person conference. Giving a talk was a huge added benefit, and all the meetings, lunches, and dinners with multiple new and old acquaintances and collaborators (some of whom I finally got to see in person) made this a very memorable trip.

Getting back to Houston late at night I was reminded that this is a spooky season 🎃!

Phylogenetic Vignettes: UPGMA and Neighbor-joining

In an effort to learn more about phylogenetics and practice my writing skills I am starting a mini-series of phylogenetic vignettes in which I will explore different concepts in phylogenetic analysis and their implementations. This series is intended as a partial account of my journey in learning some of the topics in phylogenetics. The resulting posts are not intended to be complete explorations of the corresponding questions, and will be presented at the current level of author’s understanding which can be incomplete. The intended audience for these vignettes are undergraduate and graduate students with solid understanding of basic concepts in discrete mathematics, as well as some knowledge of analysis and probability. I will assume no deep knowledge of evolutionary biology, in part due to my own background lacking strong biology foundation.

Preliminaries

We will follow the notation adopted in Huson, Rupp, and Scornavacca, 2010.

NotationDescription
\mathcal{X}a finite set of taxa (for non-bio folks: a set of labels)
Ta unrooted phylogenetic tree, i.e. a undirected connected graph G=(V, E) with no cycles and in which \mathrm{deg}(v)\neq 2, \forall v\in V together with a leaf labeling \lambda:\mathcal{X}\to V that assigns exactly one taxon to each leaf and none to any internal node
\omegaa weight assignment for a given phylogenetic tree T=(V, E, \mathcal{X}, \lambda), i.e. a map \omega: E\to \mathbb{R} (in practice we are interested only in the case \omega: E\to (0,\infty))
Da pairwise distance matrix for the set of genomic sequences corresponding to the set of taxa \mathcal{X}

We will also use the term molecular clock to refer to the situations in which the rate of mutation is fixed per unit of time. Thus the molecular clock hypothesis is a useful assumption for inference of trees for the sequences that were collected at the same point in time, hence we can assume in that scenario that in the rooted phylogenetic tree distance from root \rho to every leaf is constant.

Distance based tree inference

In general, phylogenetic tree inference problem consists of inferring an optimal rooted or unrooted phylogenetic tree T given a set of sequences representing taxa \mathcal{X}. Here by optimal we mean optimization with respect to some evolutionary model over the full parameter space defined by topologies of T and weights \omega.

The exact criterion with respect to which the optimization is done varies between various formulations of the problem. However, one thing that is common to all formulations is the prohibitively large size of the search space. A simple calculation shows that the number of bifurcating trees (i.e. every non-leaf node has exactly two children or alternatively every non-leaf has degree 3) on n taxa is (2n-3)!!.

In the case of distance based phylogenetic tree inference our input is the n\times n pairwise distance matrix D. We will explore different options for measuring distance between input sequences in a later section.

Least squares

We can start with an attempt at defining an optimization goal for our tree inference problem given the input distance matrix D. Intuitively the goal here is to find a tree that best approximates (explains) the observed distances. More formally, given a phylogenetic tree T with edge weights \omega we can define the tree-induced metric on the set of taxa \mathcal{X} as d_T(x, y)=\sum_{e\in x\sim y}\omega(e), where we use the notation x\sim y to denote the path between leaves x, y of the tree (in a tree any two nodes are connected by a single path). Thus, we want to minimize the error \parallel D-D_T\parallel under some norm. One possible choice of the norm is the Frobenius norm, which in turn corresponds to unweighted least squares tree inference. As noted in Felsenstein, 2004 the least squares formulation can be weighted, i.e. the function to minimize becomes \sum_{i=1}^n\sum_{j=1}^nw_{ij}(D_{ij}-{D_T}_{ij})^2 with the choice w_{ij}=1 proposed by Cavalli-Sforza and Edwards, 1967, w_{ij}=1/D^2_{ij} by Fitch and Margoliash, 1967, and w_{ij}=1/D_{ij} by Beyer et al., 1974.

While for a fixed tree topology the problem boils down to least squares optimization, the general problem of tree inference under this model requires searching over the large space of trees. Thus, while the formulation behind this method is sound and appealing, in practice we tend to rely on heuristics when searching the tree space.

Instead of fully exploring least squares method and its derivatives and search heuristics we will now pivot to a different approach in which instead of explicit optimization under a given metric we will instead focus on direct clustering goal.

UPGMA

The first approach we will explore is called Unweighted Pair Group Method with Arithmetic Mean (UPGMA). For those familiar with clustering methods, and hierarchical clustering in particular, this name might ring a bell. It appears that the name UPGMA mostly stuck with the phylogenetics community, and in the general clustering world this approach is referred to as agglomerative (i.e. bottom-up) clustering with average linkage (scikit-learn reference, scipy reference).

Implementing UPGMA naively is straightforward and we will illustrate it in the following code snippet with the distance matrix data coming from Felsenstein, 2004 (pp. 162-166).

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

Note that the above implementation doesn’t keep track of the branch lengths when providing the output and as a result our tree has the correct topology but misses the edge weight information. Alternatively we can simply use Scipy hierarchical clustering module and achieve the same result including the branch lengths in fewer lines of code.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

One thing that becomes apparent from the second plot is that the resulting tree is an ultrametric tree.

In general we call a metric space (M, d) and ultrametric space if in addition to the standard properties the metric d satisfies the strong triangle inequality d(x, y)\leq\mathrm{max}\{d(x, z), d(y, z)\} for all x, y, z\in M. Ultrametric spaces are somewhat bizarre looking since every triangle in such space is isosceles. Furthermore, every point inside an open ball of positive radius is its center, and all open balls of positive radius are clopen. However, ultrametric spaces arise naturally in several settings with the one of interest to us being phylogenetic trees under the molecular clock hypothesis. Such trees are commonly referred to as ultrametric trees.

Now, one of the interesting properties of the UPGMA algorithm is that the tree constructed by this method will always be ultrametric. However, this also implies that if the original distance matrix does not induce an ultrametric on \mathcal{X} then the tree produced by the UPGMA algorithm will have a non-zero error \parallel D-D_T\parallel. For example, consider the following sample distance matrix given as an example in Huson, Rupp, and Scornavacca, 2010 (p. 53):

ABCDE
A02754
B20754
C77045
D55403
E44530

The inferred UPGMA tree will look as following

Due to the metric induced by the tree the distances between C and D, and C and E, will be equal in T while we know from the original distance matrix that the distance from C to D is supposed to be larger. Thus, UPGMA model is not broadly applicable since it does require the original distance matrix to satisfy the ultrametric condition. Conversely, Huson, Rupp, and Scornavacca, 2010 Lemma 3.13.2 shows that if the distance matrix D is ultrametric then the tree inferred by UPGMA satisfies D_T=D.

In terms of the computational complexity UPGMA requires O(|\mathcal{X}|^3|) time, which is a significant speed up when compared to brute-force search over the tree space.

Felsenstein, 2004 notes that in some scenarios, for example for closely related species, where the molecular clock is a reasonable starting assumption the method beahves well. However, in the general setting where we cannot claim a fixed rate for the whole tree UPGMA approach becomes problematic.

Neighbor-joining

Neighbor-joining is an alternative method for inferring a phylogenetic tree from the distance matrix D. Originally it was proposed by Saitou and Nei, 1987 and has since been implemented in several libraries and software packages. We will use the implementation provided in scikit-bio (in case if you get a conflict with Scipy, try downgrading Scipy to version 1.5.2), and skip discussing the details of the algorithm in this post.

Motivation for considering neighbor-joining is that it doesn’t require the underlying distance matrix to be ultrametric in order to faithfully infer a phylogenetic tree. However, in order to guarantee that the algorithm will infer a tree T with D_T=D we still require that the distance matrix is “tree-like”. This vague term can be made precise via the Definition 3.12.4 from Huson, Rupp, and Scornavacca, 2010 (p. 51) also referred to as the four-point condition: a distance matrix D is said to satisfy the four-point condition if for every quadruple of taxa x,y,z,w\in\mathcal{X} we have d(w, x)+d(y, z)\leq \mathrm{max}\{d(w,y)+d(x,z), d(w,z)+d(x,y)\}. In the case when the distance matrix D satisfies the above condition we can guarantee that D_T=D for the inferred tree T. The proof of the above claim can be found in the original paper.

Neighbor-joining has the same computational complexity as UPGMA, and both methods are agglomerative clustering approaches that differ in the distance and update rules. One interesting aspect of the neighbor joining method is that it has been shown to greedily optimize balanced minimum evolution metric on the tree space (Gascuel and Steel, 2006). We will not explore the details of the balanced minimum evolution model at this point, but we might return to the topic in later post.

Exploring sequence distance metrics

So far we only considered examples in which we had a predefined distance matrix D provided to us from the beginning. In this section we will explore a few choices for constructing D given a set of genomic sequences and compare how different distance based inference methods fair in these scenarios.

Jukes-Cantor model and distance

The simplest substitution model used in phylogenetics is the Jukes-Cantor model. Under this model we assume that the substitution rate is fixed and equal and the base frequencies are all equal to 1/4. Under these assumptions it is easy to derive the correspondence between the expected number of substitutions (i.e. evolutionary distance) and the fraction of distinct bases in the alignment (i.e. normalized Hamming distance). Thus, given two aligned sequences x, y we will refer to the quantity JC(x, y)=-\frac{3}{4}\ln\left(1-\frac{4}{3}H(x, y)\right), where H(x, y) is the normalized Hamming distance, as the Jukes-Cantor distance between x and y.

Mash distance

MinHash sketching is an effective computational technique for estimating similarity of two sets of data. Due to its computational efficiency MinHash has found its into many data intensive areas of computer science, including bioinformatics. In particular, due to the computational efficiency it’s common that we view sequences as sets (or multi-sets) of k-mers (substrings of length k, think k-grams in NLP), and hence we can compare the (Jaccard) similarity of two sequences via MinHash on the respective k-mers. However, upon closer inspection it is clear that Jaccard distance is not a good proxy for the evolutionary distance (similarly how normalized Hamming distance is not exactly the same thing as Jukes-Cantor distance). Thus, we want to compute an appropriate adjustment that turns our estimated Jaccard index into an approximate evolutionary metric. One possible approach is given by Ondov et al., 2006, Eq. 4 which defines Mash distance between two sequences (or sets of sequences) as D=-\frac{1}{k}\ln\frac{2j}{1+j} where k is the length of the k-mer and j is the estimated Jaccard index.

Comparing UPGMA and Neighbor-joining for JC distance

We will begin with a set of 5 sequences each consisting of 70 nucleotides and containing 5 single nucleotide changes with respect to the first 70 nucleotides of the N gene of the reference genome for SARS-CoV-2.

Reference:
ATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAA
Sequences:
ATTGCTGATACTGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGAACCTCAGATCCAA
ATGTTTGATAATGGACCCCAAATTCAGCGAAATGCACGCCGCATTACGTTCGGTGGACCCTCAGATTCCA
ATGTTTGATAATGGACCCCAAATTCAGCGAAATGCACCCTGCATTACGTTCGGTGGACCCTCAGATTCGA
ATGTCCGATAATGGACCCCAAATTCAGCGAAATGCACTCCGCATTACGTTTGATGGACCCTCAGATTCCA
ATGTCCGATAATGGACCCCAAATTCAGCGAAATGCACCCGGCATTACGTTTGATGGACCCTCAGATCCAA

We will label the sequences in this set A, B, C, D, and E in top to bottom order. We get the following matrix of Jukes-Cantor distances for this set:

ABCDE
A00.158 0.1580.1580.124
B0.15800.0440.0750.124
C0.1580.04400.1070.107
D0.1580.0750.10700.059
E0.1240.1240.1070.0590

It is easy to check that this matrix is not additive, and hence is also not an ultrametric. Thus, we do not expect either of the algorithms to yield an optimal tree. Nevertheless we would like to compare the two topologies we get. First we will consider the UPGMA tree:

We can see that the sequences A ends up acting as an outgroup, which is not surprising if we return and examine the mutation content within the sequence. Sequences B and C, and D and E form two clades in the UPGMA tree. Examining the alignment by eye it appears as a plausible grouping. However, we note that the distance between sequences B and D is estimated to be smaller than that between B and E, while the inferred tree does not capture this difference.

Next, we will consider the tree obtained from the neighbor-joining method.

BCDEA
(function() { var modules={}; })();

Similarly to the UPGMA tree the sequences B and C end up in the same clade in the neighbor-joining tree. Analogously sequence A appears to form an outgroup. The key difference in this case is the way in which sequences D and E are placed onto the tree. However, neighbor-joining tree also fails to capture some of the differences in the distances, for example sequence E is equidistant from B and C in the tree.

Comparing UPGMA and Neighbor-joining for Mash distance

We will start out with the same sequence setup as in the previous run, except instead of the Jukes-Cantor distances we will use Mash distances this time. We are setting the k-mer size to 7 for this scenario, and we obtain the following distance matrix:

ABCDE
A00.158 0.1580.1440.121
B0.15800.0270.0570.079
C0.1580.02700.0710.071
D0.1440.0570.07100.034
E0.1210.0790.0710.0340

Unsurprisingly this matrix is not additive either, and the distance values tend to correlate with the Jukes-Cantor distances. Thus, as expected both inferred trees have the same topologies as in the previous section, albeit slightly different branch lengths.

BCDEA
(function() { var modules={}; })();

Now, we will try to perform a similar comparison but with a different underlying dataset. We downloaded 454 complete SARS-CoV-2 genomes collected between August 1st, 2022 and August 15th, 2022 from NCBI Virus database. We then computed the Mash distances between all these sequences using 11-, 17-, 23,-, and 31-mers. Finally for each distance matrix we inferred the phylogenetic tree using UPGMA and neighbor-joining methods.

In the above dendrogram we truncated subtrees after a fixed depth level for the ease of visualization, the labels when present indicate a leaf node with the corresponding PANGO lineage assigned to it. One thing that we immediately notice in this case is that BA.2, BA.5 (and BA.4, labels not shown) form a clade, and B.1.617.2, AY.* form a clade. This matches the expected behavior since, the first clade corresponds to the BA.2-related sublineages of Omicron variant, and the second clade corresponds to Delta variant. What is slightly surprising is that BA.1 (also a sublineage of Omicron) ends up in a clade with B.1.616 and B.1 lineages, rather than next to BA.2. We also inferred a neighbor-joining based tree and adopted leaf coloring scheme similar to the one in UPGMA.

(function() { var modules={}; })();

In this tree the deeper shade of orange represents BA.5 and its sublineages, while lighter shade of orange corresponds to BA.2 and BA.4 sublineages of Omicron. Green is used for Delta, and red for everything else. One thing that becomes more apparent with this tree is that there is subclade consisting predominantly of BA.5 sequences that is at a considerable distance from the rest of the sequences in the tree.

Another interesting observation we get from this experiment is that the k-mer size used did not influence the resulting tree shape drastically. While there are some minor changes occurring as we increase the size of k, the main clades centered around BA.5 and Delta emerge in all inferred trees. This suggests a practical observation that in the case of a large amount of closely related sequences, we can apply very approximate methods (since we stacked Mash distance with UPGMA we effectively “approximated” twice) and still capture some underlying structure in the data.

Data and code availability

In addition to the two Gists shared above you can also find the Jupyter Notebook with the code used to generate figures for this post, as well as sequence data and precomputed Mash distances in the archive attached below.

What’s next

We very briefly touched upon some of the distance-based phylogenetic inference methods and checked out several trees inferred from two evolutionary distance metrics. However, we haven’t rigorously defined how exactly two tree topologies and two phylogenetic trees can be compared. We will dive into that topic over the course of the next few posts.

References

Beyer, William A., Myron L. Stein, Temple F. Smith, and Stanislaw M. Ulam. “A molecular sequence metric and evolutionary trees.” Mathematical Biosciences 19, no. 1-2 (1974): 9-25. https://doi.org/10.1016/0025-5564(74)90028-5

Cavalli-Sforza, L.L. and Edwards, A.W.F. (1967), Phylogenetic analysis: Models and estimation procedures. Evolution, 21: 550-570. https://doi.org/10.1111/j.1558-5646.1967.tb03411.x

Felsenstein, Joseph. Inferring phylogenies. Vol. 2. Sunderland, MA: Sinauer associates, 2004.

Fitch, Walter M., and Emanuel Margoliash. “Construction of phylogenetic trees.” Science 155, no. 3760 (1967): 279-284. https://doi.org/10.1126/science.155.3760.279

Gascuel, Olivier, and Mike Steel. “Neighbor-joining revealed.” Molecular biology and evolution 23, no. 11 (2006): 1997-2000. https://doi.org/10.1093/molbev/msl072

Haeckel, E. H. P. A. (1866).Generelle Morphologie der Organismen : allgemeine Grundzüge der organischen Formen-Wissenschaft, mechanisch begründet durch die von C. Darwin reformirte Decendenz-Theorie. Berlin. [Featured image]

Huson, Daniel H., Regula Rupp, and Celine Scornavacca. Phylogenetic networks: concepts, algorithms and applications. Cambridge University Press, 2010.

Ondov, Brian D., Todd J. Treangen, Páll Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren, and Adam M. Phillippy. “Mash: fast genome and metagenome distance estimation using MinHash.” Genome biology 17, no. 1 (2016): 1-14. https://doi.org/10.1186/s13059-016-0997-x

Saitou, Naruya, and Masatoshi Nei. “The neighbor-joining method: a new method for reconstructing phylogenetic trees.” Molecular biology and evolution 4, no. 4 (1987): 406-425. https://doi.org/10.1093/oxfordjournals.molbev.a040454