Phylogenetic Vignettes: UPGMA and Neighbor-joining

In an effort to learn more about phylogenetics and practice my writing skills I am starting a mini-series of phylogenetic vignettes, in which I will explore different concepts in phylogenetic analysis and their implementations. This series is intended as a partial account of my journey in learning some of the topics in phylogenetics. The resulting posts are not intended to be complete explorations of the corresponding questions, and will be presented at the author's current level of understanding, which may be incomplete. The intended audience for these vignettes is undergraduate and graduate students with a solid understanding of basic concepts in discrete mathematics, as well as some knowledge of analysis and probability. I will assume no deep knowledge of evolutionary biology, in part because my own background lacks a strong foundation in biology.

Preliminaries

We will follow the notation adopted in Huson, Rupp, and Scornavacca, 2010.

\mathcal{X}: a finite set of taxa (for non-bio folks: a set of labels)

T: an unrooted phylogenetic tree, i.e. an undirected connected graph G=(V, E) with no cycles and with \mathrm{deg}(v)\neq 2 for all v\in V, together with a leaf labeling \lambda:\mathcal{X}\to V that assigns exactly one taxon to each leaf and none to any internal node

\omega: a weight assignment for a given phylogenetic tree T=(V, E, \mathcal{X}, \lambda), i.e. a map \omega: E\to \mathbb{R} (in practice we are interested only in the case \omega: E\to (0,\infty))

D: a pairwise distance matrix for the set of genomic sequences corresponding to the set of taxa \mathcal{X}

We will also use the term molecular clock to refer to situations in which the rate of mutation per unit of time is fixed. The molecular clock hypothesis is a useful assumption when inferring trees for sequences that were collected at the same point in time: in that scenario we can assume that in the rooted phylogenetic tree the distance from the root \rho to every leaf is the same.

Distance based tree inference

In general, the phylogenetic tree inference problem consists of inferring an optimal rooted or unrooted phylogenetic tree T given a set of sequences representing the taxa \mathcal{X}. Here by optimal we mean optimal with respect to some evolutionary model over the full parameter space defined by the topology of T and the weights \omega.

The exact criterion with respect to which the optimization is done varies between formulations of the problem. However, one thing common to all formulations is the prohibitively large size of the search space. A simple calculation shows that the number of bifurcating trees (i.e. trees in which every non-leaf node has exactly two children, or alternatively every non-leaf node has degree 3) on n taxa is (2n-3)!!.
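To get a feel for how fast this grows, here is a quick sketch (plain Python, function name my own) tabulating (2n-3)!! for a few values of n:

```python
def num_bifurcating_trees(n):
    """Number of unrooted bifurcating tree topologies on n >= 3 taxa: (2n-3)!!"""
    result = 1
    for k in range(2 * n - 3, 1, -2):   # product of odd numbers 3, 5, ..., 2n-3
        result *= k
    return result

for n in [3, 5, 10, 20]:
    print(n, num_bifurcating_trees(n))
```

Already at 10 taxa there are over 34 million topologies, which makes exhaustive search hopeless for realistic inputs.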

In the case of distance based phylogenetic tree inference our input is the n\times n pairwise distance matrix D. We will explore different options for measuring distance between input sequences in a later section.

Least squares

We can start with an attempt at defining an optimization goal for our tree inference problem given the input distance matrix D. Intuitively the goal here is to find a tree that best approximates (explains) the observed distances. More formally, given a phylogenetic tree T with edge weights \omega we can define the tree-induced metric on the set of taxa \mathcal{X} as d_T(x, y)=\sum_{e\in x\sim y}\omega(e), where we use the notation x\sim y to denote the path between leaves x, y of the tree (in a tree any two nodes are connected by a single path). Thus, we want to minimize the error \parallel D-D_T\parallel under some norm. One possible choice of the norm is the Frobenius norm, which in turn corresponds to unweighted least squares tree inference. As noted in Felsenstein, 2004 the least squares formulation can be weighted, i.e. the function to minimize becomes \sum_{i=1}^n\sum_{j=1}^nw_{ij}(D_{ij}-{D_T}_{ij})^2 with the choice w_{ij}=1 proposed by Cavalli-Sforza and Edwards, 1967, w_{ij}=1/D^2_{ij} by Fitch and Margoliash, 1967, and w_{ij}=1/D_{ij} by Beyer et al., 1974.
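For a fixed tree (i.e. fixed D_T) these objectives are straightforward to evaluate; the following sketch (function and scheme names are my own, not from the cited papers) computes the weighted least-squares error for the three weighting choices above:

```python
import numpy as np

def ls_error(D, D_T, scheme="cavalli"):
    """Weighted least-squares error sum over i < j of w_ij (D_ij - D_T_ij)^2.

    scheme: 'cavalli' -> w_ij = 1 (unweighted, matches the Frobenius norm up to
    a factor of 2), 'fitch' -> w_ij = 1/D_ij^2, 'beyer' -> w_ij = 1/D_ij.
    """
    D, D_T = np.asarray(D, float), np.asarray(D_T, float)
    i, j = np.triu_indices(len(D), k=1)          # off-diagonal pairs i < j
    d, d_t = D[i, j], D_T[i, j]
    w = {"cavalli": np.ones_like(d), "fitch": 1.0 / d**2, "beyer": 1.0 / d}[scheme]
    return float(np.sum(w * (d - d_t) ** 2))
```

For example, with observed distance 2 and tree-induced distance 3 between a single pair of taxa, the three schemes give errors 1, 1/4, and 1/2 respectively.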

While for a fixed tree topology the problem boils down to least squares optimization, the general problem of tree inference under this model requires searching over the large space of trees. Thus, while the formulation behind this method is sound and appealing, in practice we tend to rely on heuristics when searching the tree space.

Instead of fully exploring least squares method and its derivatives and search heuristics we will now pivot to a different approach in which instead of explicit optimization under a given metric we will instead focus on direct clustering goal.

UPGMA

The first approach we will explore is called Unweighted Pair Group Method with Arithmetic Mean (UPGMA). For those familiar with clustering methods, and hierarchical clustering in particular, this name might ring a bell. It appears that the name UPGMA mostly stuck with the phylogenetics community, and in the general clustering world this approach is referred to as agglomerative (i.e. bottom-up) clustering with average linkage (scikit-learn reference, scipy reference).

Implementing UPGMA naively is straightforward and we will illustrate it in the following code snippet with the distance matrix data coming from Felsenstein, 2004 (pp. 162-166).

(Code gist: a naive UPGMA implementation.)
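A minimal naive implementation might look as follows (this is a sketch of my own, not the original gist: clusters are kept as nested tuples, distances are averaged with size weights, and only the topology is returned, without branch lengths):

```python
def upgma(D, labels):
    """Naive UPGMA: repeatedly merge the closest pair of clusters, averaging
    distances weighted by cluster sizes. D is a symmetric distance matrix
    (list of lists); returns the tree topology as nested tuples."""
    clusters = {i: (lbl, 1) for i, lbl in enumerate(labels)}  # id -> (subtree, size)
    dist = {(i, j): D[i][j]
            for i in range(len(labels)) for j in range(i + 1, len(labels))}
    next_id = len(labels)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)                        # closest pair
        del dist[(i, j)]
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        for k in list(clusters):                              # weighted average linkage
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = (ni * d_ik + nj * d_jk) / (ni + nj)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    return next(iter(clusters.values()))[0]

print(upgma([[0, 2, 6, 6], [2, 0, 6, 6], [6, 6, 0, 4], [6, 6, 4, 0]],
            list("ABCD")))   # (('A', 'B'), ('C', 'D'))
```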

Note that the above implementation doesn't keep track of the branch lengths when producing the output, and as a result our tree has the correct topology but misses the edge weight information. Alternatively, we can simply use the SciPy hierarchical clustering module and achieve the same result, including the branch lengths, in fewer lines of code.

(Code gist: UPGMA via SciPy hierarchical clustering.)
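The SciPy route can be sketched as follows (since the Felsenstein matrix is not reproduced here, we use the 5-taxon example matrix from Huson, Rupp, and Scornavacca that appears below as a stand-in):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0, 2, 7, 5, 4],
              [2, 0, 7, 5, 4],
              [7, 7, 0, 4, 5],
              [5, 5, 4, 0, 3],
              [4, 4, 5, 3, 0]], dtype=float)

# 'average' linkage on the condensed form of D is exactly UPGMA; each row of Z
# records the two merged cluster ids, the merge height, and the new cluster size.
Z = linkage(squareform(D), method="average")
print(Z)
# scipy.cluster.hierarchy.dendrogram(Z, labels=list("ABCDE")) draws the tree
# with branch lengths (requires matplotlib).
```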

One thing that becomes apparent from the second plot is that the resulting tree is an ultrametric tree.

In general we call a metric space (M, d) an ultrametric space if, in addition to the standard properties, the metric d satisfies the strong triangle inequality d(x, y)\leq\mathrm{max}\{d(x, z), d(y, z)\} for all x, y, z\in M. Ultrametric spaces are somewhat bizarre looking, since every triangle in such a space is isosceles. Furthermore, every point inside an open ball of positive radius is its center, and all open balls of positive radius are clopen. However, ultrametric spaces arise naturally in several settings, with the one of interest to us being phylogenetic trees under the molecular clock hypothesis. Such trees are commonly referred to as ultrametric trees.
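To make the condition concrete, here is a small sketch (a helper of my own, not from the book) that tests whether a finite metric given by a matrix is an ultrametric, using the equivalent characterization that in every triple the two largest pairwise distances must coincide:

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-9):
    """A finite metric is an ultrametric iff in every triple the two largest
    of the three pairwise distances are equal (every triangle is isosceles,
    with the unequal side being the shortest)."""
    for x, y, z in combinations(range(len(D)), 3):
        a, b, c = sorted([D[x][y], D[x][z], D[y][z]])
        if abs(c - b) > tol:        # two largest must coincide
            return False
    return True

print(is_ultrametric([[0, 2, 4], [2, 0, 4], [4, 4, 0]]))   # True
```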

Now, one of the interesting properties of the UPGMA algorithm is that the tree constructed by this method will always be ultrametric. However, this also implies that if the original distance matrix does not induce an ultrametric on \mathcal{X} then the tree produced by the UPGMA algorithm will have a non-zero error \parallel D-D_T\parallel. For example, consider the following sample distance matrix given as an example in Huson, Rupp, and Scornavacca, 2010 (p. 53):

      A    B    C    D    E
A     0    2    7    5    4
B     2    0    7    5    4
C     7    7    0    4    5
D     5    5    4    0    3
E     4    4    5    3    0

The inferred UPGMA tree will look as follows:

Due to the metric induced by the tree, the distances between C and D, and between C and E, will be equal in T, while we know from the original distance matrix that the distance from C to D is supposed to be larger. Thus, the UPGMA model is not broadly applicable, since it requires the original distance matrix to satisfy the ultrametric condition. Conversely, Huson, Rupp, and Scornavacca, 2010, Lemma 3.13.2 shows that if the distance matrix D is ultrametric then the tree inferred by UPGMA satisfies D_T=D.

In terms of computational complexity, UPGMA requires O(|\mathcal{X}|^3) time, which is a significant speedup compared to a brute-force search over the tree space.

Felsenstein, 2004 notes that in some scenarios, for example for closely related species where the molecular clock is a reasonable starting assumption, the method behaves well. However, in the general setting, where we cannot claim a fixed rate across the whole tree, the UPGMA approach becomes problematic.

Neighbor-joining

Neighbor-joining is an alternative method for inferring a phylogenetic tree from the distance matrix D. It was originally proposed by Saitou and Nei, 1987 and has since been implemented in several libraries and software packages. We will use the implementation provided in scikit-bio (in case you get a conflict with SciPy, try downgrading SciPy to version 1.5.2), and skip discussing the details of the algorithm in this post.

The motivation for considering neighbor-joining is that it does not require the underlying distance matrix to be ultrametric in order to faithfully infer a phylogenetic tree. However, in order to guarantee that the algorithm infers a tree T with D_T=D we still require the distance matrix to be “tree-like”. This vague term can be made precise via Definition 3.12.4 from Huson, Rupp, and Scornavacca, 2010 (p. 51), also referred to as the four-point condition: a distance matrix D is said to satisfy the four-point condition if for every quadruple of taxa x,y,z,w\in\mathcal{X} we have d(w, x)+d(y, z)\leq \mathrm{max}\{d(w,y)+d(x,z), d(w,z)+d(x,y)\}. When the distance matrix D satisfies this condition we can guarantee that D_T=D for the inferred tree T. The proof of this claim can be found in the original paper.
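The four-point condition is easy to check directly. A small sketch (helper name my own), using the equivalent form that for every quadruple the two largest of the three pairwise sums must be equal:

```python
from itertools import combinations

def four_point(D, tol=1e-9):
    """Check the four-point condition: for every quadruple w, x, y, z the two
    largest of the sums d(w,x)+d(y,z), d(w,y)+d(x,z), d(w,z)+d(x,y) coincide.
    This holds exactly when D is additive, i.e. realizable by a tree."""
    for w, x, y, z in combinations(range(len(D)), 4):
        s = sorted([D[w][x] + D[y][z], D[w][y] + D[x][z], D[w][z] + D[x][y]])
        if abs(s[2] - s[1]) > tol:
            return False
    return True

# Distances on the tree ((A:1,B:2):1,(C:3,D:1)) satisfy the condition:
D_tree = [[0, 3, 5, 3], [3, 0, 6, 4], [5, 6, 0, 4], [3, 4, 4, 0]]
print(four_point(D_tree))   # True
```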

Neighbor-joining has the same computational complexity as UPGMA, and both methods are agglomerative clustering approaches that differ in their distance and update rules. One interesting aspect of the neighbor-joining method is that it has been shown to greedily optimize the balanced minimum evolution criterion on the tree space (Gascuel and Steel, 2006). We will not explore the details of the balanced minimum evolution model at this point, but we might return to the topic in a later post.

Exploring sequence distance metrics

So far we have only considered examples in which a predefined distance matrix D was provided to us from the beginning. In this section we will explore a few choices for constructing D given a set of genomic sequences, and compare how different distance-based inference methods fare in these scenarios.

Jukes-Cantor model and distance

The simplest substitution model used in phylogenetics is the Jukes-Cantor model. Under this model we assume that all substitution rates are fixed and equal, and that the base frequencies are all equal to 1/4. Under these assumptions it is easy to derive the correspondence between the expected number of substitutions (i.e. the evolutionary distance) and the fraction of differing bases in the alignment (i.e. the normalized Hamming distance). Thus, given two aligned sequences x, y we will refer to the quantity JC(x, y)=-\frac{3}{4}\ln\left(1-\frac{4}{3}H(x, y)\right), where H(x, y) is the normalized Hamming distance, as the Jukes-Cantor distance between x and y.
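As a quick sanity check, a few lines computing this quantity (the two sequences are B and C from the example set below, which differ in 3 out of 70 sites):

```python
import math

def jc_distance(x, y):
    """Jukes-Cantor distance from the normalized Hamming distance H(x, y)."""
    h = sum(a != b for a, b in zip(x, y)) / len(x)
    return -0.75 * math.log(1 - 4 * h / 3)

b = "ATGTTTGATAATGGACCCCAAATTCAGCGAAATGCACGCCGCATTACGTTCGGTGGACCCTCAGATTCCA"
c = "ATGTTTGATAATGGACCCCAAATTCAGCGAAATGCACCCTGCATTACGTTCGGTGGACCCTCAGATTCGA"
print(round(jc_distance(b, c), 3))   # 0.044, matching the distance matrix below
```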

Mash distance

MinHash sketching is an effective computational technique for estimating the similarity of two sets. Due to its computational efficiency MinHash has found its way into many data-intensive areas of computer science, including bioinformatics. In particular, it is common to view sequences as sets (or multisets) of k-mers (substrings of length k, think k-grams in NLP), and hence we can compare the (Jaccard) similarity of two sequences via MinHash on the respective k-mers. However, upon closer inspection it is clear that the Jaccard distance is not a good proxy for the evolutionary distance (similarly to how the normalized Hamming distance is not exactly the same thing as the Jukes-Cantor distance). Thus, we want to compute an appropriate adjustment that turns our estimated Jaccard index into an approximate evolutionary metric. One possible approach is given by Ondov et al., 2016, Eq. 4, which defines the Mash distance between two sequences (or sets of sequences) as D=-\frac{1}{k}\ln\frac{2j}{1+j}, where k is the k-mer length and j is the estimated Jaccard index.
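A sketch of the corresponding computation (helper names are my own; for brevity we use the exact Jaccard index of the full k-mer sets rather than a MinHash estimate, which is what Mash itself does):

```python
import math

def kmers(seq, k):
    """The set of all substrings of length k of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mash_distance(x, y, k):
    """Mash distance (Ondov et al., 2016, Eq. 4) from the exact Jaccard index
    of the k-mer sets; assumes the sets overlap (j > 0)."""
    a, b = kmers(x, k), kmers(y, k)
    j = len(a & b) / len(a | b)
    return -math.log(2 * j / (1 + j)) / k

print(mash_distance("ACGTACGTAC", "ACGTACGTAC", 3))   # 0.0 for identical sequences
```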

Comparing UPGMA and Neighbor-joining for JC distance

We will begin with a set of 5 sequences each consisting of 70 nucleotides and containing 5 single nucleotide changes with respect to the first 70 nucleotides of the N gene of the reference genome for SARS-CoV-2.

Reference:
ATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAA
Sequences:
ATTGCTGATACTGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGAACCTCAGATCCAA
ATGTTTGATAATGGACCCCAAATTCAGCGAAATGCACGCCGCATTACGTTCGGTGGACCCTCAGATTCCA
ATGTTTGATAATGGACCCCAAATTCAGCGAAATGCACCCTGCATTACGTTCGGTGGACCCTCAGATTCGA
ATGTCCGATAATGGACCCCAAATTCAGCGAAATGCACTCCGCATTACGTTTGATGGACCCTCAGATTCCA
ATGTCCGATAATGGACCCCAAATTCAGCGAAATGCACCCGGCATTACGTTTGATGGACCCTCAGATCCAA

We will label the sequences in this set A, B, C, D, and E in top to bottom order. We get the following matrix of Jukes-Cantor distances for this set:

      A      B      C      D      E
A     0      0.158  0.158  0.158  0.124
B     0.158  0      0.044  0.075  0.124
C     0.158  0.044  0      0.107  0.107
D     0.158  0.075  0.107  0      0.059
E     0.124  0.124  0.107  0.059  0

It is easy to check that this matrix is not additive, and hence is also not an ultrametric. Thus, we do not expect either of the algorithms to yield an optimal tree. Nevertheless we would like to compare the two topologies we get. First we will consider the UPGMA tree:

We can see that sequence A ends up acting as an outgroup, which is not surprising if we return and examine the mutation content within the sequence. Sequences B and C, and D and E, form two clades in the UPGMA tree. Examining the alignment by eye, this appears to be a plausible grouping. However, we note that the distance between sequences B and D is estimated to be smaller than that between B and E, while the inferred tree does not capture this difference.

Next, we will consider the tree obtained from the neighbor-joining method.


Similarly to the UPGMA tree, sequences B and C end up in the same clade in the neighbor-joining tree. Analogously, sequence A appears to form an outgroup. The key difference in this case is the way in which sequences D and E are placed onto the tree. However, the neighbor-joining tree also fails to capture some of the differences in the distances; for example, sequence E is equidistant from B and C in the tree.

Comparing UPGMA and Neighbor-joining for Mash distance

We will start out with the same sequence setup as in the previous run, except instead of the Jukes-Cantor distances we will use Mash distances this time. We are setting the k-mer size to 7 for this scenario, and we obtain the following distance matrix:

      A      B      C      D      E
A     0      0.158  0.158  0.144  0.121
B     0.158  0      0.027  0.057  0.079
C     0.158  0.027  0      0.071  0.071
D     0.144  0.057  0.071  0      0.034
E     0.121  0.079  0.071  0.034  0

Unsurprisingly this matrix is not additive either, and the distance values tend to correlate with the Jukes-Cantor distances. Thus, as expected, both inferred trees have the same topologies as in the previous section, albeit with slightly different branch lengths.


Now, we will try to perform a similar comparison but with a different underlying dataset. We downloaded 454 complete SARS-CoV-2 genomes collected between August 1st, 2022 and August 15th, 2022 from the NCBI Virus database. We then computed the Mash distances between all these sequences using 11-, 17-, 23-, and 31-mers. Finally, for each distance matrix we inferred the phylogenetic tree using the UPGMA and neighbor-joining methods.

In the above dendrogram we truncated subtrees after a fixed depth level for ease of visualization; the labels, when present, indicate a leaf node with the corresponding PANGO lineage assigned to it. One thing that we immediately notice in this case is that BA.2, BA.5 (and BA.4, labels not shown) form a clade, and B.1.617.2, AY.* form a clade. This matches the expected behavior, since the first clade corresponds to the BA.2-related sublineages of the Omicron variant, and the second clade corresponds to the Delta variant. What is slightly surprising is that BA.1 (also a sublineage of Omicron) ends up in a clade with the B.1.616 and B.1 lineages, rather than next to BA.2. We also inferred a neighbor-joining based tree and adopted a leaf coloring scheme similar to the one in UPGMA.


In this tree the deeper shade of orange represents BA.5 and its sublineages, while the lighter shade of orange corresponds to the BA.2 and BA.4 sublineages of Omicron. Green is used for Delta, and red for everything else. One thing that becomes more apparent with this tree is that there is a subclade consisting predominantly of BA.5 sequences that sits at a considerable distance from the rest of the sequences in the tree.

Another interesting observation from this experiment is that the k-mer size used did not drastically influence the resulting tree shape. While there are some minor changes as we increase k, the main clades centered around BA.5 and Delta emerge in all inferred trees. This suggests a practical observation: with a large number of closely related sequences, we can apply very approximate methods (since we stacked Mash distance with UPGMA we effectively “approximated” twice) and still capture some underlying structure in the data.

Data and code availability

In addition to the two Gists shared above you can also find the Jupyter Notebook with the code used to generate figures for this post, as well as sequence data and precomputed Mash distances in the archive attached below.

What’s next

We very briefly touched upon some of the distance-based phylogenetic inference methods and checked out several trees inferred from two evolutionary distance metrics. However, we haven’t rigorously defined how exactly two tree topologies and two phylogenetic trees can be compared. We will dive into that topic over the course of the next few posts.

References

Beyer, William A., Myron L. Stein, Temple F. Smith, and Stanislaw M. Ulam. “A molecular sequence metric and evolutionary trees.” Mathematical Biosciences 19, no. 1-2 (1974): 9-25. https://doi.org/10.1016/0025-5564(74)90028-5

Cavalli-Sforza, L.L. and Edwards, A.W.F. (1967), Phylogenetic analysis: Models and estimation procedures. Evolution, 21: 550-570. https://doi.org/10.1111/j.1558-5646.1967.tb03411.x

Felsenstein, Joseph. Inferring phylogenies. Vol. 2. Sunderland, MA: Sinauer Associates, 2004.

Fitch, Walter M., and Emanuel Margoliash. “Construction of phylogenetic trees.” Science 155, no. 3760 (1967): 279-284. https://doi.org/10.1126/science.155.3760.279

Gascuel, Olivier, and Mike Steel. “Neighbor-joining revealed.” Molecular biology and evolution 23, no. 11 (2006): 1997-2000. https://doi.org/10.1093/molbev/msl072

Haeckel, E. H. P. A. (1866). Generelle Morphologie der Organismen: allgemeine Grundzüge der organischen Formen-Wissenschaft, mechanisch begründet durch die von C. Darwin reformirte Descendenz-Theorie. Berlin. [Featured image]

Huson, Daniel H., Regula Rupp, and Celine Scornavacca. Phylogenetic networks: concepts, algorithms and applications. Cambridge University Press, 2010.

Ondov, Brian D., Todd J. Treangen, Páll Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren, and Adam M. Phillippy. “Mash: fast genome and metagenome distance estimation using MinHash.” Genome biology 17, no. 1 (2016): 1-14. https://doi.org/10.1186/s13059-016-0997-x

Saitou, Naruya, and Masatoshi Nei. “The neighbor-joining method: a new method for reconstructing phylogenetic trees.” Molecular biology and evolution 4, no. 4 (1987): 406-425. https://doi.org/10.1093/oxfordjournals.molbev.a040454

Sets

General definitions

From Wikipedia:

In mathematics, a set is a collection of distinct objects, considered as an object in its own right. For example, the numbers 2, 4, and 6 are distinct objects when considered separately, but when they are considered collectively they form a single set of size three, written {2, 4, 6}. The concept of a set is one of the most fundamental in mathematics. Developed at the end of the 19th century, set theory is now a ubiquitous part of mathematics, and can be used as a foundation from which nearly all of mathematics can be derived. In mathematics education, elementary topics from set theory such as Venn diagrams are taught at a young age, while more advanced concepts are taught as part of a university degree.

To break this down into simpler terms there are two important aspects of what constitutes a set:

  1. A set is a collection of distinct objects.
  2. A set itself constitutes an object, i.e. we can think of it as a tangible collection.

An example of a set can be pizza offerings at Giordano’s (a pizzeria in Chicago). This set contains distinct elements: Pepperoni pizza, Supreme pizza, Goat cheese and spinach pizza, Italian sausage pizza, Margherita pizza; and is in itself an object: a pizza menu.

The code below illustrates how we can declare a set in Python.

my_set = {0,3,4,0,7,9,13,35,0}
print(my_set)
{0, 3, 4, 35, 7, 9, 13}

We can see that, even though we declared some non-distinct (i.e. repeated) elements, the resulting set does not contain duplicates, as evidenced by the print() output.

Set Membership and Subsets

Given an object and a set we can test whether this object belongs to the given set. This is a check for set membership. We can also verify if an object does not belong to a set.

Given a set $A$ and an object $x$, we use the notation $x\in A$ to denote that $x$ is an element of $A$. We also use notation $x\notin A$ to denote that $x$ is not an element of $A$.

The code below illustrates how we can test these conditions in Python.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
print(3 in A)                         # Will print True, because 3 is an element of A.
print(6 in A)                         # Will print False, because 6 is not an element of A.
print(7 not in A)                     # Will print False, because 7 is an element of A.    (note the use of "not")
print(8 not in A)                     # Will print True, because 8 is not an element of A. (note the use of "not")
True
False
False
True

Another important relation is that of being a subset. If membership is a relation between an object and a set, then being a subset is a relation between two sets. Namely we say that $B$ is a subset of $A$, denoted $B\subseteq A$, if every element of $B$ is also an element of $A$. We also will say that $B$ is a proper subset of $A$, denoted $B\subset A$, if every element of $B$ is also an element of $A$, but there are elements in $A$ that are not in $B$.

The code below illustrates how we can test these relations in Python, and provides some examples of subsets and proper subsets.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 3, 5, 7}                      # B is the set of prime numbers less than 10.

print(B.issubset(A))                  # Check if B is a subset of A. Will print True.
print(B <= A)                         # Check if B is a subset of A. Will print True.
print(B < A)                          # Check if B is a proper subset of A. Will print True,
                                      # since all elements of B are in A (subset condition),
                                      # but 11 is in A, and not in B (proper condition).
        
print(A.issubset(A))                  # Check if A is a subset of A. Will print True. 
print(A <= A)                         # Check if A is a subset of A. Will print True.
print(A < A)                          # Check if A is a proper subset of A. Will print False,
                                      # since all elements of A are in A.
                                      # Note: a set is always a subset of itself.
True
True
True
True
True
False

Set Operations

Now, let us take a look at some common set operations. As many things in mathematics, these concepts can become more natural if visualized. Hence, let us briefly introduce the idea of Venn diagrams.

A Venn diagram is a schematic representation of a set and its possible relations with other sets. We usually use (possibly misshapen) circles to denote the “set” and colors or the elements themselves to denote the elements of this set. The few examples below illustrate this idea.

Venn diagram of letters

Venn's four ellipses diagram

Set Union

The first set operation we will look at is the set union. We can think of it as addition for sets. The result of a set union is the set containing the elements that appear in either of the sets. The following Venn diagram shows in red the union of sets $A$ and $B$, denoted $A\cup B$.

A union B

We can compute a union of two sets in Python by using the union method or by using | operation on sets. The code below illustrates this.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A | B
print(C)
{2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19}
A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A.union(B)
print(C)
{2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19}
# Note that any possible "overlapping" elements will be only accounted for once,
# and thus the result will be a set (elements are distinct). This can be seen from
# the Venn diagram (the intersecting region is covered once) and from the example
# below.
A = {4, 5, 6, 7, 8, 9}
B = {6, 7, 8, 9, 10, 11}
C = A | B
print(C)
{4, 5, 6, 7, 8, 9, 10, 11}

Set Intersection

The next operation is set intersection. A set intersection is a set (possibly an empty one) that contains elements that appear in both sets. In other words, intersection is the overlap of the original sets. The following diagram shows the intersection of sets $A$ and $B$, denoted $A\cap B$.

A intersection B

We can compute an intersection of two sets in Python by using the intersection method or by using & operation on sets. The code below illustrates this.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A & B
print(C)
{2}
A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A.intersection(B)
print(C)
{2}
# Only the elements present in BOTH sets get into the intersection. Thus in some
# cases the intersection can be empty. A Venn diagram for this case would be two
# non-overlapping circles.
A = {4, 5, 6, 7, 8, 9}
B = {10, 11, 12, 13, 14}
C = A & B
print(C)
set()

Set Difference

The next operation we will look at is the set difference. It is often useful to know which elements belong to one set, but not the other. The set difference is the set that contains the elements from the first set, but not the second one. The following diagram shows the difference of sets $A$ and $B$, denoted $A - B$ or $A\setminus B$.

A difference B

We can compute a difference between two sets in Python by using the difference method or by using - operation on sets. The code below illustrates this.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A - B
print(C)
{3, 5, 7, 11, 13, 17, 19}
A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A.difference(B)
print(C)
{3, 5, 7, 11, 13, 17, 19}
# Note that just like the difference of two numbers depends on the order, the difference
# of two sets also depends on which one we want to subtract from. The example below
# illustrates this idea.
A = {4, 5, 6, 7, 8, 9}
B = {7, 8, 9, 10, 11}
C = A - B
D = B - A
print("A - B is {}".format(C))
print("B - A is {}".format(D))
A - B is {4, 5, 6}
B - A is {10, 11}

Set Symmetric Difference

The last set operation we will talk about is the symmetric difference. There are several ways to think about the symmetric difference, but all of them encapsulate the same idea: we want a set whose elements appear in either $A$ or $B$, but not in both. Using the notation defined above we can write this as $(A\cup B) - (A\cap B)$ (the union of the sets minus their intersection) or alternatively as $(A - B) \cup (B - A)$ ($A$ without $B$ union $B$ without $A$). The following diagram shows the symmetric difference of sets $A$ and $B$, denoted $A \Delta B$.

A symmetric difference B

We can compute the symmetric difference between two sets in Python by using the symmetric_difference method or by using ^ operation on sets. The code below illustrates this.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A ^ B
print(C)
{3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19}
A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A.symmetric_difference(B)
print(C)
{3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19}
# The symmetric difference, unlike the regular set difference is symmetric. 
# Which means that the order of sets does not matter, the result will be
# the same, as illustrated by code below.
A = {4, 5, 6, 7, 8, 9}
B = {6, 7, 8, 9, 10, 11}
C = A ^ B
D = B ^ A
print(C)
print(D)
{4, 5, 10, 11}
{4, 5, 10, 11}

Additional remarks

Just like arithmetic operations are defined on two numbers but extend to lengthier expressions, set operations can be applied to multiple sets. In Python the easiest way to express this is using the set operators |, &, -, ^ and appropriate parentheses () to group the operations. The examples below illustrate this idea.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = {3, 5, 7, 9, 11, 13, 15, 17, 19}  # C is the set of odd numbers > 1 and < 20.
D = {3, 6, 9, 12, 15, 18}             # D is the set of numbers divisible by 3, > 0 and < 20.
# This set is the union of C and D minus A. 
# Effectively it will contain numbers divisible by 3 or odd, that lie between 1 and 20,
# but will not contain the numbers that are prime.
E = (C | D) - A 
print(E)
{6, 9, 12, 15, 18}
# This set is the symmetric difference of A, B and the difference between
# C and D.
# A repeated symmetric difference keeps the elements that appear in an odd
# number of the sets. In particular, F will not contain numbers that appear
# in exactly two of the three sets (e.g. 11, which is both prime and odd
# without being divisible by 3).
F = A ^ B ^ (C - D)
print(F)
{3, 4, 6, 8, 10, 12, 14, 16, 18}

Use Cases

We went through the trouble of learning the definitions for basic set arithmetic (the Operations section) and membership and subset relations (the Membership and Subsets section), so now is a good time to present some use cases for these structures and operations.

Besides being an essential building block in modern mathematics, sets often present a highly convenient data structure in programming. The examples below will guide you through some useful applications of sets in programming. Some of these examples are inspired by real production code.

Filtering down unwieldy lists

Sometimes we are faced with the problem of filtering a rather large list to only show unique values. A few common examples include the following:

  1. Identifying the unique caller IDs for a large list of phone calls.
  2. Identifying categories of the items carried by a store from the full inventory list.

Below we will address both of these problems by leveraging the fact that a set stores only distinct elements, so converting a list to a set keeps exactly the unique values.

# Problem 1.
# ----------
# Write a function that takes in a list of phone numbers (as strings),
# and returns a list containing the unique phone numbers from the original
# list.
#
# Input: list of phone numbers.
# 
# Output: list of unique phone numbers.
def phone_id_unique(numbers):
    unique_numbers_set = set(numbers)
    unique_numbers_list = list(unique_numbers_set)
    return unique_numbers_list
# Problem 1.
# ----------
# Tests:
#
# 1. Input:  ["800-000-0000" repeated 1 000 000 times]
#    Output: ["800-000-0000"]
test_input = ["800-000-0000"] * 1000000
print(phone_id_unique(test_input))

# 2. Input:  ["800-100-0000" repeated 1 000 000 times, "800-200-0000" repeated 1 000 000 times, ...,
#             "800-900-0000" repeated 1 000 000 times]
#    Output: ["800-100-0000", "800-200-0000", ..., "800-900-0000"] (in some order)
test_input = []
for i in range(1, 10):
    test_input = test_input + ["800-{}00-0000".format(i)] * 1000000
print(phone_id_unique(test_input))

# 3. Input: ["800-000-0000", "800-010-0000", "800-020-0000", "800-030-0000"]
#    Output: ["800-000-0000", "800-010-0000", "800-020-0000", "800-030-0000"]
test_input = ["800-000-0000", "800-010-0000", "800-020-0000", "800-030-0000"]
print(phone_id_unique(test_input))
['800-000-0000']
['800-200-0000', '800-500-0000', '800-700-0000', '800-300-0000', '800-600-0000', '800-400-0000', '800-800-0000', '800-100-0000', '800-900-0000']
['800-020-0000', '800-000-0000', '800-010-0000', '800-030-0000']
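Note that the outputs above come back in arbitrary order: converting a list to a set discards the original ordering. If the order of first appearance matters, one common alternative (sketched below as a hypothetical variant, not part of the function above) is `dict.fromkeys`, relying on the fact that dictionaries preserve insertion order since Python 3.7.

```python
# Hypothetical variant of phone_id_unique that keeps first-appearance order.
# dict.fromkeys builds a dict whose keys are the unique numbers, in the
# order they were first seen; listing the keys gives an ordered dedup.
def phone_id_unique_ordered(numbers):
    return list(dict.fromkeys(numbers))

print(phone_id_unique_ordered(["800-000-0000", "800-010-0000", "800-000-0000"]))
# ['800-000-0000', '800-010-0000']
```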
# Problem 2.
# ----------
# Write a function that takes in a list of products carried by a store
# (as dictionaries), and returns a list containing the product categories
# that appear in the original list.
#
# Input: list of items.
# 
# Output: list of product categories.
def product_categories(items):
    categories_set = set([item["category"] for item in items])
    categories = list(categories_set)
    return categories
# Before testing we will load some data from a .csv file. The file should be placed
# in the same directory as the notebook. CSV stands for comma-separated values, and
# is a common plain-text format for tabular data.
items = []

import csv
with open("data_produce.csv", "r") as f:
    reader = csv.reader(f, delimiter=",")
    for line in reader:
        items.append({"id": line[0], 
                      "category": line[1],
                      "stock": line[2],
                      "price": line[3]})

# Tests:
#
# 1. Input:  [1000000 items from 8 categories]
#    Output: ["perishables", "water", "kitchen", "furniture", "electronics", "paper", "pantry", "misc"]
print(product_categories(items))
[' pantry', ' furniture', ' water', ' kitchen', ' perishables', ' paper', ' electronics', ' misc']
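Notice the leading spaces in the output (e.g. ' pantry'): the CSV file apparently has a space after each comma, and csv.reader keeps it as part of the field. One way to clean this up, assuming each category is a string, is to strip whitespace while building the set (a hypothetical variant is sketched below; csv.reader also accepts skipinitialspace=True to handle this at read time).

```python
# Hypothetical variant of product_categories that trims stray whitespace
# around each category before deduplicating.
def product_categories_clean(items):
    # A set comprehension dedups and .strip() removes leading/trailing spaces.
    return list({item["category"].strip() for item in items})

items = [{"category": " pantry"}, {"category": "pantry"}, {"category": " water"}]
print(sorted(product_categories_clean(items)))
# ['pantry', 'water']
```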

Implementing common logical operations

Mathematical logic and set arithmetic are tightly connected. This allows us to use set arithmetic to model common logical operations, which in turn can encapsulate some everyday tasks we want to perform with our data.

Set union is analogous to the logical OR operation, set intersection to logical AND, and symmetric difference to logical XOR (exclusive OR). Thus we can translate common tasks into set operations. Let us look at some examples below.
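This correspondence can be checked directly on membership: for any element x, membership in a union, intersection, or symmetric difference reduces to OR, AND, or XOR of the individual membership tests. A small sanity check, reusing the sets A and B from earlier (redefined here so the snippet is self-contained):

```python
A = {2, 3, 5, 7, 11, 13, 17, 19}      # primes less than 20
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # even numbers > 0 and < 20

for x in range(20):
    assert (x in A | B) == ((x in A) or (x in B))   # union <-> OR
    assert (x in A & B) == ((x in A) and (x in B))  # intersection <-> AND
    assert (x in A ^ B) == ((x in A) != (x in B))   # symmetric difference <-> XOR
print("all correspondences hold")
```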

# Problem 3.
# ----------
# Write a function that takes in a set of items on mom's shopping list,
# a set of items on dad's shopping list, a set of items already bought by
# mom, a set of items already bought by dad, and finally a set of items
# that are currently in the fridge. The output should be a consolidated
# shopping list, i.e. it should only include the items that are not in the
# fridge and are not yet bought. 
#
# Input: 5 sets of items as described above.
# 
# Output: list of items that need to be procured.
def shopping_list_cons(mom_to_buy, dad_to_buy,
                       mom_bought, dad_bought,
                       in_fridge):
    all_to_buy = (mom_to_buy | dad_to_buy)
    all_bought = (mom_bought | dad_bought)
    to_buy = list(all_to_buy - all_bought - in_fridge)
    return to_buy
# Problem 3.
# ----------
# Tests:
#
# 1. Input:  mom_to_buy = {"apples", "candy", "chicken", "beef"}
#            dad_to_buy = {"candy", "beef", "bread", "cola"}
#            mom_bought = {"bread", "biscuits"}
#            dad_bought = {"milk", "coffee"}
#            in_fridge  = {"eggs", "chicken"}
#    Output: ["apples", "candy", "beef", "cola"] (in some order)
mom_to_buy = {"apples", "candy", "chicken", "beef"}
dad_to_buy = {"candy", "beef", "bread", "cola"}
mom_bought = {"bread", "biscuits"}
dad_bought = {"milk", "coffee"}
in_fridge  = {"eggs", "chicken"}
print(shopping_list_cons(mom_to_buy, dad_to_buy, mom_bought, dad_bought, in_fridge))
['apples', 'candy', 'beef', 'cola']

Here we take advantage of some set operations to solve the problem. First, we consolidate both “to buy” lists taking their union, thus ensuring that all items are accounted for and none are double counted. Then we consolidate the already bought items by taking another union. Finally, we take the difference between what we need to buy and what is already bought or already in the fridge.

In the next problem we will take a look at an application of the symmetric difference.

# Problem 4.
# ----------
# Students at Chicken Soup High-school are offered two
# options for Calculus classes. There is an "Intro to Calculus"
# class and "Calculus" class. Some students take only 
# the first class through their time in high-school,
# some only take the second class, by placing out of the
# first one, and finally some students take both classes
# as a sequence.
# At the end of each year, an average for these classes'
# performance is computed to evaluate the effectiveness
# of the instructors. The average is computed according
# to a strange formula, because the statistics and
# performance department of Chicken Soup High loves
# hard-to-understand numbers.
# You are provided with the formula and the list of
# students in each class together with their grades. To
# protect the students' privacy you are given unique
# StudentIDs.
#
# AVG = (avg grade over all grades of students who took
#        exactly one of the two classes) +
#     + 1.75 * (avg grade over all grades, in both classes,
#        of students who took both)
#
# Write a function that takes in two dictionaries of 
# StudentIDs and grades and computes the average according
# to the given formula.
#
# Input: a dictionary for students who took "Intro to Calculus",
#        a dictionary for students who took "Calculus".
# 
# Output: the average grade.
import numpy as np
def chicken_soup_high_avg(intro, calc):
    # Students who took exactly one of the two classes.
    one_course = set(intro.keys()) ^ set(calc.keys())
    # Students who took both classes.
    both_courses = set(intro.keys()) & set(calc.keys())
    one_grades = []
    both_grades = []
    for studentID in one_course:
        # By construction each of these students appears
        # in exactly one of the two dictionaries.
        if studentID in intro:
            one_grades.append(intro[studentID])
        else:
            one_grades.append(calc[studentID])

    for studentID in both_courses:
        both_grades.append(intro[studentID])
        both_grades.append(calc[studentID])

    avg = np.mean(one_grades) + 1.75 * np.mean(both_grades)

    return avg
# Problem 4.
# ----------
# Tests:
#
# 1. Input: {"1": 4.0, "2": 3.75, "3": 3.4},
#           {"2": 3.8, "3": 3.0, "4": 4.0}
#    Output: 10.10
print("{:.2f}".format(chicken_soup_high_avg({"1": 4.0, "2": 3.75, "3": 3.4},
                                            {"2": 3.8, "3": 3.0, "4": 4.0})))
10.10

Further remarks

While not strictly necessary, sets can make several classic graph algorithms easier to write and explain. We will cover those in the graphs section.
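As a teaser, here is one place a set typically shows up in graph code: the visited set in a breadth-first search, where the O(1) membership test keeps the traversal simple. This is only a sketch over a made-up adjacency-list graph, not the implementation we will develop in that section.

```python
from collections import deque

def bfs_order(graph, start):
    # graph: dict mapping each node to a list of its neighbors.
    visited = {start}      # set gives O(1) "have we seen this node?" checks
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
print(bfs_order(graph, "a"))
# ['a', 'b', 'c', 'd']
```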

Downloads