Phylogenetic Vignettes: UPGMA and Neighbor-joining

In an effort to learn more about phylogenetics and practice my writing skills I am starting a mini-series of phylogenetic vignettes, in which I will explore different concepts in phylogenetic analysis and their implementations. This series is intended as a partial account of my journey in learning some of the topics in phylogenetics. The resulting posts are not intended to be complete explorations of the corresponding questions, and will be presented at the author's current level of understanding, which may be incomplete. The intended audience for these vignettes is undergraduate and graduate students with a solid understanding of basic concepts in discrete mathematics, as well as some knowledge of analysis and probability. I will assume no deep knowledge of evolutionary biology, in part because my own background lacks a strong foundation in biology.

Preliminaries

We will follow the notation adopted in Huson, Rupp, and Scornavacca, 2010.

\mathcal{X}: a finite set of taxa (for non-bio folks: a set of labels)

T: an unrooted phylogenetic tree, i.e. an undirected connected graph G=(V, E) with no cycles and with \mathrm{deg}(v)\neq 2 for all v\in V, together with a leaf labeling \lambda:\mathcal{X}\to V that assigns exactly one taxon to each leaf and none to any internal node

\omega: a weight assignment for a given phylogenetic tree T=(V, E, \mathcal{X}, \lambda), i.e. a map \omega: E\to \mathbb{R} (in practice we are interested only in the case \omega: E\to (0,\infty))

D: a pairwise distance matrix for the set of genomic sequences corresponding to the set of taxa \mathcal{X}

We will also use the term molecular clock to refer to situations in which the rate of mutation per unit of time is fixed. The molecular clock hypothesis is a useful assumption when inferring trees for sequences that were collected at the same point in time: in that scenario we can assume that in the rooted phylogenetic tree the distance from the root \rho to every leaf is the same.

Distance based tree inference

In general, the phylogenetic tree inference problem consists of inferring an optimal rooted or unrooted phylogenetic tree T given a set of sequences representing the taxa \mathcal{X}. Here by optimal we mean optimal with respect to some evolutionary model over the full parameter space defined by the topology of T and the weights \omega.

The exact criterion with respect to which the optimization is done varies between formulations of the problem. However, one thing common to all formulations is the prohibitively large size of the search space. A simple calculation shows that the number of bifurcating trees (i.e. trees in which every non-leaf node has exactly two children, or alternatively every non-leaf node has degree 3) on n taxa is (2n-3)!!.
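To get a feel for how fast this grows, here is a quick sketch (plain Python, function name my own) tabulating (2n-3)!! for a few values of n:

```python
def num_bifurcating_trees(n):
    """Number of unrooted bifurcating tree topologies on n >= 3 taxa: (2n-3)!!"""
    result = 1
    for k in range(2 * n - 3, 1, -2):   # product of odd numbers 3, 5, ..., 2n-3
        result *= k
    return result

for n in [3, 5, 10, 20]:
    print(n, num_bifurcating_trees(n))
```

Already at 10 taxa there are over 34 million topologies, which makes exhaustive search hopeless for realistic inputs.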

In the case of distance based phylogenetic tree inference our input is the n\times n pairwise distance matrix D. We will explore different options for measuring distance between input sequences in a later section.

Least squares

We can start with an attempt at defining an optimization goal for our tree inference problem given the input distance matrix D. Intuitively the goal here is to find a tree that best approximates (explains) the observed distances. More formally, given a phylogenetic tree T with edge weights \omega we can define the tree-induced metric on the set of taxa \mathcal{X} as d_T(x, y)=\sum_{e\in x\sim y}\omega(e), where we use the notation x\sim y to denote the path between leaves x, y of the tree (in a tree any two nodes are connected by a single path). Thus, we want to minimize the error \parallel D-D_T\parallel under some norm. One possible choice of the norm is the Frobenius norm, which in turn corresponds to unweighted least squares tree inference. As noted in Felsenstein, 2004 the least squares formulation can be weighted, i.e. the function to minimize becomes \sum_{i=1}^n\sum_{j=1}^nw_{ij}(D_{ij}-{D_T}_{ij})^2 with the choice w_{ij}=1 proposed by Cavalli-Sforza and Edwards, 1967, w_{ij}=1/D^2_{ij} by Fitch and Margoliash, 1967, and w_{ij}=1/D_{ij} by Beyer et al., 1974.
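For a fixed tree (i.e. fixed D_T) these objectives are straightforward to evaluate; the following sketch (function and scheme names are my own, not from the cited papers) computes the weighted least-squares error for the three weighting choices above:

```python
import numpy as np

def ls_error(D, D_T, scheme="cavalli"):
    """Weighted least-squares error sum over i < j of w_ij (D_ij - D_T_ij)^2.

    scheme: 'cavalli' -> w_ij = 1 (unweighted, matches the Frobenius norm up to
    a factor of 2), 'fitch' -> w_ij = 1/D_ij^2, 'beyer' -> w_ij = 1/D_ij.
    """
    D, D_T = np.asarray(D, float), np.asarray(D_T, float)
    i, j = np.triu_indices(len(D), k=1)          # off-diagonal pairs i < j
    d, d_t = D[i, j], D_T[i, j]
    w = {"cavalli": np.ones_like(d), "fitch": 1.0 / d**2, "beyer": 1.0 / d}[scheme]
    return float(np.sum(w * (d - d_t) ** 2))
```

For example, with observed distance 2 and tree-induced distance 3 between a single pair of taxa, the three schemes give errors 1, 1/4, and 1/2 respectively.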

While for a fixed tree topology the problem boils down to least squares optimization, the general problem of tree inference under this model requires searching over the large space of trees. Thus, while the formulation behind this method is sound and appealing, in practice we tend to rely on heuristics when searching the tree space.

Instead of fully exploring least squares method and its derivatives and search heuristics we will now pivot to a different approach in which instead of explicit optimization under a given metric we will instead focus on direct clustering goal.

UPGMA

The first approach we will explore is called Unweighted Pair Group Method with Arithmetic Mean (UPGMA). For those familiar with clustering methods, and hierarchical clustering in particular, this name might ring a bell. It appears that the name UPGMA mostly stuck with the phylogenetics community, and in the general clustering world this approach is referred to as agglomerative (i.e. bottom-up) clustering with average linkage (scikit-learn reference, scipy reference).

Implementing UPGMA naively is straightforward and we will illustrate it in the following code snippet with the distance matrix data coming from Felsenstein, 2004 (pp. 162-166).

(Code gist: a naive UPGMA implementation.)
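A minimal naive implementation might look as follows (this is a sketch of my own, not the original gist: clusters are kept as nested tuples, distances are averaged with size weights, and only the topology is returned, without branch lengths):

```python
def upgma(D, labels):
    """Naive UPGMA: repeatedly merge the closest pair of clusters, averaging
    distances weighted by cluster sizes. D is a symmetric distance matrix
    (list of lists); returns the tree topology as nested tuples."""
    clusters = {i: (lbl, 1) for i, lbl in enumerate(labels)}  # id -> (subtree, size)
    dist = {(i, j): D[i][j]
            for i in range(len(labels)) for j in range(i + 1, len(labels))}
    next_id = len(labels)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)                        # closest pair
        del dist[(i, j)]
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        for k in list(clusters):                              # weighted average linkage
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = (ni * d_ik + nj * d_jk) / (ni + nj)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    return next(iter(clusters.values()))[0]

print(upgma([[0, 2, 6, 6], [2, 0, 6, 6], [6, 6, 0, 4], [6, 6, 4, 0]],
            list("ABCD")))   # (('A', 'B'), ('C', 'D'))
```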

Note that the above implementation doesn't keep track of the branch lengths when producing the output, and as a result our tree has the correct topology but misses the edge weight information. Alternatively, we can simply use the SciPy hierarchical clustering module and achieve the same result, including the branch lengths, in fewer lines of code.

(Code gist: UPGMA via SciPy hierarchical clustering.)
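The SciPy route can be sketched as follows (since the Felsenstein matrix is not reproduced here, we use the 5-taxon example matrix from Huson, Rupp, and Scornavacca that appears below as a stand-in):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0, 2, 7, 5, 4],
              [2, 0, 7, 5, 4],
              [7, 7, 0, 4, 5],
              [5, 5, 4, 0, 3],
              [4, 4, 5, 3, 0]], dtype=float)

# 'average' linkage on the condensed form of D is exactly UPGMA; each row of Z
# records the two merged cluster ids, the merge height, and the new cluster size.
Z = linkage(squareform(D), method="average")
print(Z)
# scipy.cluster.hierarchy.dendrogram(Z, labels=list("ABCDE")) draws the tree
# with branch lengths (requires matplotlib).
```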

One thing that becomes apparent from the second plot is that the resulting tree is an ultrametric tree.

In general we call a metric space (M, d) an ultrametric space if, in addition to the standard properties, the metric d satisfies the strong triangle inequality d(x, y)\leq\mathrm{max}\{d(x, z), d(y, z)\} for all x, y, z\in M. Ultrametric spaces are somewhat bizarre looking, since every triangle in such a space is isosceles. Furthermore, every point inside an open ball of positive radius is its center, and all open balls of positive radius are clopen. However, ultrametric spaces arise naturally in several settings, with the one of interest to us being phylogenetic trees under the molecular clock hypothesis. Such trees are commonly referred to as ultrametric trees.
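To make the condition concrete, here is a small sketch (a helper of my own, not from the book) that tests whether a finite metric given by a matrix is an ultrametric, using the equivalent characterization that in every triple the two largest pairwise distances must coincide:

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-9):
    """A finite metric is an ultrametric iff in every triple the two largest
    of the three pairwise distances are equal (every triangle is isosceles,
    with the unequal side being the shortest)."""
    for x, y, z in combinations(range(len(D)), 3):
        a, b, c = sorted([D[x][y], D[x][z], D[y][z]])
        if abs(c - b) > tol:        # two largest must coincide
            return False
    return True

print(is_ultrametric([[0, 2, 4], [2, 0, 4], [4, 4, 0]]))   # True
```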

Now, one of the interesting properties of the UPGMA algorithm is that the tree constructed by this method will always be ultrametric. However, this also implies that if the original distance matrix does not induce an ultrametric on \mathcal{X} then the tree produced by the UPGMA algorithm will have a non-zero error \parallel D-D_T\parallel. For example, consider the following sample distance matrix given as an example in Huson, Rupp, and Scornavacca, 2010 (p. 53):

      A    B    C    D    E
A     0    2    7    5    4
B     2    0    7    5    4
C     7    7    0    4    5
D     5    5    4    0    3
E     4    4    5    3    0

The inferred UPGMA tree will look as follows:

Due to the metric induced by the tree, the distances between C and D, and between C and E, will be equal in T, while we know from the original distance matrix that the distance from C to D is supposed to be larger. Thus, the UPGMA model is not broadly applicable, since it requires the original distance matrix to satisfy the ultrametric condition. Conversely, Huson, Rupp, and Scornavacca, 2010, Lemma 3.13.2 shows that if the distance matrix D is ultrametric then the tree inferred by UPGMA satisfies D_T=D.

In terms of computational complexity, UPGMA requires O(|\mathcal{X}|^3) time, which is a significant speedup compared to a brute-force search over the tree space.

Felsenstein, 2004 notes that in some scenarios, for example for closely related species where the molecular clock is a reasonable starting assumption, the method behaves well. However, in the general setting, where we cannot claim a fixed rate across the whole tree, the UPGMA approach becomes problematic.

Neighbor-joining

Neighbor-joining is an alternative method for inferring a phylogenetic tree from the distance matrix D. It was originally proposed by Saitou and Nei, 1987 and has since been implemented in several libraries and software packages. We will use the implementation provided in scikit-bio (in case you get a conflict with SciPy, try downgrading SciPy to version 1.5.2), and skip discussing the details of the algorithm in this post.

The motivation for considering neighbor-joining is that it does not require the underlying distance matrix to be ultrametric in order to faithfully infer a phylogenetic tree. However, in order to guarantee that the algorithm infers a tree T with D_T=D we still require the distance matrix to be “tree-like”. This vague term can be made precise via Definition 3.12.4 from Huson, Rupp, and Scornavacca, 2010 (p. 51), also referred to as the four-point condition: a distance matrix D is said to satisfy the four-point condition if for every quadruple of taxa x,y,z,w\in\mathcal{X} we have d(w, x)+d(y, z)\leq \mathrm{max}\{d(w,y)+d(x,z), d(w,z)+d(x,y)\}. When the distance matrix D satisfies this condition we can guarantee that D_T=D for the inferred tree T. The proof of this claim can be found in the original paper.
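The four-point condition is easy to check directly. A small sketch (helper name my own), using the equivalent form that for every quadruple the two largest of the three pairwise sums must be equal:

```python
from itertools import combinations

def four_point(D, tol=1e-9):
    """Check the four-point condition: for every quadruple w, x, y, z the two
    largest of the sums d(w,x)+d(y,z), d(w,y)+d(x,z), d(w,z)+d(x,y) coincide.
    This holds exactly when D is additive, i.e. realizable by a tree."""
    for w, x, y, z in combinations(range(len(D)), 4):
        s = sorted([D[w][x] + D[y][z], D[w][y] + D[x][z], D[w][z] + D[x][y]])
        if abs(s[2] - s[1]) > tol:
            return False
    return True

# Distances on the tree ((A:1,B:2):1,(C:3,D:1)) satisfy the condition:
D_tree = [[0, 3, 5, 3], [3, 0, 6, 4], [5, 6, 0, 4], [3, 4, 4, 0]]
print(four_point(D_tree))   # True
```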

Neighbor-joining has the same computational complexity as UPGMA, and both methods are agglomerative clustering approaches that differ in their distance and update rules. One interesting aspect of the neighbor-joining method is that it has been shown to greedily optimize the balanced minimum evolution criterion on the tree space (Gascuel and Steel, 2006). We will not explore the details of the balanced minimum evolution model at this point, but we might return to the topic in a later post.

Exploring sequence distance metrics

So far we have only considered examples in which a predefined distance matrix D was provided to us from the beginning. In this section we will explore a few choices for constructing D given a set of genomic sequences, and compare how different distance-based inference methods fare in these scenarios.

Jukes-Cantor model and distance

The simplest substitution model used in phylogenetics is the Jukes-Cantor model. Under this model we assume that all substitution rates are fixed and equal, and that the base frequencies are all equal to 1/4. Under these assumptions it is easy to derive the correspondence between the expected number of substitutions (i.e. the evolutionary distance) and the fraction of differing bases in the alignment (i.e. the normalized Hamming distance). Thus, given two aligned sequences x, y we will refer to the quantity JC(x, y)=-\frac{3}{4}\ln\left(1-\frac{4}{3}H(x, y)\right), where H(x, y) is the normalized Hamming distance, as the Jukes-Cantor distance between x and y.
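As a quick sanity check, a few lines computing this quantity (the two sequences are B and C from the example set below, which differ in 3 out of 70 sites):

```python
import math

def jc_distance(x, y):
    """Jukes-Cantor distance from the normalized Hamming distance H(x, y)."""
    h = sum(a != b for a, b in zip(x, y)) / len(x)
    return -0.75 * math.log(1 - 4 * h / 3)

b = "ATGTTTGATAATGGACCCCAAATTCAGCGAAATGCACGCCGCATTACGTTCGGTGGACCCTCAGATTCCA"
c = "ATGTTTGATAATGGACCCCAAATTCAGCGAAATGCACCCTGCATTACGTTCGGTGGACCCTCAGATTCGA"
print(round(jc_distance(b, c), 3))   # 0.044, matching the distance matrix below
```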

Mash distance

MinHash sketching is an effective computational technique for estimating the similarity of two sets. Due to its computational efficiency MinHash has found its way into many data-intensive areas of computer science, including bioinformatics. In particular, it is common to view sequences as sets (or multisets) of k-mers (substrings of length k, think k-grams in NLP), and hence we can compare the (Jaccard) similarity of two sequences via MinHash on the respective k-mers. However, upon closer inspection it is clear that the Jaccard distance is not a good proxy for the evolutionary distance (similarly to how the normalized Hamming distance is not exactly the same thing as the Jukes-Cantor distance). Thus, we want to compute an appropriate adjustment that turns our estimated Jaccard index into an approximate evolutionary metric. One possible approach is given by Ondov et al., 2016, Eq. 4, which defines the Mash distance between two sequences (or sets of sequences) as D=-\frac{1}{k}\ln\frac{2j}{1+j}, where k is the k-mer length and j is the estimated Jaccard index.
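A sketch of the corresponding computation (helper names are my own; for brevity we use the exact Jaccard index of the full k-mer sets rather than a MinHash estimate, which is what Mash itself does):

```python
import math

def kmers(seq, k):
    """The set of all substrings of length k of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mash_distance(x, y, k):
    """Mash distance (Ondov et al., 2016, Eq. 4) from the exact Jaccard index
    of the k-mer sets; assumes the sets overlap (j > 0)."""
    a, b = kmers(x, k), kmers(y, k)
    j = len(a & b) / len(a | b)
    return -math.log(2 * j / (1 + j)) / k

print(mash_distance("ACGTACGTAC", "ACGTACGTAC", 3))   # 0.0 for identical sequences
```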

Comparing UPGMA and Neighbor-joining for JC distance

We will begin with a set of 5 sequences each consisting of 70 nucleotides and containing 5 single nucleotide changes with respect to the first 70 nucleotides of the N gene of the reference genome for SARS-CoV-2.

Reference:
ATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAGATTCAA
Sequences:
ATTGCTGATACTGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGAACCTCAGATCCAA
ATGTTTGATAATGGACCCCAAATTCAGCGAAATGCACGCCGCATTACGTTCGGTGGACCCTCAGATTCCA
ATGTTTGATAATGGACCCCAAATTCAGCGAAATGCACCCTGCATTACGTTCGGTGGACCCTCAGATTCGA
ATGTCCGATAATGGACCCCAAATTCAGCGAAATGCACTCCGCATTACGTTTGATGGACCCTCAGATTCCA
ATGTCCGATAATGGACCCCAAATTCAGCGAAATGCACCCGGCATTACGTTTGATGGACCCTCAGATCCAA

We will label the sequences in this set A, B, C, D, and E in top to bottom order. We get the following matrix of Jukes-Cantor distances for this set:

      A      B      C      D      E
A     0      0.158  0.158  0.158  0.124
B     0.158  0      0.044  0.075  0.124
C     0.158  0.044  0      0.107  0.107
D     0.158  0.075  0.107  0      0.059
E     0.124  0.124  0.107  0.059  0

It is easy to check that this matrix is not additive, and hence is also not an ultrametric. Thus, we do not expect either of the algorithms to yield an optimal tree. Nevertheless we would like to compare the two topologies we get. First we will consider the UPGMA tree:

We can see that sequence A ends up acting as an outgroup, which is not surprising if we return and examine the mutation content within the sequence. Sequences B and C, and D and E, form two clades in the UPGMA tree. Examining the alignment by eye, this appears to be a plausible grouping. However, we note that the distance between sequences B and D is estimated to be smaller than that between B and E, while the inferred tree does not capture this difference.

Next, we will consider the tree obtained from the neighbor-joining method.


Similarly to the UPGMA tree, sequences B and C end up in the same clade in the neighbor-joining tree. Analogously, sequence A appears to form an outgroup. The key difference in this case is the way in which sequences D and E are placed onto the tree. However, the neighbor-joining tree also fails to capture some of the differences in the distances; for example, sequence E is equidistant from B and C in the tree.

Comparing UPGMA and Neighbor-joining for Mash distance

We will start out with the same sequence setup as in the previous run, except instead of the Jukes-Cantor distances we will use Mash distances this time. We are setting the k-mer size to 7 for this scenario, and we obtain the following distance matrix:

      A      B      C      D      E
A     0      0.158  0.158  0.144  0.121
B     0.158  0      0.027  0.057  0.079
C     0.158  0.027  0      0.071  0.071
D     0.144  0.057  0.071  0      0.034
E     0.121  0.079  0.071  0.034  0

Unsurprisingly this matrix is not additive either, and the distance values tend to correlate with the Jukes-Cantor distances. Thus, as expected, both inferred trees have the same topologies as in the previous section, albeit with slightly different branch lengths.


Now, we will try to perform a similar comparison but with a different underlying dataset. We downloaded 454 complete SARS-CoV-2 genomes collected between August 1st, 2022 and August 15th, 2022 from the NCBI Virus database. We then computed the Mash distances between all these sequences using 11-, 17-, 23-, and 31-mers. Finally, for each distance matrix we inferred the phylogenetic tree using the UPGMA and neighbor-joining methods.

In the above dendrogram we truncated subtrees after a fixed depth level for ease of visualization; the labels, when present, indicate a leaf node with the corresponding PANGO lineage assigned to it. One thing that we immediately notice in this case is that BA.2, BA.5 (and BA.4, labels not shown) form a clade, and B.1.617.2, AY.* form a clade. This matches the expected behavior, since the first clade corresponds to the BA.2-related sublineages of the Omicron variant, and the second clade corresponds to the Delta variant. What is slightly surprising is that BA.1 (also a sublineage of Omicron) ends up in a clade with the B.1.616 and B.1 lineages, rather than next to BA.2. We also inferred a neighbor-joining based tree and adopted a leaf coloring scheme similar to the one in UPGMA.


In this tree the deeper shade of orange represents BA.5 and its sublineages, while the lighter shade of orange corresponds to the BA.2 and BA.4 sublineages of Omicron. Green is used for Delta, and red for everything else. One thing that becomes more apparent with this tree is that there is a subclade consisting predominantly of BA.5 sequences that sits at a considerable distance from the rest of the sequences in the tree.

Another interesting observation from this experiment is that the k-mer size used did not drastically influence the resulting tree shape. While there are some minor changes as we increase k, the main clades centered around BA.5 and Delta emerge in all inferred trees. This suggests a practical observation: with a large number of closely related sequences, we can apply very approximate methods (since we stacked Mash distance with UPGMA we effectively “approximated” twice) and still capture some underlying structure in the data.

Data and code availability

In addition to the two Gists shared above you can also find the Jupyter Notebook with the code used to generate figures for this post, as well as sequence data and precomputed Mash distances in the archive attached below.

What’s next

We very briefly touched upon some of the distance-based phylogenetic inference methods and checked out several trees inferred from two evolutionary distance metrics. However, we haven’t rigorously defined how exactly two tree topologies and two phylogenetic trees can be compared. We will dive into that topic over the course of the next few posts.

References

Beyer, William A., Myron L. Stein, Temple F. Smith, and Stanislaw M. Ulam. “A molecular sequence metric and evolutionary trees.” Mathematical Biosciences 19, no. 1-2 (1974): 9-25. https://doi.org/10.1016/0025-5564(74)90028-5

Cavalli-Sforza, L.L. and Edwards, A.W.F. (1967), Phylogenetic analysis: Models and estimation procedures. Evolution, 21: 550-570. https://doi.org/10.1111/j.1558-5646.1967.tb03411.x

Felsenstein, Joseph. Inferring phylogenies. Vol. 2. Sunderland, MA: Sinauer Associates, 2004.

Fitch, Walter M., and Emanuel Margoliash. “Construction of phylogenetic trees.” Science 155, no. 3760 (1967): 279-284. https://doi.org/10.1126/science.155.3760.279

Gascuel, Olivier, and Mike Steel. “Neighbor-joining revealed.” Molecular biology and evolution 23, no. 11 (2006): 1997-2000. https://doi.org/10.1093/molbev/msl072

Haeckel, E. H. P. A. (1866). Generelle Morphologie der Organismen: allgemeine Grundzüge der organischen Formen-Wissenschaft, mechanisch begründet durch die von C. Darwin reformirte Descendenz-Theorie. Berlin. [Featured image]

Huson, Daniel H., Regula Rupp, and Celine Scornavacca. Phylogenetic networks: concepts, algorithms and applications. Cambridge University Press, 2010.

Ondov, Brian D., Todd J. Treangen, Páll Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren, and Adam M. Phillippy. “Mash: fast genome and metagenome distance estimation using MinHash.” Genome biology 17, no. 1 (2016): 1-14. https://doi.org/10.1186/s13059-016-0997-x

Saitou, Naruya, and Masatoshi Nei. “The neighbor-joining method: a new method for reconstructing phylogenetic trees.” Molecular biology and evolution 4, no. 4 (1987): 406-425. https://doi.org/10.1093/oxfordjournals.molbev.a040454

Sets

General definitions

From Wikipedia:

In mathematics, a set is a collection of distinct objects, considered as an object in its own right. For example, the numbers 2, 4, and 6 are distinct objects when considered separately, but when they are considered collectively they form a single set of size three, written {2, 4, 6}. The concept of a set is one of the most fundamental in mathematics. Developed at the end of the 19th century, set theory is now a ubiquitous part of mathematics, and can be used as a foundation from which nearly all of mathematics can be derived. In mathematics education, elementary topics from set theory such as Venn diagrams are taught at a young age, while more advanced concepts are taught as part of a university degree.

To break this down into simpler terms there are two important aspects of what constitutes a set:

  1. A set is a collection of distinct objects.
  2. A set itself constitutes an object, i.e. we can think of it as a tangible collection.

An example of a set can be pizza offerings at Giordano’s (a pizzeria in Chicago). This set contains distinct elements: Pepperoni pizza, Supreme pizza, Goat cheese and spinach pizza, Italian sausage pizza, Margherita pizza; and is in itself an object: a pizza menu.

The code below illustrates how we can declare a set in Python.

my_set = {0,3,4,0,7,9,13,35,0}
print(my_set)
{0, 3, 4, 35, 7, 9, 13}

We can see that, even though we declared some non-distinct (i.e. repeated) elements, the resulting set does not contain duplicates, as evidenced by the print() output.

Set Membership and Subsets

Given an object and a set we can test whether this object belongs to the given set. This is a check for set membership. We can also verify if an object does not belong to a set.

Given a set $A$ and an object $x$, we use the notation $x\in A$ to denote that $x$ is an element of $A$. We also use notation $x\notin A$ to denote that $x$ is not an element of $A$.

The code below illustrates how we can test these conditions in Python.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
print(3 in A)                         # Will print True, because 3 is an element of A.
print(6 in A)                         # Will print False, because 6 is not an element of A.
print(7 not in A)                     # Will print False, because 7 is an element of A.    (note the use of "not")
print(8 not in A)                     # Will print True, because 8 is not an element of A. (note the use of "not")
True
False
False
True

Another important relation is that of being a subset. If membership is a relation between an object and a set, then being a subset is a relation between two sets. Namely we say that $B$ is a subset of $A$, denoted $B\subseteq A$, if every element of $B$ is also an element of $A$. We also will say that $B$ is a proper subset of $A$, denoted $B\subset A$, if every element of $B$ is also an element of $A$, but there are elements in $A$ that are not in $B$.

The code below illustrates how we can test these relations in Python, and provides some examples of subsets and proper subsets.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 3, 5, 7}                      # B is the set of prime numbers less than 10.

print(B.issubset(A))                  # Check if B is a subset of A. Will print True.
print(B <= A)                         # Check if B is a subset of A. Will print True.
print(B < A)                          # Check if B is a proper subset of A. Will print True,
                                      # since all elements of B are in A (subset condition),
                                      # but 11 is in A, and not in B (proper condition).
        
print(A.issubset(A))                  # Check if A is a subset of A. Will print True. 
print(A <= A)                         # Check if A is a subset of A. Will print True.
print(A < A)                          # Check if A is a proper subset of A. Will print False,
                                      # since all elements of A are in A.
                                      # Note: a set is always a subset of itself.
True
True
True
True
True
False

Set Operations

Now, let us take a look at some common set operations. As many things in mathematics, these concepts can become more natural if visualized. Hence, let us briefly introduce the idea of Venn diagrams.

A Venn diagram is a schematic representation of a set and its possible relations with other sets. We usually use (possibly misshapen) circles to denote the “set” and colors or the elements themselves to denote the elements of this set. The few examples below illustrate this idea.

Venn diagram of letters

Venn's four ellipses diagram

Set Union

The first set operation we will look at is the set union. We can think of it as addition for sets. The result of a set union is the set containing the elements that appear in either of the sets. The following Venn diagram shows in red the union of sets $A$ and $B$, denoted $A\cup B$.

A union B

We can compute a union of two sets in Python by using the union method or by using | operation on sets. The code below illustrates this.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A | B
print(C)
{2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19}
A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A.union(B)
print(C)
{2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19}
# Note that any possible "overlapping" elements will be only accounted for once,
# and thus the result will be a set (elements are distinct). This can be seen from
# the Venn diagram (the intersecting region is covered once) and from the example
# below.
A = {4, 5, 6, 7, 8, 9}
B = {6, 7, 8, 9, 10, 11}
C = A | B
print(C)
{4, 5, 6, 7, 8, 9, 10, 11}

Set Intersection

The next operation is set intersection. A set intersection is a set (possibly an empty one) that contains elements that appear in both sets. In other words, intersection is the overlap of the original sets. The following diagram shows the intersection of sets $A$ and $B$, denoted $A\cap B$.

A intersection B

We can compute an intersection of two sets in Python by using the intersection method or by using & operation on sets. The code below illustrates this.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A & B
print(C)
{2}
A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A.intersection(B)
print(C)
{2}
# Only the elements present in BOTH sets get into the intersection. Thus in some
# cases the intersection can be empty. A Venn diagram for this case would be two
# non-overlapping circles.
A = {4, 5, 6, 7, 8, 9}
B = {10, 11, 12, 13, 14}
C = A & B
print(C)
set()

Set Difference

The next operation we will look at is the set difference. It is often useful to know which elements belong to one set, but not the other. The set difference is the set that contains the elements from the first set, but not the second one. The following diagram shows the difference of sets $A$ and $B$, denoted $A - B$ or $A\setminus B$.

A difference B

We can compute a difference between two sets in Python by using the difference method or by using - operation on sets. The code below illustrates this.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A - B
print(C)
{3, 5, 7, 11, 13, 17, 19}
A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A.difference(B)
print(C)
{3, 5, 7, 11, 13, 17, 19}
# Note that just like the difference of two numbers depends on the order, the difference
# of two sets also depends on which one we want to subtract from. The example below
# illustrates this idea.
A = {4, 5, 6, 7, 8, 9}
B = {7, 8, 9, 10, 11}
C = A - B
D = B - A
print("A - B is {}".format(C))
print("B - A is {}".format(D))
A - B is {4, 5, 6}
B - A is {10, 11}

Set Symmetric Difference

The last set operation we will talk about is the symmetric difference. There are several ways to think about the symmetric difference, but all of them encapsulate the same idea: we want a set whose elements appear in either $A$ or $B$, but not in both. Using the notation defined above we can write this as $(A\cup B) - (A\cap B)$ (the union of the sets minus their intersection) or alternatively as $(A - B) \cup (B - A)$ ($A$ without $B$ union $B$ without $A$). The following diagram shows the symmetric difference of sets $A$ and $B$, denoted $A \Delta B$.

A symmetric difference B

We can compute the symmetric difference between two sets in Python by using the symmetric_difference method or by using ^ operation on sets. The code below illustrates this.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A ^ B
print(C)
{3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19}
A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A.symmetric_difference(B)
print(C)
{3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19}
# The symmetric difference, unlike the regular set difference is symmetric. 
# Which means that the order of sets does not matter, the result will be
# the same, as illustrated by code below.
A = {4, 5, 6, 7, 8, 9}
B = {6, 7, 8, 9, 10, 11}
C = A ^ B
D = B ^ A
print(C)
print(D)
{4, 5, 10, 11}
{4, 5, 10, 11}

Additional remarks

Just like arithmetic operations are defined on two numbers but extend to lengthier expressions, set operations can be applied to multiple sets. In Python the easiest way to express this is using the set operators |, &, -, ^ and appropriate parentheses () to group the operations. The examples below illustrate this idea.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = {3, 5, 7, 9, 11, 13, 15, 17, 19}  # C is the set of odd numbers > 1 and < 20.
D = {3, 6, 9, 12, 15, 18}             # D is the set of numbers divisible by 3, > 0 and < 20.
# This set is the union of C and D minus A. 
# Effectively it will contain numbers divisible by 3 or odd, that lie between 1 and 20,
# but will not contain the numbers that are prime.
E = (C | D) - A 
print(E)
{6, 9, 12, 15, 18}
# This set is the symmetric difference of A, B and the difference between
# C and D.
# A repeated symmetric difference keeps the elements that appear in an odd
# number of the sets. In particular, F will not contain numbers that appear
# in exactly two of the three sets (e.g. 11, which is both prime and odd
# without being divisible by 3).
F = A ^ B ^ (C - D)
print(F)
{3, 4, 6, 8, 10, 12, 14, 16, 18}

Use Cases

We went through the trouble of learning the definitions for basic set arithmetic (the Operations section) and membership and subset relations (the Membership and Subsets section), so now is a good time to present some use cases for these structures and operations.

Besides being an essential building block in modern mathematics, sets often present a highly convenient data structure in programming. The examples below will guide you through some useful applications of sets in programming. Some of these examples are inspired by real production code.

Filtering down unwieldy lists

Sometimes we are faced with the problem of filtering a rather large list to only show unique values. A few common examples include the following:

  1. Identifying the unique caller IDs for a large list of phone calls.
  2. Identifying categories of the items carried by a store from the full inventory list.

Below we will address both of these problems by leveraging the fact that a set stores only distinct elements, so converting a list to a set keeps exactly the unique values.

# Problem 1.
# ----------
# Write a function that takes in a list of phone numbers (as strings),
# and returns a list containing the unique phone numbers from the original
# list.
#
# Input: list of phone numbers.
# 
# Output: list of unique phone numbers.
def phone_id_unique(numbers):
    unique_numbers_set = set(numbers)
    unique_numbers_list = list(unique_numbers_set)
    return unique_numbers_list
# Problem 1.
# ----------
# Tests:
#
# 1. Input:  ["800-000-0000" repeated 1 000 000 times]
#    Output: ["800-000-0000"]
test_input = ["800-000-0000"] * 1000000
print(phone_id_unique(test_input))

# 2. Input:  ["800-100-0000" repeated 1 000 000 times, "800-200-0000" repeated 1 000 000 times, ...,
#             "800-900-0000" repeated 1 000 000 times]
#    Output: ["800-100-0000", "800-200-0000", ..., "800-900-0000"] (in some order)
test_input = []
for i in range(1, 10):
    test_input = test_input + ["800-{}00-0000".format(i)] * 1000000
print(phone_id_unique(test_input))

# 3. Input: ["800-000-0000", "800-010-0000", "800-020-0000", "800-030-0000"]
#    Output: ["800-000-0000", "800-010-0000", "800-020-0000", "800-030-0000"]
test_input = ["800-000-0000", "800-010-0000", "800-020-0000", "800-030-0000"]
print(phone_id_unique(test_input))
['800-000-0000']
['800-200-0000', '800-500-0000', '800-700-0000', '800-300-0000', '800-600-0000', '800-400-0000', '800-800-0000', '800-100-0000', '800-900-0000']
['800-020-0000', '800-000-0000', '800-010-0000', '800-030-0000']
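Note that the outputs above come back in arbitrary order: converting a list to a set discards the original ordering. If the order of first appearance matters, one common alternative (sketched below as a hypothetical variant, not part of the function above) is `dict.fromkeys`, relying on the fact that dictionaries preserve insertion order since Python 3.7.

```python
# Hypothetical variant of phone_id_unique that keeps first-appearance order.
# dict.fromkeys builds a dict whose keys are the unique numbers, in the
# order they were first seen; listing the keys gives an ordered dedup.
def phone_id_unique_ordered(numbers):
    return list(dict.fromkeys(numbers))

print(phone_id_unique_ordered(["800-000-0000", "800-010-0000", "800-000-0000"]))
# ['800-000-0000', '800-010-0000']
```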
# Problem 2.
# ----------
# Write a function that takes in a list of products carried by a store
# (as dictionaries), and returns a list containing the product categories
# that appear in the original list.
#
# Input: list of items.
# 
# Output: list of product categories.
def product_categories(items):
    categories_set = set([item["category"] for item in items])
    categories = list(categories_set)
    return categories
# Before testing we will load some data from a .csv file. The file should be placed
# in the same directory as the notebook. CSV stands for comma-separated values, and
# is a common plain-text format for tabular data.
items = []

import csv
with open("data_produce.csv", "r") as f:
    reader = csv.reader(f, delimiter=",")
    for line in reader:
        items.append({"id": line[0], 
                      "category": line[1],
                      "stock": line[2],
                      "price": line[3]})

# Tests:
#
# 1. Input:  [1000000 items from 8 categories]
#    Output: ["perishables", "water", "kitchen", "furniture", "electronics", "paper", "pantry", "misc"]
print(product_categories(items))
[' pantry', ' furniture', ' water', ' kitchen', ' perishables', ' paper', ' electronics', ' misc']
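Notice the leading spaces in the output (e.g. ' pantry'): the CSV file apparently has a space after each comma, and csv.reader keeps it as part of the field. One way to clean this up, assuming each category is a string, is to strip whitespace while building the set (a hypothetical variant is sketched below; csv.reader also accepts skipinitialspace=True to handle this at read time).

```python
# Hypothetical variant of product_categories that trims stray whitespace
# around each category before deduplicating.
def product_categories_clean(items):
    # A set comprehension dedups and .strip() removes leading/trailing spaces.
    return list({item["category"].strip() for item in items})

items = [{"category": " pantry"}, {"category": "pantry"}, {"category": " water"}]
print(sorted(product_categories_clean(items)))
# ['pantry', 'water']
```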

Implementing common logical operations

Mathematical logic and set arithmetic are tightly connected. This allows us to use set arithmetic to model common logical operations, which in turn can encapsulate some everyday tasks we want to perform with our data.

Set union is analogous to the logical OR operation, set intersection to logical AND, and symmetric difference to logical XOR (exclusive OR). Thus we can translate common tasks into set operations. Let us look at some examples below.
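This correspondence can be checked directly on membership: for any element x, membership in a union, intersection, or symmetric difference reduces to OR, AND, or XOR of the individual membership tests. A small sanity check, reusing the sets A and B from earlier (redefined here so the snippet is self-contained):

```python
A = {2, 3, 5, 7, 11, 13, 17, 19}      # primes less than 20
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # even numbers > 0 and < 20

for x in range(20):
    assert (x in A | B) == ((x in A) or (x in B))   # union <-> OR
    assert (x in A & B) == ((x in A) and (x in B))  # intersection <-> AND
    assert (x in A ^ B) == ((x in A) != (x in B))   # symmetric difference <-> XOR
print("all correspondences hold")
```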

# Problem 3.
# ----------
# Write a function that takes in a set of items on mom's shopping list,
# a set of items on dad's shopping list, a set of items already bought by
# mom, a set of items already bought by dad, and finally a set of items
# that are currently in the fridge. The output should be a consolidated
# shopping list, i.e. it should only include the items that are not in the
# fridge and are not yet bought. 
#
# Input: 5 sets of items as described above.
# 
# Output: list of items that need to be procured.
def shopping_list_cons(mom_to_buy, dad_to_buy,
                       mom_bought, dad_bought,
                       in_fridge):
    all_to_buy = (mom_to_buy | dad_to_buy)
    all_bought = (mom_bought | dad_bought)
    to_buy = list(all_to_buy - all_bought - in_fridge)
    return to_buy
# Problem 3.
# ----------
# Tests:
#
# 1. Input:  mom_to_buy = {"apples", "candy", "chicken", "beef"}
#            dad_to_buy = {"candy", "beef", "bread", "cola"}
#            mom_bought = {"bread", "biscuits"}
#            dad_bought = {"milk", "coffee"}
#            in_fridge  = {"eggs", "chicken"}
#    Output: ["apples", "candy", "beef", "cola"] (in some order)
mom_to_buy = {"apples", "candy", "chicken", "beef"}
dad_to_buy = {"candy", "beef", "bread", "cola"}
mom_bought = {"bread", "biscuits"}
dad_bought = {"milk", "coffee"}
in_fridge  = {"eggs", "chicken"}
print(shopping_list_cons(mom_to_buy, dad_to_buy, mom_bought, dad_bought, in_fridge))
['apples', 'candy', 'beef', 'cola']

Here we take advantage of some set operations to solve the problem. First, we consolidate both “to buy” lists taking their union, thus ensuring that all items are accounted for and none are double counted. Then we consolidate the already bought items by taking another union. Finally, we take the difference between what we need to buy and what is already bought or already in the fridge.

In the next problem we will take a look at an application of the symmetric difference.

# Problem 4.
# ----------
# Students at Chicken Soup High-school are offered two
# options for Calculus classes. There is an "Intro to Calculus"
# class and "Calculus" class. Some students take only 
# the first class through their time in high-school,
# some only take the second class, by placing out of the
# first one, and finally some students take both classes
# as a sequence.
# At the end of each year, an average for these classes'
# performance is computed to evaluate the effectiveness
# of the instructors. The average is computed according
# to a strange formula, because the statistics and
# performance department of Chicken Soup High loves
# hard-to-understand numbers.
# You are provided with the formula and the list of
# students in each class together with their grades. To
# protect the students' privacy you are given unique
# StudentIDs.
#
# AVG = (avg grade over all grades of students who took
#        exactly one of the two classes) +
#     + 1.75 * (avg grade over all grades, in both classes,
#        of students who took both)
#
# Write a function that takes in two dictionaries of 
# StudentIDs and grades and computes the average according
# to the given formula.
#
# Input: a dictionary for students who took "Intro to Calculus",
#        a dictionary for students who took "Calculus".
# 
# Output: the average grade.
import numpy as np
def chicken_soup_high_avg(intro, calc):
    # Students who took exactly one of the two classes.
    one_course = set(intro.keys()) ^ set(calc.keys())
    # Students who took both classes.
    both_courses = set(intro.keys()) & set(calc.keys())
    one_grades = []
    both_grades = []
    for studentID in one_course:
        # By construction each of these students appears
        # in exactly one of the two dictionaries.
        if studentID in intro:
            one_grades.append(intro[studentID])
        else:
            one_grades.append(calc[studentID])

    for studentID in both_courses:
        both_grades.append(intro[studentID])
        both_grades.append(calc[studentID])

    avg = np.mean(one_grades) + 1.75 * np.mean(both_grades)

    return avg
# Problem 4.
# ----------
# Tests:
#
# 1. Input: {"1": 4.0, "2": 3.75, "3": 3.4},
#           {"2": 3.8, "3": 3.0, "4": 4.0}
#    Output: 10.10
print("{:.2f}".format(chicken_soup_high_avg({"1": 4.0, "2": 3.75, "3": 3.4},
                                            {"2": 3.8, "3": 3.0, "4": 4.0})))
10.10

Further remarks

While not strictly necessary, sets can make several classic graph algorithms easier to write and explain. We will cover those in the graphs section.
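As a teaser, here is one place a set typically shows up in graph code: the visited set in a breadth-first search, where the O(1) membership test keeps the traversal simple. This is only a sketch over a made-up adjacency-list graph, not the implementation we will develop in that section.

```python
from collections import deque

def bfs_order(graph, start):
    # graph: dict mapping each node to a list of its neighbors.
    visited = {start}      # set gives O(1) "have we seen this node?" checks
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
print(bfs_order(graph, "a"))
# ['a', 'b', 'c', 'd']
```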

Downloads