2020: A different kind of year in review

Good evening, world. First of all, this is not exactly a review of the year 2020, but rather a review of a specific period roughly starting around mid-March 2020 and lasting till today, mid-March 2021. The “lasting till” part is not particularly meaningful, since as of today the future is not much more certain than it was a year ago, but it rather demarcates an amount of time I am willing to write about and feel the need to write about. As per usual I will try to break this down into sections for easier navigation, although this time the sections are likely to be chronologically rather than thematically ordered.

March-April 2020

Around beginning of March I was self-isolating in my apartment, so by the time WHO declared COVID-19 a pandemic and Rice University has suspended the classes (to be followed immediately by the Spring break) I was already settling into the realities of the work from home routine. I already have been working on SARS-CoV-2 sequence analyses for about 2.5 months by then, and was steadily getting deeper into the genome and variant analyses and related topics. By mid-April we have been working in collaboration with Baylor College of Medicine and Weill Cornell Medical College on the study of intrahost genomic variation in SARS-CoV-2.

Outside of the immediate research work this period was also full of the challenges of the schoolwork adjustment to the virtual environment. On my end most of these transformations were not particularly impactful, since in the spirit of good old college days I simply stopped attending most of the lectures opting for a 2x rewinds of recordings at convenient times. However, for the seminar-style classes the impact was noticeable. Giving a presentation into the void of a Zoom call felt eerie and rather disappointing, a feeling that hasn’t really left me ever since, but gained a new set of characteristics as the Zoom fatigue became more of a well-known phenomenon.

May-June 2020

By early to mid-May classes were mostly wrapped up, and to the credit of my instructors for this semester, I did feel that I managed to learn something new and exciting, even if a good chunk of the process had to be scrambled and re-built in less than two weeks. On the flip side, the Zoom call festival was just picking up. This time period was likely the busiest so far for me with respect to the number of Zoom calls I have been present on. In retrospect I can’t help but feel that in some way the work from home routine and initial panic about “social distancing” (which in fact really is more of a physical distancing) made a lot of people crave the meetings a bit more than they should. Nevertheless, a lot of the meetings I had attended in this period were instrumental for my work, and helped greatly in shaping the initial draft of the paper that we ended up submitting towards the end of June.

This two month stretch was also the most work intense part of the whole year and arguably that was in big part made possible by the work from home imperative. Without spending time on grocery shopping (we were rather panicked at that time and relied on Amazon Fresh…), bar hopping, museums, or pretty much any activities outside the apartment (except periodically joining for walking the dog), I could funnel all of my waking time into research, interspersing it with cooking and home-brewing to main an ounce of sanity. Needless to say, that was also only possible since the exhaustion from the said routine haven’t kicked me hard yet, and I could indeed focus for a solid 8 to 8 on a typical day. Well, you probably know what’s coming next.

July-August 2020: The swamp

First, I hate Houston summers! Second, I hate Chicago summers. Third, I am sure that I just genuinely do not enjoy summers, in all ways shapes and forms, since graduating high school. “The swamp” comes, of course, from the great humid and hot weathers of the bayou that I am living upon nowadays, but also from the general feeling I had throughout that period of time. The April-June push took it’s toll on me, and I was not in the best working mood for the rest of the summer. I mostly concerned myself with another collaboration, this time looking into the Influenza A virus. However, my overall progress was slow, and on certain weeks I could barely squeeze out 8-12 hours of solid work for the entire week.

Additionally, dropping out from a work only mindset made me observe the realities of the world a bit more clearly and with COVID-19 cases spiking in Texas and globally, racial injustices being brought up front and forward in the USA, and the relatively shaky situation with the visa policies those realities were not pretty. For people who might have had more energy at that point in time it might have been a time of action and/or reflection, for me it was mostly a time of sadness, discomfort and a general overwhelming feeling of “the swamp”.

September-October 2020

With summer being over the air in Houston became more breathable, but my exhaustion was still going strong. Beginning of classes helped alleviate that to some extent, and at some point almost re-energized me to the April levels of productivity. That period was relatively short. The month of September went by relatively uneventful for me, although I had to take care of a few things in the preparation for October.

October was hectic. My parents were visiting me from Moldova (a trip they booked way before the pandemic, and re-scheduling which would be an even bigger mess), so I had to figure out safe housing options and some form of an itinerary for their visit. On top of that, I was going to give a talk at a Ken Kennedy Institute Data Science conference mid-October, and that required some polishing and prepping. At the end of the day, both went smoothly, my parents were lucky to visit roughly in between two major COVID-19 waves in Texas, they took proper precautions, and most of their visit consisted of a road trip and hiking in the Grand Canyon/Sedona area (which I could not join, due to the aforementioned talk). The talk went by fine, as by that time I got accustomed to the practice of speaking into the Zoom void.

Towards the end of this period I felt more accustomed to the new life. Not the “dive into work” accustomed as in the beginning but rather routine accustomed. I had my balance of work and off-days, meetings I wanted or needed to attend and the ones that could be skipped, and some workable sense of interaction with the world through mostly digital means.

November 2020

Most of the November was consumed with responding to reviewers’s feedback on the SARS-CoV-2 paper, tying loose ends on other projects, and getting into the groove of wrapping up schoolwork. Importantly, I also finally got back to reading (no, scientific papers don’t count, even if I do enjoy a solid 60-75% of them). I finished my way through the I want to be a mathematician: An automathography by Paul Halmos, which was a nice and witty book and to a degree helped me in the process of recovering some of the energy and excitement about work. I also finally completed the reading of A Random Walk Down Wall Street, which while not extremely underwhelming was somewhat of a medium tier prose in my opinion. Also inspired by my parents adventures in the parks of Arizona I started planning a short retreat of my own into the wilderness.

December 2020

Classes were now officially done for the semester, revised manuscript has been submitted, and I was ready to take a week off. We have travelled West and South to a small town of Terlingua, TX, where we found a nice clean little motel. The two main attractions (both at about 20-40 minute drive from our base) were the Big Bend National and State parks. Both spots were amazing, with nice mountain hikes, desert views and scenic drives along Rio Grande it was a good way to let my eyes rest for a bit. The location was not busy with people making hiking in these places a much more insular and reflective experience (not to mention that it was also nice to travel without too much worry about COVID circumstances).

The rest of December went by fast and wasn’t particularly eventful.

2021

At the start of the new year I was getting back into the groove of things trying to spend a bit more time walking outside in parks and along the bayou. Work got into a steady and solid flow and with a new semester picking up, I got a few more things to work on, which brought a bit of fresh air into the workflow. I also continued with reading more, just finishing Hilbert by Constance Reid and setting my eye on a few fiction books, since I have been missing out on reading fiction for a while. I also had a great opportunity to give a talk at UChicago Center for Translational Data Science discussing the SARS-CoV-2 work we did over the past year. In general, even though in the global sense life is still far from stable, I am feeling like I have managed to find some local stability, which makes me somewhat optimistic for the next few months to come.

Other notes and miscellanea from past year

I heavily ignored personal projects throughout this period. Well, at the very least the ones I’d typically write about here. I did spend a lot of time cooking, had a long foray into home brewing mead, and dabbled in some music for a bit at different points in time. The closest thing to come to the personal projects scale was getting comfortable with ggplot2 and RShiny, but at the current moment neither really got beyond the level of purely professional engagement. I think that a rather sizable increase in screen time drove me away form my standard set of personal projects, since I got reasonably tired of spending time in front of my laptop with the regular work and study activities.

Talks in the Zoom age: Speak into the void and let the void speak back

I don’t think that this note warrants its own post, so I will tuck it in at the end of this review for those interested in my opinion on the Zoom talks.

At the start I felt discouraged by the Zoom talks. First, they completely remove the energy feedback loop between presenter and the audience, hence the “speaking into the void” claims. Not to mention, that the same problem of speaking into a device, also makes it somewhat physically uncomfortable for the presenter, since you need to maintain relatively same posture, limit hand gesturing and stare into the tiny dot of a webcam for the “engaging with the audience” feeling (to be precise, the feeling is not for you, it’s for the people watching the talk). Second, Zoom fatigue is very real, and concentrating on a speaking screen can be hard for some people (including me). Hence, “the void” is really often empty, with just a few people tracking with the presentation for limited blocks of time. In the meantime I developed some techniques that help me stay more engaged, including chat discussions, sidebar banter with friends attending the same talk, and paper and pen note taking. All of these help to a degree, but nevertheless the problem of fatigue still stands. Third, just as with the regular talks, when “the void” speaks back the questions are often disconnected form the talk, and without the clear prospect of the “taking it offline” it feels that some interesting in-depth discussion never end up happening.

However, with time and practice comes some form of acceptance. I ended up finding the positives in Zoom talks as I attended and given more of them. First, it really helps with boosting my confidence. People not asking questions or being disengaged doesn’t burn the same way when their cameras are off as they fall asleep or switch to watching TikToks. Second, for lectures (on the receiving end) Zoom or YouTube is great. You can play the recording at the speed you want, rewind, and replay the fragments as you need. Interactiveness takes a hit, but as long as there is some venue for asking more questions it is really convenient. Third, Zoom really opens up the possibility of attending talks globally. I’ve been to Berkeley, Chicago and a couple other places to attend the talks without the cumbersome process of getting onto a plane.

At the end of the day, I hope that some changes that we see Zoom bring into the world of lectures and talks are here to stay, since not all change is bad. Of course, I also hope that at some point later in the year, I will be able to grab a beer after a talk to discuss the finer points with the presenter (not over Zoom).

Week(ly quick)ies: Random walks

Introduction

Since my blog has been experiencing a shortage of content (mostly due to the fact that I am busy with my current research work), I decided to try out a new format that will hopefully motivate me to post more often. “Weekies” i.e. “weekly quickies” are going to be weekly posts of small puzzles that I come up with or come across on the web. I plan to follow up each “weeky” with a longer write-up of solution and discussion roughly within a month or so of the original post. In the meantime do not hesitate to post solutions and discuss the puzzle in the comments under the original post.

Random walks

Consider the K4 graph/tetrahedron depicted below. The following questions concern different random walks starting at the vertex A.

Q1. Assume that at any time step we will move to a random neighboring vertex with the equal probability of 1/3. What’s the probability that we will visit all vertices of the graph before returning to A?

Q2. Given the same walk as in question 1, what’s the probability that we will visit any one vertex out of {B, C, D} at least n times before returning to A?

Q3. Now, let’s change our walk to have a 1/2 probability of remaining at the current vertex and probabilities of 1/6 to move to each of the neighboring vertices. How do answers to questions 1 and 2 change for this walk?

Q4. With the walk described in question 3, what is the expected number of steps we need to take to visit all vertices at least once?

Q5. Let us record the sequence of visited vertices as a string of characters A, B, C, D. Consider a random walk in question 3 and record its first 10 vertices. What’s the probability that the string “ABCD” occurs as a subsequence in the recorded sequence?

Q6. We can generalize the walk described in question 3 using the parameter q ∈ [0, 1] by letting the probability of staying at the current vertex be q and probability of going to any given neighbor be (1-q)/3. Additionally, let the start vertex be picked uniformly at random. Let n be the length of the sequence of vertices that we record, analogously to question 5. Let Pq,n be the distribution on the Ω={A, B, C, D}n generated by the described process. Let Un be the uniform probability distribution on the Ω. Let D be the total variation distance function and let us denote d(q,n) = D(Pq,n, Un). Investigate behavior of the function d : [0, 1] ⨉ ℕ → [0, 1].

Monte Carlo your way out of a puzzle: Why events of measure 0 matter.

I was sitting in my office, a couple of weeks ago, minding my regular bioinformatics flavored business with a cup of oolong tea, when Senthil stopped by and told me that he has a puzzle he’d like some help with. Now, both me and Senthil enjoy math and we especially enjoy small nifty tricky problems. Thus, the puzzle he shared was the following.

Consider a unit grid on a plane. Pick a non-negative real number L. Now, we can place a random segment of length L in the plane by picking a point and an angle uniform at random and drawing a segment of length L originating at the chosen point and oriented based on the chosen angle. Which value of L maximizes the probability that this segment will intersect exactly one line of the grid?

We spent about 5 minutes chatting about different placements of the segment, probabilities, and our lack of desire to compute a nasty triple integrals. We had to wrap up quickly since, I was short on time and had to finish up some other tasks.

Last Thursday I was co-organizing our department’s happy hour, and Senthil reminded me of the outstanding problem. I made a pinky promise that I’ll help on Friday, which brings us to the story I am about to tell you.

Do not integrate, just simulate!

Since, I was not sold on the idea of computing integrals first thing on a Friday morning, I decided to quickly write up a simulation that would do exactly what the problem statement described. After about 20 minutes of typing away, I had my tiny C program up and running and spitting out probabilities with 6 sweet significant digits. Senthil came by quite soon, and within 40 minutes of back and forth of computer assisted linear search we were quite certain our answer for L was 1. Yay, problem solved, L = 1, probability is approximately 0.63774, and we can all go drink coffee and do whatever Computer Science PhD students do on a Friday. Well, no.

Of course we were not happy, I meant what is 0.63774? Have you heard of this number before? Does it have any meaning? Is this the number of goats you must sacrifice to Belial before you learn the great mysteries of the world? We took to OEIS and found (among other amazing things) that 0, 6, 3, 7, 7 is a subsequence in the decimal expansion of Mill’s constant, assuming the Riemann hypothesis is true. Other than that the number didn’t quite give us an immediate answer. We also checked the magic of WolframAlpha, which gave us some neat ideas, like (3 π)/(2 e^2). However, as you can guess reverse engineering this fraction was 1) stupid; 2) hard; and finally 3) something we didn’t do.

In the meantime, we decided to put our theory hats back on and figure out what is happening, and how we can get an analytical expression of this probability. Senthil recalled that in fact there is a very similar well-known question, called Buffon’s needle problem, which has an amazingly clean solution. However, we could not see how one can easily translate the setting of that problem to ours. After some amount of drawing, complaining about integrals and debating which events are measure zero, we decided to take a coffee break and walk outside for a bit. This is when the major theoretical breakthrough happened.

Senthil noticed that instead of thinking about the grid as a whole, we can consider two sets of parallel lines independently. The advantage of this view was clear, now we have the Buffon’s needle problem embedded into ours directly! We multiplied some probabilities, got wrong numbers and complained about futility of theoretical efforts. Then we of course reimagined how stuff should be multiplied and counted things again, and again, and again. Are the events independent? Do we count this in or subtract it out? Do we go back to the integral formula?

Clearly the two intersection events are not independent, since the orientation (based on angle) that works well for horizontal intersection, is not favorable for the vertical one. Thus we can’t just multiply stuff out and subtract from one. Thus, we went back to the integral. This is where things got tangled up. We recomputed the triple integral a few times, but something was not tying out. We computed integral of a product of integrals and something still didn’t tie out. We have been running in circles, seeing numbers like 0.202 and 0.464 all over the place. Finally, the original answer to Buffon’s needle, the 2/π, was temptingly close to our own, since it was 0.636619. Finally, around 4pm we threw in the towel and parted ways.

Simulation clearly saved the day, and even if we were not sure of the theory behind this number, we knew the answer with high confidence, and that was already a win. But…

Back to the drawing and integration

I took out a piece of paper, folded it in four, and started sketching. There is a nice video out there that explains how to derive solution to the Buffon’s needle problem. Thus, I just went through the steps, drawing out the circle, and thinking what are the exact conditions we need to satisfy to get our segment intersect exactly one grid line.

The shaded region above represents the range of admissible positions for the needle (black arrow) when the angle formed by needle and x-axis does not exceed 90 degrees. Note that we only consider intersection with the line A_0A_1, since the intersection with the line B_0B_1 is identical by symmetry.

Now, considering the angle of the needle to x-axis to be \theta\leq\frac{\pi}{2} and distances to horizontal and vertical lines to be y\phantom{\frac{1}{1}} and x\phantom{\frac{1}{1}} respectively, we can define the conditions for the admissible position as 0\leq\theta\leq\frac{\pi}{2},\,y\leq\frac{L}{2}\sin\theta,\,x\geq\frac{L}{2}\cos\theta. Thus, we can write down our desired integral, and evaluate it as \displaystyle\int_0^{\frac{\pi}{2}}\displaystyle\int_0^{\frac{L}{2}\sin\theta}\displaystyle\int_{\frac{L}{2}\cos\theta}^{\frac{1}{2}}dx\,dy\,d\theta=-\frac{1}{8}(L-2)L. Now, dividing through by the total area and multiplying by 2, due to symmetry for the intersection to the second line, we arrive at the coveted \frac{2}{\pi}\approx 0.636619. Case can be rested, we now have a neat theoretical solution to the question, and a beautiful mathematical expression as an answer. However, I am still unhappy, because I cannot ascribe the error we see in the simulation to pure numerical precision at this scale.

The devil is in the details: wording and measure zero events

One of the first things you learn in a probability theory class is that events of measure zero don’t matter. Picking a specific point on a line, or even picking a rational point on a line are measure zero events, hence they don’t really matter when you calculate probabilities. Nevertheless simulations are not only countable, they are finite! Which means accidentally dropping in an event of measure zero that shouldn’t be there can trip you up. This is exactly what happened in our case. We want to count the number of times a random segment intersects exactly one line of the grid. Now, I assumed this condition meant instead: segment intersects the grid at exactly one point! The difference shines for the points that are intersections of the two sets of parallel lines forming the grid, because they are a single grid point, but a segment crossing this point in fact intersects not one, but two lines!

I have adjusted my counting scheme to correctly reflect this edge case as intersecting two lines, and therefore, not being a favorable outcome. Guess what happened next… Exactly, the sweet shiny 0.6366313 popped out of my simulation as the predicted probability of the event. Of course, this number is closer to \frac{2}{\pi} than our previous candidate, and now all the doubts are gone, the theory and simulation agree!

Discussion and lessons learned

First of all, if my simulation runs in about 2 minutes (going through 500,000,000 random segments), why not run a longer simulation to counter the effect of nasty accidental event? Well, numbers have their limits on computers, and overflows happen. I tried running the simulation for 2,000,000,000 and 20,000,000,000 segments. The first run gave a slightly smaller number than the original 0.6377, but still not quite 0.6366. The second one, gave a whooping probability of -8.648, which even those not too familiar with measure theoretic probability would call out as absurd. Can I spend more time bugfixing this, and get a great simulator, that can scale to 1,000,000,000,000 segment trial? Yes. Is it reasonable to pour time into this? No!

Second, why not do the theory neatly from the start? At the end of the day the problem really takes about 30 minutes of careful drawing and setup and 3 minutes of WolframAlpha-ing the integral. Well, first this involves dissecting the problem theoretically which takes time. Second, writing up the simulator took 20 minutes, for contrast understanding how the Buffon’s needle problem is solved took about the same 15-20 minutes. The latter on its own didn’t give a solution, the former at least provided a good numerical estimate. Finally, recall how the problem is actually about finding the optimal length L? Well, that was quite a block to developing nice theoretical answer, once we were quite certain L is 1, we could actually focus on the theory that mattered.

Third, why not read the problem statement carefully and not jump to wild assumptions? Well, this one goes into lessons learned bucket, alongside the entire finiteness of simulations shenanigans. You would expect that years of problem solving would teach me that, and yet again, I make the same mistakes.

Overall I can say that this was an amazing exercise in basic probability, geometry and mocking up simulations to solve puzzle problems. I am happy to learn that writing a quick program can help in figuring out theory behind a question. Furthermore, I am also quite delighted to experience nuances of probability first hand, without putting important questions and results on the line.

Final notes and resources

Many thanks to Senthil Rajasekaran for bringing this problem to my attention and guiding me through the crucial steps on getting the theoretical answer in the scopes. Looking forward to more puzzles to solve!

My code for the simulator is available in this (zipped) file. Note that the random number generator was lifted from a course by Benoit Roux at UChicago, and the rest of the code is mine. The commented out portion of the intersect function is exactly what caused the annoying of by 0.0011 error.

Thanks for the attention, and happy bug catching y’all!

2019: A year in review

Travel

This year was quite eventful in terms of travel, both domestic and international. I started off with getting on a project in Connecticut, so starting the last week of January I was flying from Chicago to LaGuardia every Monday, and flying back out every Thursday. Of course this was a prime opportunity to rack up miles and hotel points, which came in handy later throughout the year. I also used my constant travel as an excuse to visit a few of my friends spread all across the US. I made stops in NYC, Philadelphia, Austin, and Boston. I also took a trip to New Orleans, which was a long standing bucket list item for me and my friend Cris. In between these fun trips, I also squeezed in a few more career related ones, including visiting Baltimore and Houston for graduate program weekends, and returning to Baltimore again for BPS’19.

The next major trip I undertook was going back home to Chisinau, Moldova for the entire month of July. It was a relaxing and fun time, and I mostly used it to unwind after a work intensive year. It also was an interim in my moving process from Chicago down to Houston. In the beginning of August, I flew back to Chicago, packed the last few suitcases (not really) and headed south.

I got to Houston mid-August, and the entire move-in process went extremely smooth thanks to my awesome roommate Robert, who helped organize common spaces in the apartment. I took another quick trip to Boston, to enjoy the last grill session of the (northern) season, and began my studies. I then also took two trips to Chicago, one for the autumn recess and one for Thanksgiving. Both trips were fun, and coming back to Chicago felt like coming home.

Finally at the end of this year I took a trip to Spain. It was an ambitious itinerary listing more than 5 cities and mostly organized and planned by Rachel. I pre-gamed the trip by spending a day in NYC, and then headed to Barcelona. From Barcelona we visited Girona and Figueres as day trips. Both trips were amazing, and I would totally recommend them to anyone spending some time in Barcelona. Then we took a train to Madrid. It was a packed schedule, and seeing Prado in one day is obviously a challenge, but we succeeded. The next stops were all in the south of Spain. Starting from Seville (with a day trip to Cordoba), onto Ronda and finally Granada, it was a route packed with great views and amazing foods. We rounded everything up by flying back to Barcelona and then in a day to NYC.

Thus, this year beings me into the great New York City! I will be heading back to Houston soon, but in the meantime, I am going to enjoy some bagels, pizza and public transport.

Work

I spent the first half of this year working on a project as a consultant with TruQua Enterprises, which involved quite a bit of travel and interactions with clients. It was an engaging opportunity to learn more about inner workings of a large business, and assist people by providing technological insights into their processes. I feel like I learned to be a better speaker and conversationalist, as well as, an attentive and active listener. Even if some days were slower or more heavy on the grunt work, I feel like I grew a lot through this experience.

In the second half of the year I have started my PhD in Computer Science at Rice University. My first semester was a combination of learning from courses and from peers in the lab. I took a great course on optimization taught by Tasos Kyrillidis. It was a mix of nice and fun math with solid practical motivation coming from the field of machine learning. I would absolutely recommend this course to anyone who has a chance of attending and is at all interested in anything mathematical or machine learning related. I also spent a lot of time working on a project in Treangen Lab, which was both stimulating and fruitful. I learned a lot of new material through this work, and strengthened some of my skills in data preparation and visualization.

Research

In the first half of this year I continued my work on the conformational transition pathways of insulin degrading enzyme and in early March I presented a poster on it at Biophysical Society Annual Meeting. It is an exciting project and it involved a lot of hours and effort both in learning and implementing simulations and analysis. I have continued this work by conducting more analysis, proposing a coarse grained model and starting out a set of swarm of trajectories simulations for determining the conformational transition dynamics of this protein. As the year progressed, I got busy with other projects and my current involvement with the project became minimal. I am currently finalizing some write ups and planning on handing over the project to the next students.

At Rice, I joined in on a project that involved graph theoretic analysis of metagenomic read data. It was my first research experience in genomics and metagenomics and I felt amazing about it. I was happy to contribute my knowledge of graph theory and general computer science to this exciting area. This project was developed in collaboration with Advait Balaji, a second year PhD student in Treangen lab. Throughout the entire process I had an amazing support from my mentors and our collaborators.

Overall, I’m happy to continue my work in the area of computational genomics and metagenomics. This is an exciting area, and I feel like I can both learn a lot and contribute significant work in the process.

As usual my research interests remain broad and lie at the intersection of mathematics, computer science and biology. I am looking forward to developing more work that can be directly applicable in both research and clinical environments, but of course that is a lengthy and complicated process.

Personal projects

This year not too many of my personal projects saw light, although I made a lot of progress on some. Overall, I feel that I carved out a few main directions for my personal work and I plan to strengthen and pursue those further in 2020.

First major initiative is exploration and adaptation of different data visualization techniques and tools. I have learned more about color pallettes and colorblind friendly design. I feel that this will make my subsequent projects more aesthetically pleasing and accessible. I also plan to round up a solid review of some basic data analysis pipelines and eventually release it as useful resource for my lab, as well as the community at large. Finally, I plan to finish up my “Where in the world” project, which was a foray into D3.js and geographic visualization.

Next, I am looking to reinforce my paper sorting and reading habits. I was meaning to organize my personal digital library for a while, but was falling short on convenient tools to do so. Therefore, I might eventually to settle either for a mix of practices and tools, or possibly write up some code myself to further streamline the process.

I haven’t worked much with Raspberry Pi or Arduino with the exception of Scav projects this year. While home automation appeals to me, I am not sure I’ll have enough time to invest into it next year, so I think most of these projects will remain on the shelf for a while.

Finally, this blog while not too active, will stay alive and will be an outlet for updates throughout the next year.

Thanks for reading, and see y’all in 2020!

Update on my whereabouts

Good evening world! Long time no see.

A lot of stuff has happened since I the last time I have updated this blog. I’ll go through most of it briefly while focusing on the things relevant for the blog content and my current plans for the upcoming few months.

I gave a super brief lecture on Fourier analysis to several first year students at the University of Chicago (non-math majors). The full write up for this lecture is long overdue, but I hope to finish it some time soon and post the full thing on here. As per usual, since I was trying to include several worked examples into the write up it got bloated very fast and as a result, I wasn’t able to finish it on time.

I have also developed a lot of new content for teaching basics of Python and programming in general. I plan to roll out those as a series of bi-monthly write-ups with Jupyter notebooks attached. The tentative idea is to post those on the first and third Sundays of the month.

I have finished up my job in consulting just a bit too shy of seeing the project go fully live. I had a wonderful time at TruQua Enterprises, and I am glad that my work made a meaningful contribution to the cause. I am not sure whether I’ll end writing up a note about my experiences and what I have learned while working in consulting, but if it happens I’ll share it here as well.

I also traveled back home to Moldova for a month. It was a relaxing experience and I got a chance to be a tourist in my own country visiting two wineries and touring their production and maturation facilities (with a tasting included). I also wrote up a set of problems for the IOI training camp, but due to the issues with creating enough testing sets, those were not used yet. Once the full set has been used by the team for the training purposes, I will make the contents of it, both problems and solutions available to the general public.

Last but not least, I moved to Houston, Texas and have started the PhD program in Computer Science. I am finishing up the orientation week activities and getting ready for an exciting semester ahead of me.

Cheers, and see you soon.

Teaching kids to code: I’m a computer scientist and I think it teaches kids important skills

Good afternoon world!

To those of you who know it doesn’t come as a surprise that I care a lot about teaching, in particular mathematics and computer science. Recently while browsing r/programming I came across an article that gives a perspective of a software developer on why you should not teach your kids how to code. I think the article brings up several good points, but is not quite complete and draws a conclusion with which I thoroughly disagree. Thus, I have decided to present my take on the issue, and point out a few nuances which I think are important to consider when making claims about usefulness of teaching coding.

Why is teaching coding a bad idea?

First of all, I need to say that the article is well written and definitely tackles some major issues with the hype around learning how to code. However, I would like to take a closer look at the following paragraph.

A former co-worker of mine was trained at a coding boot camp with the motto “Coding Is the New Literacy”. That sentiment is at the heart of all the programming books and games. The description on one popular book says starting coding early is “essential to prepare kids for the future”. This gives the impression that not teaching kids to code is somehow equivalent to not teaching them to read.

Joe Morgan, Slate.com

I think this is a valid criticism of aggressive advertisement of coding boot camps and resources target at children. However, the language of “new literacy” can be taken apart from a different angle. Just like we learn basic reading skills, we also learn basic mathematical reasoning at the young age. In many ways coding is a field that brings together the abstract mathematical reasoning and applied results. Hence, while the branding of “new literacy” is misleading, we can think of coding as a rather “old literacy” repackaged into a modern set of tools and scenarios.

In particular, a good amount of the critique presented in the article is aimed at learning syntax of particular language rather then general problem solving. I do agree that teaching coding, regardless of the age group in fact, should be aimed at cultivating problem solving skills and developing abstract thinking. However, the second point is exactly why I disagree with the article’s author. While we talk about problem solving a lot, it seems like a common pattern to think of teaching problem solving in the context of applied life skills: assembling furniture, cooking, playing with LEGO, etc. However, what many of these examples lack is the abstraction aspect that is essential to mathematics and computer science. Even the notion of something so natural as a natural number hides in itself a deep abstraction leap that is often taken for granted. When we think of the number 3, we are thinking of an abstract notion of count. Three apples are inherently different from three oranges or three chairs, but we are thinking of some abstract property that unites all of these examples, namely the count. The number 3 does not exist on its own, we can’t create a palpable 3, but we still are capable of thinking and knowing exactly what we mean by it.

Hence, mathematics and by a natural extension coding is not only about problem solving, drive and perseverance. It is also about abstract thinking, which is something that needs to be cultivated early on. I have encountered multiple people who struggle in college level proof-based mathematics classes, because the school system has failed at teaching them abstract thinking and reasoning to a satisfactory degree. I want to reiterate, that it is not a flaw within those people, and it is not some special quirk of mathematics as a subject. Anyone can learn mathematics, and everyone should learn some basic skills from it. The most valuable skill being exactly the power of abstract thinking.

So what exactly is abstract thinking?

It is hard to define exactly what do I mean by abstract thinking, but there are a few common trends that occur throughout examples of it. First, there is a pattern recognition part of any abstraction. Namely, an abstract property arises often as a natural recognition of a pattern in observed world. For example, we can go back to counting example. We observe a certain natural property of the objects that surround us. They can appear in different quantities. One way of abstracting the idea of quantity is precisely counting. When we think of apples on the table, we can consider their individual volumes (another abstraction) or masses (abstraction again), but perhaps we can also consider the number of individual apples, i.e. their count. We recognize some pattern to our world, and create an abstract concept to reflect it.

Now, there is a second common trend, classification or identification of the equivalence classes of patterns. This sounds complicated and is probably peppered with strict mathematical definitions (listed in Appendix). However, I am arguing that in fact it is one of the most natural things that people do! This idea was brought to my attention by prof. Laszlo Babai during one of his lectures on discrete mathematics. We do notice same colors, and group things based on the color, without realizing that in fact we are identifying an equivalence class. We do recognize that three apples have the same count as three oranges, therefore identifying an equivalence class among things that can be counted, a class that we simply call 3. The same can be said about 5, and 7 and so on. We identify an abstract equivalence through observation of natural world.

The third commonality of abstractions is generalization or cross-applicability, if you wish. Once we develop an abstraction, we start noticing it new places, and realizing that the same logical process can be repeated and applied anew to a different scenario. First, let me tell you a classic joke.

Working late at night, a mathematician doesn’t notice how one of her candle tips over and sets some of the papers on the table on fire. Once she realizes what is going on, she rushes to the kitchen grabs a glass of water, pours it over the table and extinguishes the fire.

A few weeks later she has her friend over for an evening of whiskey and cigars. The careless friend throws a still lit match into the trashcan setting some papers in it on fire. The mathematician dumps out the flaming papers on the table, rushes to the kitchen for a glass of water, and then puts out the fire.

Her puzzled friend asks: “Why did you dump the papers on the table first?”

Mathematician replies: “Oh, I just reduced the problem, to the one I have solved two weeks ago!”

Folklore

This is a classical example of reducing to the problem previously solved, or if thought about slightly differently, recognizing the applicability of the same abstract pattern to a new case. In our apple arithmetic example, we can think of the following: we already realized the abstract notion of the numbers 3 and 5, and the pattern of them forming the number 8 when put together. Now, if we suddenly find ourselves with the same pattern for oranges, we already will know the answer 3 + 5 = 8. What helps us is the abstraction (the object doesn’t matter, the count does) and its generalization to any type of countable objects.

Thus, while not exactly answering what abstract thinking is I outline three important aspects of it, namely: pattern recognition, equivalence recognition, and generalization.

How does one teach kids to develop abstract thinking and what all of this has to do with coding?

We are incredibly lucky here, because a lot of basics of abstract thinking come to us for free as a perk of being human. Furthermore, a lot of basic children literature is already aimed at developing skills tied to pattern and equivalence recognition. The generalization of the abstractions on the other hand is not always common in early teaching, and is one of the important aspects of mastering abstract thinking. Kids would often struggle with basic algebra concepts, such as a variable or unknown. What is important is teaching them in a way that allows these notions to be genuinely recognizable as common patterns, that simply generalize the rules already learned. In that line of thought, a function f(x)=x+2 can be though of as the notion of taking some number of objects in a countable class and adding two more of the same object. Adding two apples to how many apples you already have (but recall, apples did not matter in the end).

So how does coding tie into this entire story? Well, coding in itself is full of abstractions, and therefore presents a rich playground for maturing the concepts and ideas of abstract thinking. However, unlike mathematics or physics, coding has a unique aspect to it that allows us to see practical implications of our abstract reasoning.

It is exciting to see how some words that you wrote turn into a circle on your screen, or a square, or a flag (more on that in the next post). However, it is also important that this exemplification allows kids to solidify and check their abstract reasoning. A for loop is an abstract construction that allows you to repeat an action some prescribed number of times. It is good to have an understanding of this abstraction, but solidifying it by seeing how changing the number of repetitions, changes the number of figures drawn is extra nice. It brings back that natural observation component to the abstract thinking, which should enable a young learner to thinking creatively and develop a new graspable connection between abstract generalized concepts and tangible everyday observations.

Conclsuion

Coding gives us an opportunity to learn abstract thinking while continuously supporting it with natural observations. In the similar way to cooking and tinkering with LEGO, we get a combination of ideas and observable consequences within one process. We should shift the aim of coding books and boot camps for children from “one true answer” syntax oriented problems, to thinking and skills oriented puzzles. The goal of such education is not to foster a particular skill in a programming language, but to create a thinker, who can notice patterns and effectively generalize them to new problems encountered.

We have an amazing tool that can easily grasp attention and provide a rich and exciting framework for learning. Instead of shunning it due to novelty or perceived shallowness, we should think about how we can use it to teach and learn what is truly fundamental: the abstract thinking!

Appendix

A little bit of dry math for the formal definitions to some of the stuff I have been talking about. Keep in mind that these highly formal and abstract definitions in fact tie back to our natural observations.

Definition. A binary relation R is defined as a set of ordered pairs (a,b) of elements a from a set A and b from a set B, i.e. a subset of A\times B. We say that two elements a,b are related, denoted aRb iff (a,b)\in R.

Definition. We call a binary relation R defined on the pair of sets A and A an equivalence relation iff it satisfies the following properties:

  1. Reflexivity: aRa
  2. Symmetry: aRb \implies bRa
  3. Transitivity: If aRb and bRc, then aRc.

Definition. Given a set A, an element of that set a and an equivalence relation on this set \sim. We call the set [a]=\{x\in A|x\sim a\} the equivalence class of the element a.

Definition. A partition of a set A is collection of disjoint sets B_1, B_2, ..., B_n s.t. their union equals A.

Theorem. The set of equivalence classes of A under an equivalence relation \sim is a partition.

Definition. We call the set of equivalence classes of a set A under an equivalence relation \sim a quotient set, and denote it by A/\sim.

Why any (comprehensive) course on computer programming should cover basics of C: A story.

Good evening world!

Recently I came across a small task that reinforced my belief in the importance of C programming. The task was the following:

  1. Generate 500,000,000 pseudo-random numbers using Linear Congruential Generator algorithm [1, 2].
  2. Use Box-Mueller transform [3] to get normally distributed random variables (RV).
  3. Compute the next step in the evolution of a deterministic system, and add noise using the generated normally distributed RV.
  4. Write the result of every 1000th step to a file.

In other words, we need to run a for-loop for 500,000,000 steps, doing some calculations (generating RV + evaluating deterministic function), and writing to a file once in 1000 steps.

This doesn’t sound particularly challenging, and the whole thing can be done in less than 80 lines of C code. Same task can also be done in about 45 lines of Python. However, LOC is not the metric I want to look at here. I want to talk about performance of the code written, and some general educational caveats.

Is the field set level?

Let’s talk a tiny bit about optimization, under-optimization and over-optimization here.

The moment I shared this story with my friend, he immediately said that the performance comparison doesn’t make sense. However, the argument provided was the following: if the Python code takes that much longer to run, it clearly was not written well. I agree, when I was coming up with the comparison I was not using any fancy race-track wheels for Python. The entire script I wrote is as vanilla Python, as one can possibly imagine. Does this mean that I am cheating by employing “bad” or “inefficient” Python coding practices? I would say no, or at worst, just a little bit.

In both cases: C and Python, I wrote a vanilla implementation of the given task. Hence, no parallelism, no non-standard libraries, no pre-compiled/hardware optimized shenanigans. Did I manage to cheat a bit? Yes, of course I did, I compiled my C code with -O3 optimization flag. This of course is not the full story either. I did run my Python script naively invoking python ./generate.py rather than trying to compile it into optimized binary and then running it. However, for all of these “sins” I have a quite simple answer: I don’t do that with Python 99.9% of the time. I do not compile my Python scripts. I do not roam the web for pre-compiled computational libraries for Python. I do not tend to care that much about performance in the first place, when I code in Python.

How is this a conversation about optimization then? Well, I think we need to consider several parameters to be optimized, and then checkout what we get in terms of the relative performance. Hence, I will be thinking about 3 metrics here:

  1. Human minutes spent writing code (including debugging).
  2. Human minutes spent waiting for the program to finish running.
  3. sys runtime of the programs written.

In the context of these 3 parameters I can clearly define what I mean by optimizing, over-optimizing and under-optimizing performance of a task.

Over-optimizing: This is the case when I will spend a lot of time writing code that supposedly is great in terms of wall and sys times. Not surprisingly majority of the over-optimization in my case does not come from the assembly injections leveraging latest microarchitecture features. When I over-optimize with probability 0.9 it is due to me finding a paper proposing a fast algorithm that I am trying to write from scratch. Clearly this brings a caveat: asymptotically better performance, does not always translate into cycle-count performance on small enough examples.

Optimizing: Once in a blue moon, when working on a one-off personal project, I do hit the right spot. Just enough of complexity in the implementation to get a good average for the runtimes. Any properly optimized code should be optimized both in terms of human minutes spent writing it, and human minutes spent waiting for the results. However, as with anything in the world of computer programming, or life at large, there is a caveat: optimization is context dependent. Spending more development hours over code that has to be reused on a regular basis is worth it, as long as the eventual benefit in runtime pays for it.

Under-optimizing: This is what happens when the deadline is far away. Hacking together 25 lines of your favorite vanilla high-level language, and letting it run overnight, because you still have a week of time left, and one run only takes 14 hours. Surprisingly, I think that from a practical perspective this is more justified than over-optimizing. If I had to choose between code that takes 14 hours to run, but gets the job done, and code that takes 12 hours to develop and only 2 to run, I might go for the first one, because at least I can sleep or read for those 12 hours of difference. However, the caveat here is simple: if you need to run the code more than once, the unoptimized runtime will cost you a lot.

Performance Analysis

I was compiling and running all code on my personal laptop. The specs are listed below.

MacBook Pro (Retina, 13-inch, Mid 2014)

  • 2.6 GHz Intel Core i5 (4278U)
  • 8 GB 1600 MHz DDR3 RAM

Runtimes measured with time utility.

Runtime comparison for the generator scripts.

As you can see all across the times, the performance differs drastically. This is by no means a shocking or unexpected result, but it matters for the rest of the discussion.

Pedagogical Discussion

This post ultimately is about teaching and learning, so let’s finally talk about why any comprehensive course[F1] on computer programming must cover some basics of C language.

First, C is a compiled language. While the intricacies of compiling as a process lie beyond the introduction level, the acknowledgement of compilation as a step in a lifecycle of a program is critical. Virtually anything that has to do with computer programming in its broadest definitions can benefit from a better understanding of the full picture. As an example I can bring up a recent workplace story, where as we discovered certain business logic scripts where ultimately compiled into SQL statements. When the underlying tables changed, SQL statements became invalid, while the surface level logic remained perfectly sound. Thus, it took a bit of tinkering around to find out that in fact we had to trigger a re-compile for the SQL to become valid again. Hence, if you have a better knowledge of the full picture, then your bug fixing abilities are also better.

Second, C has great performance metrics. As the first part of this story shows, C does in fact yield quiet great performance in its vanilla form. Of course you have to be mindful of your project scope. In terms of over-optimization failures C is probably at the top of the list in close competition with C++. Just think of all the linked list implementations ever written in C. Now, think of all double linked list implementations, and FFT implementations, and Dijkstra’s algorithm implementations, and so on ad nauseam. Writing code in C oftentimes feels like re-inventing the wheel. In fair part because it is. However, when the task at hand boils down to arguably simple arithmetic operations that need to performed at medium scale, writing it up in C is probably the best bet.

Third, C is ubiquitous (unless you are on Windows). If you have *nix system it comes with either gcc or clang or some other form of C compiler. No need to download a myriad of heavy IDEs and weird things. To be fair the same can be said about vanilla Python, which in part is why I love it so much (but still use Anaconda).

Fourth, C builds discipline and rigor. I am not talking about painstaking debugging of memory leaks and segfaults. I am not talking about arcane magic of pointer arithmetic. Those things are clearly important, but I am talking about very very basic C. You need to declare variable types. You need to allocate stuff ahead of time. You need to keep track of what moving parts you’ve got in the game. These things amount to cleaner code and better style. You have to at least minimally organize your thoughts before writing in C. Hopefully, that generalizes to the same concept for all of the code you will write.

Finally, C is just another programming language. I firmly believe that breadth of programming knowledge is equally if not more important than depth for about 80% of people who will ever write code. In the job market it is hard to guarantee that your amazing knowledge of C++ will always be equally demanded. In the academia you might realize that the lab you just joined does things in Perl. You can probably still write half of your code in Java, but then you need to interface to the rest of the Perl codebase… You get the general idea. On the other hand, “the jack of all trades, but master of none” kind of programmer will be more likely to pick up a new language from the docs, because it is needed. In this regard C serves as a good training ground for general programming principles.

Hence, in the end we have a performant language that exposes you to some fundamental programming concepts and builds up a better coding discipline.

References

  1. S. K. Park, K. W. Miller. Random number generators: good ones are hard to find. Comm. ACM (31):10, pp. 1192-1201, 1988. [DOI: 10.1145/63039.63042]
  2. P. A. W. Lewis, A. S. Goodman, J. M. Miller. A pseudo-random number generator for the System/360. IBM Systems Journal (8):2, pp. 136-146, 1969. [DOI:10.1147/sj.82.0136]
  3. G. E. P. Box, M. E. Muller. A Note on the Generation of Random Normal Deviates. Ann. Math. Statist. (29):2, pp. 610-611, 1958. [DOI: 10.1214/aoms/1177706645]

Footnotes

[F1] By comprehensive course I mean an academic to a calendar year long introductory sequence on computer programming. Examples would include any “Intro to Computer Science… n = 1, 2, 3, …” sequences, and any bootcamps that aim to teach you computer programming. I do agree that there are shorter courses that clearly cannot cover learning C. However, I would also argue that such courses by no means are comprehensive.

Biophysical Society Annual Meeting 2019

Evening world! ( I have rewritten this 5 times at this point)

My visit to the BPS AM’19 was very short, but nevertheless I am happy that I had the opportunity to travel to this conference. Those few days have been eventful, and I have met and talked with multiple amazing people . Below I would like to share some of my thoughts on the talks, presentations and posters I have seen and engaged with. Since I have left early Monday morning, I cannot comment on the most of the content presented, so take this as a short snippet of my thoughts and not a comprehensive review.

Style

I firmly believe that style and presentation are extremely important in scientific world, especially during conferences. Good content organization and clear, simple to follow figures greatly improve audience’s ability to engage with material. I also think that organizing your content in a parsable way is the minimal courtesy you owe your readers.

I will start with some positive notes and remarks. First of all, black backgrounds in presentations are great! Of course there is a caveat of choosing appropriate color schemes for your figures (which I bet will deter people from going this route, since who wants to spend time making two sets of figures: for your presentation and for the paper), but when done well black background makes the content shine. The few presentations that had black background at the AM, warmed my heart.

Next, I need to applaud the usage of space on the slides. I haven’t seen a single overcrowded slide (maybe I just got lucky). Spreading out plots and figures so that they are readable is absolutely essential. Space usage is something I’ll come back to when talking about posters (including mine).

Now, let me move onto my incoherent rambling about style. I am in the process of learning Beamer, and it is definitely on the list of things I plan to master by the end of the year. It seems that the usage of LaTeX and LaTeX based tools and packages at this conference is almost exclusive to people coming in from the theory angle. The only talk I have seen for which slides were coming in the Beamer generated fashion was a theory talk on the inverse temperature dependence. I am not too surprised that this is the case, for two reasons. On one hand, majority of the CS, Math, Physics theory people would learn and use LaTeX anyway, which makes copy pasting and rearranging stuff easier. On the other hand, picking up LaTeX just for the presentations, especially if they are not peppered with formulas all the way through, seems like an overkill. I do hold a firm belief that people working in STEM fields should have some basic LaTeX skill as a part of the undergrad experience or early professional, but I would not go as far as to claim that we need to build everything in LaTeX. This is mostly an observation, as I reserve my LaTeX evangelism for a different post and time.

One thing that I do not grasp at all is why in 2019 we still have Comic Sans being used in presentations. Honestly, there is a plenty of great fonts that can catch people’s eyes if you are going for catchy. I do not share the ridiculous disgust towards Comic Sans, but even if we look at the original design specs it is supposed to be informal, children oriented font. Of course great research typed in Windings, will be ridiculous to read through, but that won’t change the fact that the underlying work is important. However, the font choices like this scream lack of professionalism and absolute style illiteracy to me. Just to reiterate this point, your style choices affect my perception of your attitude and skill, they do not affect my perception of your research work. Please, stop the Comic Sans abuse at science conferences!

Whew, okay, now let’s move to posters. First of all, while still far from perfect, I am fairly satisfied with the way my poster turned out. My main concerns are the crowded middle region where I had to have partially overlapping plots, and some not properly cleared up VMD renders that I simply did not have enough time to redo due to the data availability. Modulo those nuances I was fairly happy with the color scheme and organization, including some accidental design features. Coloring both the protein renders from VMD, and the lines in the plots with the same color convention was definitely a good move, as it allowed for seamless parsing of the plots, and saved much needed space by reducing number of legends I had to carry. An accidental feature was positioning the initial question with the illustrations at roughly same horizontal level as the conclusions. Since both contain a very similar illustration, it was easy to visually grasp what is that thing that changed over the course of our work. I am not fully pleased with some of the text in the poster, and I definitely would want to have another 2-3 days of re-reading and editing it, but alas.

The vast majority of the posters I saw at the conference were great. Clear organization, good posting sentences, and clear figures that stood out from the distance. However, there were a few common issues that I have noticed, predominantly in undergraduate posters. First is simple lack of content, or rather overabundance of white space. I have seen one too many posters with 2 or 3 figures, 3-4 paragraphs of text, and a canvas large enough to fit at least 3 more figures. I assume that this problem partially stems from the poster size expectation. Hey, we all think of 36″ x 24″ poster, and how this is a standard. However, I’d rather see more 24″ x 18″ posters that are content packed than 36″ x 24″ full of wide margins. Worst thing is that sometimes a poster with huge margins will still overcrowd the plots in the middle of itself. No bueno. Spacing out and arranging stuff is a pain in the butt, I know this myself. Plots come in weird proportions, plotting software is a monster to tame, and things tend to look different on your screen vs matte letter print vs glossy/matte poster print. However, the grind against just going with “Egh, good enough” is essential step in improvement of style. My work on this poster taught me a good principle, that my PI learned from his PI, so let me pass it onto you.

Make your poster, then print it out on Letter/A4 paper. If you can read the main figures clearly, then it’s a good font and line size, and proper spacing. If you are having trouble making sense of it on this print, go back to the drawing board.

NS paraphrasing Esmael Haddadian paraphrasing someone else…

On the other hand, I was impressed by some of the very creative approaches to poster presentations. Namely, the poster on Structure of a Non-canonical Middle Domain in Protocadherin-15 by Brandon L. Neel had an interesting twist to it. In addition to the regular static imagery, there also was a tablet set up, so that the interested people could watch MD trajectories evolve over time. I think that with the technology we have at hand right now, people who go extra mile to get their point across and present data in new and informative ways are the people with the higher potential for an impact.

On this note, I will wrap up my discussion of the style and presentation choices and do a quick dip into some of the research related moments and ideas.

Research

An obligatory, I am not a biophysicist by training. I am a computer scientist, I code things, analyze how fast they run, and think about how math models can help us solve real world problems. Thus, the rant below is not necessarily driven by deep professional concerns from a biophysics angle, but rather by some common sense scientific practices and beliefs.

I have attended some talks by scientists from D. E. Shaw research group, and while the content was impressive, I do feel deeply troubled by the messaging and approaches. Let me elaborate a little bit. Science by a large degree consists of trial and error. Especially repeated trial and repeated error. This applies to both experimental and simulation based results. If you were to measure or simulate something once, you would not publish a paper claiming it as a result. (Well sure in some rare cases you would, but that’s a more involved discussion.) Repeatability and multiple data points are essential to any kind of result that you are willing to cite as evidence for your hypothesis. Now, let us take this one step further. If I say that I did 700 independent experiments, and 95% of them yield a certain result with error margin of epsilon, you probably will agree that this sounds like a solid dataset. However, what if I also tell you that at this point in time no other research group can replicate my results? Sure, that also happens, sometimes the machinery is too complex and in many senses unique to the experiment. I mean after all both LHC and LIGO are not easy to come by at every other lab. On the other hand, those are large projects open to multiple collaborators within the field. They are not exactly a proprietary system with limited access given only to the employees of a particular private company. I think you see where I am going with this.

Work done by D. E. Shaw group is marvelous. However, I do have an issue with irreproducibility of the results. Say someone really wants to push for a certain drug affecting catalytic cycle of a protein involved in a disease. Assume that someone has ability to simulate the said catalytic cycle at high resolution with appropriate timescale. Now, what if someone accidentally messes up the parameters of their simulation? We get a perfect proof of effectiveness backed up by 500+ simulations, which then turns into active drug development. If 3 years down the line we discover that this in fact is bogus and doesn’t work, well we have wasted 3 years. If we don’t then we have a new drug, which might still end up being not quite as good as promised.

This is a long stretch hypothetical, but it has some ground behind it. Datasets have been cooked before (the example I am thinking of is a very particular test/train split in a study on the effects of certain drug combinations in cancer treatment), especially when money is on the line, and when pharmaceutical companies are in play, there always will be money on the line. Thus, while I hope that the researchers are keeping level heads and cold judgement, I cannot simply dismiss the concern about exclusivity of the technology that makes the MD research at D. E. Shaw beyond the cutting edge.

I believe that we should strive for repeatability and cross-validation in science. There always will be obstacles on this path, and I do believe in intellectual property rights, patents and related concepts. However, I think that the balance shall be maintained, and tools could and should be shared. After all making the first hammer is an invention, but once we have one it’s about who can do the most with it.

Concluding remarks

I started writing this almost a month ago, so I am not sure how connected those pieces feel. This was a long ramble about random stuff, so critique away at it.

Cheers, ns.

Baltimore, 02/28 – 03/03

Good afternoon world,

I am traveling to Baltimore, MD this weekend for Biophysical Society Annual Meeting. I will be presenting my research on the insulin degrading enzyme (IDE) as a part of the poster session on Sunday at 1:45pm. Feel free to stop by and say hi, and/or inquire more about my research. [doi: 10.1016/j.bpj.2018.11.301].

Hopefully, I will also have some free time to roam around the conference and the city.

I will be presenting tomorrow in the Exhibit Hall C, stand B13. Drop by to say and chat about research!

Sets

General definitions

From Wikipedia:

In mathematics, a set is a collection of distinct objects, considered as an object in its own right. For example, the numbers 2, 4, and 6 are distinct objects when considered separately, but when they are considered collectively they form a single set of size three, written {2, 4, 6}. The concept of a set is one of the most fundamental in mathematics. Developed at the end of the 19th century, set theory is now a ubiquitous part of mathematics, and can be used as a foundation from which nearly all of mathematics can be derived. In mathematics education, elementary topics from set theory such as Venn diagrams are taught at a young age, while more advanced concepts are taught as part of a university degree.

To break this down into simpler terms there are two important aspects of what constitutes a set:

  1. A set is a collection of distinct objects.
  2. A set itself constitutes an object, i.e. we can think of it as a tangible collection.

An example of a set can be pizza offerings at Giordano’s (a pizzeria in Chicago). This set contains distinct elements: Pepperoni pizza, Supreme pizza, Goat cheese and spinach pizza, Italian sausage pizza, Margherita pizza; and is in itself an object: a pizza menu.

The code below illustrates how we can declare a set in Python.

my_set = {0,3,4,0,7,9,13,35,0}
print(my_set)
{0, 3, 4, 35, 7, 9, 13}

We can see that in fact, even if we declared some non-distinct (i.e. repeated) elements, the set doesn’t contain them, as evidenced by the print() function.

Set Membership and Subsets

Given an object and a set we can test whether this object belongs to the given set. This is a check for set membership. We can also verify if an object does not belong to a set.

Given a set $A$ and an object $x$, we use the notation $x\in A$ to denote that $x$ is an element of $A$. We also use notation $x\notin A$ to denote that $x$ is not an element of $A$.

The code below illustrates how we can test these conditions in Python.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
print(3 in A)                         # Will print True, because 3 is an element of A.
print(6 in A)                         # Will print False, because 6 is not an element of A.
print(7 not in A)                     # Will print False, because 7 is an element of A.    (note the use of "not")
print(8 not in A)                     # Will print True, because 8 is not an element of A. (note the use of "not")
True
False
False
True

Another important relation is that of being a subset. If membership is a relation between an object and a set, then being a subset is a relation between two sets. Namely we say that $B$ is a subset of $A$, denoted $B\subseteq A$, if every element of $B$ is also an element of $A$. We also will say that $B$ is a proper subset of $A$, denoted $B\subset A$, if every element of $B$ is also an element of $A$, but there are elements in $A$ that are not in $B$.

The code below illustrates how we can test these relations in Python, and provides some examples of subsets and proper subsets.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 3, 5, 7}                      # B is the set of prime numbers less than 10.

print(B.issubset(A))                  # Check if B is a subset of A. Will print True.
print(B <= A)                         # Check if B is a subset of A. Will print True.
print(B < A)                          # Check if B is a proper subset of A. Will print True,
                                      # since all elements of B are in A (subset condition),
                                      # but 11 is in A, and not in B (proper condition).
        
print(A.issubset(A))                  # Check if A is a subset of A. Will print True. 
print(A <= A)                         # Check if A is a subset of A. Will print True.
print(A < A)                          # Check if A is a proper subset of A. Will print False,
                                      # since all elements of A are in A.
                                      # Note: a set is always a subset of itself.
True
True
True
True
True
False

Set Operations

Now, let us take a look at some common set operations. As many things in mathematics, these concepts can become more natural if visualized. Hence, let us briefly introduce the idea of Venn diagrams.

A Venn diagram is a schematic representation of a set and its possible relations with other sets. We usually will use (possibly misshapen) circles to denote the “set” and colors or the elements itself to denote the elements of this set. The few examples below will illustrate this idea.

Venn diagram of letters

Venn's four ellipses diagram

Set Union

The first set operation we will look at is set union. We can think of it as addition for the sets. The result of a set union is the set containing elements that appear in either of the sets. The following Venn diagram shows in red the union of sets $A$ and $B$, denoted $A\cup B$.

A union B

We can compute a union of two sets in Python by using the union method or by using | operation on sets. The code below illustrates this.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A | B
print(C)
{2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19}
A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A.union(B)
print(C)
{2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19}
# Note that any possible "overlapping" elements will be only accounted for once,
# and thus the result will be a set (elements are distinct). This can be seen from
# the Venn diagram (the intersecting region is covered once) and from the example
# below.
A = {4, 5, 6, 7, 8, 9}
B = {6, 7, 8, 9, 10, 11}
C = A | B
print(C)
{4, 5, 6, 7, 8, 9, 10, 11}

Set Intersection

The next operation is set intersection. A set intersection is a set (possibly an empty one) that contains elements that appear in both sets. In other words, intersection is the overlap of the original sets. The following diagram shows the intersection of sets $A$ and $B$, denoted $A\cap B$.

A intersection B

We can compute an intersection of two sets in Python by using the intersection method or by using & operation on sets. The code below illustrates this.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A & B
print(C)
{2}
A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A.intersection(B)
print(C)
{2}
# Only the elements present in BOTH sets get into the intersection. Thus in some
# cases the intersection can be empty. A Venn diagram for thsi case would be two
# non-overlapping circles.
A = {4, 5, 6, 7, 8, 9}
B = {10, 11, 12, 13, 14}
C = A & B
print(C)
set()

Set Difference

Next operation we will look at is the set difference. It is useful to know which elements belong to one set, but not the other. The set difference is a set that contains elements from the first set, but not the second one. The following diagram shows the difference of sets $A$ and $B$, denoted $A – B$ or $A\setminus B$.

A difference B

We can compute a difference between two sets in Python by using the difference method or by using - operation on sets. The code below illustrates this.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A - B
print(C)
{3, 5, 7, 11, 13, 17, 19}
A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A.difference(B)
print(C)
{3, 5, 7, 11, 13, 17, 19}
# Note that just like the difference of two numbers depends on the order, the difference
# of two sets also depends on which one we want to subtract from. The example below
# illustrates this idea.
A = {4, 5, 6, 7, 8, 9}
B = {7, 8, 9, 10, 11}
C = A - B
D = B - A
print("A - B is {}".format(C))
print("B - A is {}".format(D))
A - B is {4, 5, 6}
B - A is {10, 11}

Set Symmetric Difference

The last set operation we will talk about is the symmetric difference. There are several ways you can think about the symmetric difference, but all of them encapsulate the same idea. We want to have a set that has elements that appear in either $A$ or $B$, but not in the both sets. Using the notation defined above we can write this as $(A\cup B) – (A\cap B)$ (the union/sum of the sets minus their intersection) or alternatively as $(A – B) \cup (B – A)$ (the $A$ without $B$ union $B$ without $A$). The following diagram shows the symmetric difference of sets $A$ and $B$, denoted $A \Delta B$.

A symmetric difference B

We can compute the symmetric difference between two sets in Python by using the symmetric_difference method or by using ^ operation on sets. The code below illustrates this.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A ^ B
print(C)
{3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19}
A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = A.symmetric_difference(B)
print(C)
{3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19}
# The symmetric difference, unlike the regular set difference is symmetric. 
# Which means that the order of sets does not matter, the result will be
# the same, as illustrated by code below.
A = {4, 5, 6, 7, 8, 9}
B = {6, 7, 8, 9, 10, 11}
C = A ^ B
D = B ^ A
print(C)
print(D)
{4, 5, 10, 11}
{4, 5, 10, 11}

Additional remarks

Just like arithmetic operations are defined using two numbers, but can be extended to lengthier expressions, the set operation can be applied to multiple sets. In Python the easiest way to encapsulate this concept is using the set operations |, &, -, ^ and the appropriate () to group those operations. Examples below illustrate this idea.

A = {2, 3, 5, 7, 11, 13, 17, 19}      # A is the set of prime numbers less than 20.
B = {2, 4, 6, 8, 10, 12, 14, 16, 18}  # B is the set of even numbers > 0 and < 20.
C = {3, 5, 7, 9, 11, 13, 15, 17, 19}  # C is the set of odd numbers > 1 and < 20.
D = {3, 6, 9, 12, 15, 18}             # D is the set of numbers divisble by 3, > 0 and < 20. 
# This set is the union of C and D minus A. 
# Effectively it will contain numbers divisible by 3 or odd, that lie between 1 and 20,
# but will not contain the numbers that are prime.
E = (C | D) - A 
print(E)
{6, 9, 12, 15, 18}
# This set is a symmetric difference of A, B and difference between
# C and D. 
# Effectively it will contain numbers that are either prime, or even, or
# odd, but not divisible by 3. However, it will not contain numbers that satisfy
# more than two of those conditions at the same time (i.e. it won't contain 11,
# since it is both prime and not divisible by 3).
F = A ^ B ^ (C - D)
print(F)
{3, 4, 6, 8, 10, 12, 14, 16, 18}

Use Cases

We went through the trouble of learning the definitions for basic set arithmetic (the Operations section) and membership and subset relations (the Membership and Subsets section), so now is a good time to present some use cases for these structures and operations.

Besides being an essential building block in modern mathematics, sets often present a highly convenient data structure in programming. The examples below will guide you through some useful applications of sets in programming. Some of these examples are inspired by real production code.

Filtering down unwieldy lists

Sometimes we are faced with the problem of filtering a rather large list to only show unique values. A few common examples include the following:

  1. Identifying the unique caller IDs for a large list of phone calls.
  2. Identifying categories of the items carried by a store from the full inventory list.

Below we will address both of the problems by leveraging the property that a set contains distinct elements, and hence will effectively filter out only the unique elements.

# Problem 1.
# ----------
# Write a function that takes in a list of phone numbers (as strings),
# and returns a list containing the unique phone numbers from the original
# list.
#
# Input: list of phone numbers.
# 
# Output: list of unique phone numbers.
def phone_id_unique(numbers):
    unique_numbers_set = set(numbers)
    unique_numbers_list = list(unique_numbers_set)
    return unique_numbers_list
# Problem 1.
# ----------
# Tests:
#
# 1. Input:  ["800-000-0000" repeated 1 000 000 times]
#    Output: ["800-000-0000"]
test_input = ["800-000-0000"] * 1000000
print(phone_id_unique(test_input))

# 2. Input:  ["800-100-0000" repeated 1 000 000 times, "800-200-0000" repeated 1 000 000 times, ...,
#             "800-900-0000" repeated 1 000 000 times]
#    Output: ["800-000-0000", "800-100-0000", ..., "800-900-0000"]
test_input = []
for i in range(1, 10):
    test_input = test_input + ["800-{}00-0000".format(i)] * 1000000
print(phone_id_unique(test_input))

# 3. Input: ["800-000-0000", "800-010-0000", "800-020-0000", "800-030-0000"]
#    Output: ["800-000-0000", "800-010-0000", "800-020-0000", "800-030-0000"]
test_input = ["800-000-0000", "800-010-0000", "800-020-0000", "800-030-0000"]
print(phone_id_unique(test_input))
['800-000-0000']
['800-200-0000', '800-500-0000', '800-700-0000', '800-300-0000', '800-600-0000', '800-400-0000', '800-800-0000', '800-100-0000', '800-900-0000']
['800-020-0000', '800-000-0000', '800-010-0000', '800-030-0000']
# Problem 2.
# ----------
# Write a function that takes in a list of store carried product (as dictioanries),
# and returns a list containing the product categories that appear in the original
# list.
#
# Input: list of items.
# 
# Output: list of product categories.
def product_categories(items):
    categories_set = set([item["category"] for item in items])
    categories = list(categories_set)
    return categories
# Before testing we will load some data from .csv files. These files should be put into
# the same directory as the notebook. CSV stands for comma-separated values, and is a 
# common standard for representing data in text format.
items = []

import csv
with open("data_produce.csv", "r") as f:
    reader = csv.reader(f, delimiter = ",")
    for line in reader:
        items.append({"id": line[0], 
                      "category": line[1],
                      "stock": line[2],
                      "price": line[3]})

# Tests:
#
# 1. Input:  [1000000 items from 8 categories]
#    Output: ["perishables", "water", "kitchen", "furniture", "electronics", "paper", "pantry", "misc"]
print(product_categories(items))
[' pantry', ' furniture', ' water', ' kitchen', ' perishables', ' paper', ' electronics', ' misc']

Implementing common logical operations

Mathematical logic and set arithmetic are tightly connected. This allows us to use set arithmetic to model common logical operations, which in turn can easily encapsulate some everyday tasks we want to perform with out data.

Set union is analogous to logical OR operation, set intersection to logical AND, and the symmetric difference is analogous to logical XOR (exclusive OR) operation. Thus, we can use these operations to translate common tasks into set operations. Let us look at some of the examples below.

# Problem 3.
# ----------
# Write a function that takes in a set of items on mom's shopping list,
# a set of items on dad's shopping list, a set of items already bought by
# mom, a set of items already bought by dad, and finally a set of items
# that are currently in the fridge. The output should be a consolidated
# shopping list, i.e. it should only include the items that are not in the
# fridge and are not yet bought. 
#
# Input: 5 sets of items as described above.
# 
# Output: list of items that need to be procured.
def shopping_list_cons(mom_to_buy, dad_to_buy,
                       mom_bought, dad_bought,
                       in_fridge):
    all_to_buy = (mom_to_buy | dad_to_buy)
    all_bought = (mom_bought | dad_bought)
    to_buy = list(all_to_buy - all_bought - in_fridge)
    return to_buy
# Problem 3.
# ----------
# Tests:
#
# 1. Input:  mom_to_buy = {"apples", "candy", "chicken", "beef"}
#            dad_to_buy = {"candy", "beef", "bread", "cola"}
#            mom_bought = {"bread", "biscuits"}
#            dad_bought = {"milk", "coffee"}
#            in_fridge  = {"eggs", "chicken"}
#    Output: ["apples", "candy", "beef", "cola"]
mom_to_buy = {"apples", "candy", "chicken", "beef"}
dad_to_buy = {"candy", "beef", "bread", "cola"}
mom_bought = {"bread", "biscuits"}
dad_bought = {"milk", "coffee"}
in_fridge  = {"eggs", "chicken"}
print(shopping_list_cons(mom_to_buy, dad_to_buy, mom_bought, dad_bought, in_fridge))
['apples', 'candy', 'beef', 'cola']

Here we take advantage of some set operations to solve the problem. First, we consolidate both “to buy” lists taking their union, thus ensuring that all items are accounted for and none are double counted. Then we consolidate the already bought items by taking another union. Finally, we take the difference between what we need to buy and what is already bought or already in the fridge.

In the next problem will take a look at some applications of the symmetric difference.

# Problem 4.
# ----------
# Students at Chicken Soup High-school are offered two
# options for Calculus classes. There is an "Intro to Calculus"
# class and "Calculus" class. Some students take only 
# the first class through their time in high-school,
# some only take the second class, by placing out of the
# first one, and finally some students take both classes
# as a sequence.
# At the end of each year, average for these classes performance
# is computed to evaluate effectiveness of instructors. 
# The average is computed according to a strange formula,
# because statistics and performance department of 
# Chicken Soup high loves hard to understand numbers.
# You are provided with the formula and the list of 
# students in each class and their grades. To protect
# students' privacy you are given unique StudentIDs.
# 
# AVG = (avg grade for students who only took "Intro") +
#     + (avg grade for students who only took "Calculus") +
#     + 1.75 * (avg grade for students who took both)
#
# Write a function that takes in two dictionaries of 
# StudentIDs and grades and computes the average according
# to the given formula.
#
# Input: a dictionary for students who took "Intro to Calculus",
#        a dictionary for students who took "Calculus".
# 
# Output: the average grade.
import numpy as np
def chicken_soup_high_avg(intro, calc):
    one_course = set(intro.keys()) ^ set(calc.keys())
    both_courses = set(intro.keys()) & set(calc.keys())
    one_grades = []
    both_grades = []
    for studentID in one_course:
        try:
            one_grades.append(intro[studentID])
        except:
            pass
        
        try:
            one_grades.append(calc[studentID])
        except:
            pass
    
    for studentID in both_courses:
        both_grades.append(intro[studentID])
        both_grades.append(calc[studentID])
        
    avg = np.mean(one_grades) + 1.75 * np.mean(both_grades)
    
    return avg
# Problem 4.
# ----------
# Tests:
#
# 1. Input: {"1": 4.0, "2": 3.75, "3": 3.4},
#           {"2": 3.8, "3": 3.0, "4": 4.0}
#    Output: 10.10
print("{:.2f}".format(chicken_soup_high_avg({"1": 4.0, "2": 3.75, "3": 3.4},
                                            {"2": 3.8, "3": 3.0, "4": 4.0})))
10.10

Further remarks

While not strictly necessary, sets can make several classic graph algorithms easier to write and explain. We will cover those in the graphs section.

Downloads